Open hz20091942 opened 2 months ago
Is this on the latest patch version of vLLM?
version = 0.6.0
I have two GPU machines, and I installed CUDA 12.2 on the other machine, but the issue still persists.
looks like a machine / hardware problem. make sure you try it in different machines.
I used OpenMMLab's MMDetection library for multi-GPU pre-training on the COCO dataset, and this exception does not appear there; the training can be restarted many times. But after I use vLLM to load a model and then exit, MMDetection training can no longer start. So I still suspect there may be a bug in the vLLM library. Hope to receive an answer. Thanks.
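For reference, a minimal sketch of the kind of vLLM load-and-exit step described above (the model name and prompt are placeholders and the exact invocation is not shown in this thread; this assumes the Python `LLM` API with tensor parallelism across the same four GPUs):

```python
# Sketch only: load a model with vLLM across 4 GPUs, generate once, then exit.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",   # placeholder; the actual model is not named in the thread
    tensor_parallel_size=4,            # same four GPUs (0-3) later used for MMDetection training
    gpu_memory_utilization=0.8,        # matches the --gpu-memory-utilization 0.8 mentioned later in the thread
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
# After this script exits, subsequent NCCL-based jobs on GPUs 0-3 reportedly hang until reboot.
```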
(mmdet) root@llmgpu02:/home/hz/mmdetection# CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh ./configs/rtmdet/rtmdet_l_8xb32-300e_coco_.py 4
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/torch/distributed/launch.py:208: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
main()
W0913 13:26:23.831864 137595158689600 torch/distributed/run.py:779]
W0913 13:26:23.831864 137595158689600 torch/distributed/run.py:779] *****************************************
W0913 13:26:23.831864 137595158689600 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0913 13:26:23.831864 137595158689600 torch/distributed/run.py:779] *****************************************
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/optim/optimizer/zero_optimizer.py:11: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
from torch.distributed.optim import \
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/optim/optimizer/zero_optimizer.py:11: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
from torch.distributed.optim import \
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/optim/optimizer/zero_optimizer.py:11: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
from torch.distributed.optim import \
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/optim/optimizer/zero_optimizer.py:11: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
from torch.distributed.optim import \
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
09/13 13:26:29 - mmengine - INFO -
------------------------------------------------------------
System environment:
sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1733243605
GPU 0,1,2,3: NVIDIA A10
CUDA_HOME: /usr/local/cuda-12.2
NVCC: Cuda compilation tools, release 12.2, V12.2.91
GCC: gcc (Ubuntu 12.3.0-17ubuntu1) 12.3.0
PyTorch: 2.4.1
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.4
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 90.1
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.19.1
OpenCV: 4.10.0
MMEngine: 0.10.4
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 1733243605
Distributed launcher: pytorch
Distributed training: True
GPU number: 4
------------------------------------------------------------
…………
creating index...
index created!
09/13 13:27:07 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
09/13 13:27:07 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
09/13 13:27:07 - mmengine - INFO - Checkpoints will be saved to /home/hz/mmdetection/work_dirs/rtmdet_l_8xb32-300e_coco_.
/home/hz/mmdetection/mmdet/models/layers/se_layer.py:158: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/layers/se_layer.py:158: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/layers/se_layer.py:158: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/layers/se_layer.py:158: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/backbones/csp_darknet.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/backbones/csp_darknet.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/backbones/csp_darknet.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/backbones/csp_darknet.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/torch/functional.py:513: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1724789116784/work/aten/src/ATen/native/TensorShape.cpp:3609.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/torch/functional.py:513: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1724789116784/work/aten/src/ATen/native/TensorShape.cpp:3609.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/torch/functional.py:513: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1724789116784/work/aten/src/ATen/native/TensorShape.cpp:3609.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/torch/functional.py:513: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1724789116784/work/aten/src/ATen/native/TensorShape.cpp:3609.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
09/13 13:27:39 - mmengine - INFO - Epoch(train) [1][ 50/2566] base_lr: 1.9623e-04 lr: 1.9623e-04 eta: 5 days, 19:40:31 time: 0.6532 data_time: 0.0189 memory: 12354 loss: 2.3154 loss_cls: 0.9985 loss_bbox: 1.3169
09/13 13:28:11 - mmengine - INFO - Epoch(train) [1][ 100/2566] base_lr: 3.9643e-04 lr: 3.9643e-04 eta: 5 days, 16:43:34 time: 0.6257 data_time: 0.0032 memory: 12397 loss: 2.1314 loss_cls: 1.0023 loss_bbox: 1.1291
09/13 13:28:42 - mmengine - INFO - Epoch(train) [1][ 150/2566] base_lr: 5.9663e-04 lr: 5.9663e-04 eta: 5 days, 16:05:45 time: 0.6308 data_time: 0.0032 memory: 12633 loss: 1.9041 loss_cls: 0.9908 loss_bbox: 0.9133
09/13 13:29:14 - mmengine - INFO - Epoch(train) [1][ 200/2566] base_lr: 7.9683e-04 lr: 7.9683e-04 eta: 5 days, 15:58:32 time: 0.6345 data_time: 0.0032 memory: 12540 loss: 2.1094 loss_cls: 1.1339 loss_bbox: 0.9755
09/13 13:29:46 - mmengine - INFO - Epoch(train) [1][ 250/2566] base_lr: 9.9703e-04 lr: 9.9703e-04 eta: 5 days, 15:54:48 time: 0.6348 data_time: 0.0033 memory: 12980 loss: 1.9955 loss_cls: 1.1089 loss_bbox: 0.8866
09/13 13:30:18 - mmengine - INFO - Epoch(train) [1][ 300/2566] base_lr: 1.1972e-03 lr: 1.1972e-03 eta: 5 days, 15:55:26 time: 0.6364 data_time: 0.0032 memory: 12692 loss: 2.0213 loss_cls: 1.1336 loss_bbox: 0.8877
09/13 13:30:49 - mmengine - INFO - Epoch(train) [1][ 350/2566] base_lr: 1.3974e-03 lr: 1.3974e-03 eta: 5 days, 16:01:42 time: 0.6396 data_time: 0.0032 memory: 12683 loss: 2.0868 loss_cls: 1.1701 loss_bbox: 0.9166
09/13 13:31:21 - mmengine - INFO - Epoch(train) [1][ 400/2566] base_lr: 1.5976e-03 lr: 1.5976e-03 eta: 5 days, 15:55:55 time: 0.6332 data_time: 0.0032 memory: 12370 loss: 2.0334 loss_cls: 1.1523 loss_bbox: 0.8811
GPU info while MMDetection is training:
(base) root@llmgpu02:/data1/coco# nvidia-smi
Fri Sep 13 13:39:50 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02 Driver Version: 550.107.02 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10 Off | 00000000:14:00.0 Off | 0 |
| 0% 67C P0 131W / 150W | 14739MiB / 23028MiB | 64% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A10 Off | 00000000:15:00.0 Off | 0 |
| 0% 67C P0 128W / 150W | 13793MiB / 23028MiB | 76% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A10 Off | 00000000:18:00.0 Off | 0 |
| 0% 67C P0 127W / 150W | 13447MiB / 23028MiB | 72% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A10 Off | 00000000:1E:00.0 Off | 0 |
| 0% 67C P0 132W / 150W | 13919MiB / 23028MiB | 69% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A10 Off | 00000000:21:00.0 Off | 0 |
| 0% 35C P8 11W / 150W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A10 Off | 00000000:25:00.0 Off | 0 |
| 0% 34C P8 9W / 150W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A10 Off | 00000000:2D:00.0 Off | 0 |
| 0% 35C P8 10W / 150W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 55440 C ...al/miniconda3/envs/mmdet/bin/python 14730MiB |
| 1 N/A N/A 55441 C ...al/miniconda3/envs/mmdet/bin/python 13784MiB |
| 2 N/A N/A 55442 C ...al/miniconda3/envs/mmdet/bin/python 13438MiB |
| 3 N/A N/A 55443 C ...al/miniconda3/envs/mmdet/bin/python 13910MiB |
+-----------------------------------------------------------------------------------------+
(base) root@llmgpu02:/data1/coco#
> I used OpenMMLab's MMDetection library for multi-GPU pre-training on the COCO dataset, and this exception does not appear there; the training can be restarted many times. But after I use vLLM to load a model and then exit, MMDetection training can no longer start.
Which GPUs are you using for each process?
GPU 0-3
Are you using both on the same GPUs simultaneously? In the log that you showed earlier, MMDetection already uses 14 GB / 23 GB of memory. Given that you've set `--gpu-memory-utilization 0.8` for vLLM, there is not enough memory left on the GPUs to run both.
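A rough sketch of the arithmetic behind that concern, using the per-GPU numbers from the nvidia-smi output above (this only matters if both workloads share GPUs 0-3):

```python
# Back-of-the-envelope per-GPU memory check (values taken from the nvidia-smi output above).
total_mib = 23028          # A10 capacity reported by nvidia-smi
mmdet_used_mib = 14739     # GPU 0 usage while MMDetection is training
vllm_reserve_mib = 0.8 * total_mib   # vLLM tries to reserve 80% of each GPU

free_mib = total_mib - mmdet_used_mib
print(f"vLLM wants ~{vllm_reserve_mib:.0f} MiB, but only ~{free_mib} MiB is free")
# -> vLLM wants ~18422 MiB, but only ~8289 MiB is free
```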
No. Starting MMDetection training and using vLLM to load large models are independent; they are not run at the same time.
Did you try it on another machine?
Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.2.0-23ubuntu4) 13.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-44-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.4.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A10
GPU 1: NVIDIA A10
GPU 2: NVIDIA A10
GPU 3: NVIDIA A10
GPU 4: NVIDIA A10
GPU 5: NVIDIA A10
GPU 6: NVIDIA A10

Nvidia driver version: 550.107.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 104
On-line CPU(s) list: 0-103
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz
BIOS Model name: Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz CPU @ 2.2GHz
BIOS CPU family: 179
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 26
Socket(s): 2
Stepping: 6
CPU(s) scaling MHz: 24%
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 2.4 MiB (52 instances)
L1i cache: 1.6 MiB (52 instances)
L2 cache: 65 MiB (52 instances)
L3 cache: 78 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-25,52-77
NUMA node1 CPU(s): 26-51,78-103
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.68 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.44.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.0@32e7db25365415841ebc7c4215851743fbb1bad1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    PIX   PXB   PXB   PXB   PXB   PXB   0-25,52-77    0              N/A
GPU1  PIX    X    PXB   PXB   PXB   PXB   PXB   0-25,52-77    0              N/A
GPU2  PXB   PXB    X    PXB   PXB   PXB   PXB   0-25,52-77    0              N/A
GPU3  PXB   PXB   PXB    X    PXB   PXB   PXB   0-25,52-77    0              N/A
GPU4  PXB   PXB   PXB   PXB    X    PXB   PXB   0-25,52-77    0              N/A
GPU5  PXB   PXB   PXB   PXB   PXB    X    PXB   0-25,52-77    0              N/A
GPU6  PXB   PXB   PXB   PXB   PXB   PXB    X    0-25,52-77    0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
Model Input Dumps
No response
🐛 Describe the bug
After every reboot of my GPU machine, the NCCL test script from the official website [Getting Started / Debugging Tips] runs successfully. Running the scripts in the conda virtual environment, or running the container in Docker, also succeeds; all of them work. (A rough sketch of that NCCL check is shown below.)
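For reference, that NCCL check looks roughly like the following. This is a sketch based on the style of script in the vLLM debugging docs, not necessarily the exact version; it is launched with torchrun, e.g. `torchrun --nproc-per-node=4 test.py`:

```python
# test.py -- sketch of a simple NCCL all-reduce sanity check (one process per GPU via torchrun).
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# A 128-element all-reduce; the timeout log below shows WorkNCCL(... NumelIn=128 ...).
data = torch.ones(128, device=f"cuda:{local_rank}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
assert data.mean().item() == dist.get_world_size()
print(f"rank {dist.get_rank()}: all_reduce OK")
dist.destroy_process_group()
```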
After exiting the running model, or exiting the Docker container, stopping the container, or even shutting down the Docker service, when I try to run the scripts in the conda virtual environment or in Docker again as above, the model cannot load and the program gets stuck. The console output of running the container in Docker is as follows:
At this point, running the NCCL test script from [Getting Started / Debugging Tips] again also gets stuck, and its console output is as follows:
[rank1]:[E913 05:31:49.230512045 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E913 05:31:49.230536304 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E913 05:31:49.230544762 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E913 05:31:49.230718430 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E913 05:31:49.230739714 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E913 05:31:49.230746779 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E913 05:31:49.232083952 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78789b199f86 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x78784c3c88d2 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78784c3cf313 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x78784c3d16fc in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x787899edbbf4 in /usr/local/miniconda3/envs/vllmenv/bin/../lib/libstdc++.so.6)
frame #5: + 0x9ca94 (0x78789bc9ca94 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x78789bd29c3c in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E913 05:31:49.232247737 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600046 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78d59a167f86 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x78d54b3c88d2 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78d54b3cf313 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x78d54b3d16fc in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x78d598edbbf4 in /usr/local/miniconda3/envs/vllmenv/bin/../lib/libstdc++.so.6)
frame #5: + 0x9ca94 (0x78d59ac9ca94 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x78d59ad29c3c in /usr/lib/x86_64-linux-gnu/libc.so.6)
llmgpu01:57120:57158 [0] NCCL INFO comm 0x80450d0 rank 0 nranks 4 cudaDev 0 busId 14000 - Abort COMPLETE
[rank0]:[E913 05:31:49.400242426 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E913 05:31:49.400271008 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E913 05:31:49.400283907 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E913 05:31:49.403822047 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f1f883ddf86 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f1f395c88d2 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f1f395cf313 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1f395d16fc in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f1f870dbbf4 in /usr/local/miniconda3/envs/vllmenv/bin/../lib/libstdc++.so.6)
frame #5: + 0x9ca94 (0x7f1f8909ca94 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x7f1f89129c3c in /usr/lib/x86_64-linux-gnu/libc.so.6)
W0913 05:31:49.562000 125815160129344 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 57120 closing signal SIGTERM
W0913 05:31:49.563000 125815160129344 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 57121 closing signal SIGTERM
W0913 05:31:49.563000 125815160129344 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 57122 closing signal SIGTERM
E0913 05:31:49.878000 125815160129344 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 3 (pid: 57123) of binary: /usr/local/miniconda3/envs/vllmenv/bin/python
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/vllmenv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
test.py FAILED
Failures: