Closed by venki-lfc 5 months ago
@venki-lfc when stalled, what does nvidia-smi report for GPU %load and memory usage?
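For reference, a minimal sketch of collecting the numbers asked for above. The query flags are standard nvidia-smi options; the parser is just an illustration of turning the CSV output into per-GPU load/memory tuples.

```python
import shutil
import subprocess

# Standard nvidia-smi query for GPU load and memory usage.
QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_smi(text: str):
    """Parse CSV rows into (index, util %, mem used MiB, mem total MiB) tuples."""
    rows = []
    for line in text.strip().splitlines():
        idx, util, used, total = (field.strip() for field in line.split(","))
        rows.append((int(idx), int(util), int(used), int(total)))
    return rows

# Only poll if the tool is actually present on this machine.
if shutil.which("nvidia-smi"):
    out = subprocess.run(QUERY, capture_output=True, text=True).stdout
    for idx, util, used, total in parse_smi(out):
        print(f"GPU{idx}: {util}% load, {used}/{total} MiB")
```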
This is how the GPU looks during the stall.
This image shows the process names
Same here. Both for 0.3.3 and 0.4.0
Thanks @venki-lfc this matches my experience, 0.4.0-post1 on only 4 specific GPUs on an 8x H100-PCIe system:
I only see this behavior for these 4 specific GPUs on the system; other configurations (e.g. 1, 2, or 8 GPU) appear unaffected even when they utilize the same hardware.
I suspect there's some sort of NCCL race/deadlock occurring, triggered by differences in PCIe bus layout for (GPU 0/1/2/3, with 0/1 and 2/3 separated by multiple PCI hops) and (GPU 4/5/6/7, 3 of which share a common PCI switch).
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 SYS SYS SYS SYS SYS SYS 0-95 0 N/A
GPU1 NV12 X SYS SYS SYS SYS SYS SYS 0-95 0 N/A
GPU2 SYS SYS X NV12 SYS SYS SYS SYS 0-95 0 N/A
GPU3 SYS SYS NV12 X SYS SYS SYS SYS 0-95 0 N/A
GPU4 SYS SYS SYS SYS X SYS SYS NV12 96-191 1 N/A
GPU5 SYS SYS SYS SYS SYS X NV12 PIX 96-191 1 N/A
GPU6 SYS SYS SYS SYS SYS NV12 X PIX 96-191 1 N/A
GPU7 SYS SYS SYS SYS NV12 PIX PIX X 96-191 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
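The asymmetry suspected above can be made concrete by counting SYS links per tensor-parallel group. The sketch below hardcodes the `nvidia-smi topo -m` matrix shown above (all pairs not listed are SYS) and counts pairs whose traffic would cross PCIe plus the inter-NUMA interconnect.

```python
# Link types transcribed from the nvidia-smi topo -m matrix above;
# every GPU pair not listed here is SYS.
LINKS = {
    ("GPU0", "GPU1"): "NV12", ("GPU2", "GPU3"): "NV12",
    ("GPU4", "GPU7"): "NV12", ("GPU5", "GPU6"): "NV12",
    ("GPU5", "GPU7"): "PIX",  ("GPU6", "GPU7"): "PIX",
}

def link(a, b, links=LINKS):
    """Link type between two GPUs, defaulting to SYS (multi-hop)."""
    return links.get((a, b)) or links.get((b, a)) or "SYS"

def sys_pairs(gpus):
    """All pairs in a group that traverse PCIe + the SMP interconnect."""
    return [(a, b) for i, a in enumerate(gpus) for b in gpus[i + 1:]
            if link(a, b) == "SYS"]
```

For the GPU 0/1/2/3 group this yields four SYS pairs versus two for GPU 4/5/6/7, consistent with the hang appearing only on the first group.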
-+-[0000:e0]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14a4
+-[0000:c0]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14a4
| +-01.1-[c1-c8]----00.0-[c2-c8]--+-00.0-[c3]----00.0 NVIDIA Corporation GH100 [H100 PCIe] GPU5 NVLINK 4
| | +-01.0-[c4]----00.0 NVIDIA Corporation GH100 [H100 PCIe] GPU6 NVLINK 4
| | +-02.0-[c5]----00.0 NVIDIA Corporation GH100 [H100 PCIe] GPU7 NVLINK 3
| | +-03.0-[c6]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
| | +-04.0-[c7]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
| | \-1f.0-[c8]----00.0 Broadcom / LSI PCIe Switch management endpoint
+-[0000:a0]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14a4
+-[0000:80]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14a4
| +-01.1-[81-87]----00.0-[82-87]--+-00.0-[83]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
| | +-01.0-[84]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
| | +-02.0-[85]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| | +-03.0-[86]--+-00.0 Intel Corporation Ethernet Controller E810-C for QSFP
| | | \-00.1 Intel Corporation Ethernet Controller E810-C for QSFP
| | \-04.0-[87]----00.0 NVIDIA Corporation GH100 [H100 PCIe] GPU4 NVLINK 3
| \-07.1-[88]--+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14ac
| \-00.5 Advanced Micro Devices, Inc. [AMD] Genoa CCP/PSP 4.0 Device
+-[0000:60]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14a4
+-[0000:40]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14a4
| +-01.1-[41-48]----00.0-[42-48]--+-00.0-[43]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| | +-01.0-[44]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| | +-02.0-[45]--+-00.0 Intel Corporation Ethernet Controller E810-C for QSFP
| | | \-00.1 Intel Corporation Ethernet Controller E810-C for QSFP
| | +-03.0-[46]----00.0 NVIDIA Corporation GH100 [H100 PCIe] GPU2 NVLINK 2
| | +-04.0-[47]----00.0 NVIDIA Corporation GH100 [H100 PCIe] GPU3 NVLINK 2
| | \-1f.0-[48]----00.0 Broadcom / LSI PCIe Switch management endpoint
\-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14a4
+-01.1-[01-07]----00.0-[02-07]--+-00.0-[03]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| +-01.0-[04]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| +-02.0-[05]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| +-03.0-[06]----00.0 NVIDIA Corporation GH100 [H100 PCIe] GPU0 NVLINK 1
| \-04.0-[07]----00.0 NVIDIA Corporation GH100 [H100 PCIe] GPU1 NVLINK 1
+-05.1-[08]--+-00.0 Intel Corporation Ethernet Controller X710 for 10GBASE-T
| \-00.1 Intel Corporation Ethernet Controller X710 for 10GBASE-T
So far we're seeing this on AMD and Intel CPUs and Ada/Hopper GPUs. (My collect_env output is @ #3892). @alexanderfrey what hardware are you using?
Testing various releases of the stock Docker containers with Llama2-70B, I see:
v0.4.0: Hangs (nvidia-nccl-cu12 2.18.1; libnccl2 2.17.1-1+cuda12.1)
v0.3.3: OK (nvidia-nccl-cu12 2.18.1; libnccl2 2.17.1-1+cuda12.1)
v0.3.2: OK
v0.2.7: OK
It was straightforward to test by swapping containers, but I won't have time to perform a full bisect/rebuild 0.3.3->0.4.0 for a few weeks.
@agt did you change the nccl version via VLLM_NCCL_SO_PATH? Normally people don't use nccl 2.17.1.
@youkaichao That's the version shipped in https://hub.docker.com/r/vllm/vllm-openai - happy to swap in a new version via VLLM_NCCL_SO_PATH, which would you suggest?
@agt did you change the nccl version via VLLM_NCCL_SO_PATH? Normally people don't use nccl 2.17.1.
I just did a pip install vllm on the pytorch/pytorch:2.1.2-cuda12.1-cudnn8-devel image, so nccl 2.17.1 came along with it.
I just found a solution that works!
I set the env variable as export NCCL_P2P_DISABLE=1
from langchain_community.llms import VLLM
llm = VLLM(model="./mixtral-8x7B-instruct-v0.1/snapshots/1e637f2d7cb0a9d6fb1922f305cb784995190a83", tensor_parallel_size=4, trust_remote_code=True, enforce_eager=True)
This works for me now :)
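The workaround above depends on the variable being set before anything touches CUDA/NCCL. A minimal sketch of the same recipe, with the import deferred so the environment variable is guaranteed to be in place first (the model path is whatever local snapshot you are using):

```python
import os

# NCCL reads this at init time, so set it before the first CUDA/NCCL touch,
# i.e. before importing or constructing the LLM.
os.environ["NCCL_P2P_DISABLE"] = "1"

def build_llm(model_path: str):
    """Deferred import so the env var is in place first; requires GPUs + vLLM."""
    from langchain_community.llms import VLLM
    return VLLM(model=model_path, tensor_parallel_size=4,
                trust_remote_code=True, enforce_eager=True)
```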
@venki-lfc glad to hear! Disabling P2P will hurt performance, so I'd like to continue pursuing - want to keep this issue open, or should I create a new one?
I guess we can keep the issue open :) Obviously mine's just a workaround and doesn't address the root cause of the issue.
@agt did you change the nccl version via VLLM_NCCL_SO_PATH? Normally people don't use nccl 2.17.1.
Ahh - 2.17.1 was the system NCCL installed under /usr/lib; the PyPI version was hiding under 'libnccl.so.2' and is indeed NCCL 2.18.1+cuda12.1. That's consistent with the PyTorch 2.1.2 requirements.
I just found a solution that works! I set the env variable as export NCCL_P2P_DISABLE=1
Good job! nccl is quite a black box, and we have a hard time with it :(
Thanks @venki-lfc this matches my experience, 0.4.0-post1 on only 4 specific GPUs on an 8x H100-PCIe system: I only see this behavior for these 4 specific GPUs on the system; other configurations (e.g. 1, 2, or 8 GPU) appear unaffected even when they utilize the same hardware.
I suspect there's some sort of NCCL race/deadlock occurring, triggered by differences in PCIe bus layout for (GPU 0/1/2/3, with 0/1 and 2/3 separated by multiple PCI hops) and (GPU 4/5/6/7, 3 of which share a common PCI switch).
I've tested on both 8xH100 and 8xA100-40GB and cannot seem to load a model at even tensor-parallel-size=1. I've also tried disabling P2P, but that still doesn't help. Any suggestions? Here's my env:
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31
Python version: 3.11.8 (main, Feb 25 2024, 16:41:26) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1048-oracle-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-254
Off-line CPU(s) list: 255
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7J13 64-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2550.000
CPU max MHz: 3673.0950
CPU min MHz: 1500.0000
BogoMIPS: 4900.16
Virtualization: AMD-V
L1d cache: 2 MiB
L1i cache: 2 MiB
L2 cache: 32 MiB
L3 cache: 256 MiB
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-254
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] torchvision==0.17.1+cu121
[pip3] triton==2.1.0
[conda] Could not collect
I suspect there's some sort of NCCL race/deadlock occurring, triggered by differences in PCIe bus layout for (GPU 0/1/2/3, with 0/1 and 2/3 separated by multiple PCI hops) and (GPU 4/5/6/7, 3 of which share a common PCI switch). I've tested on both 8xH100 and 8xA100-40GB and cannot seem to load a model at even tensor-parallel-size=1. I've also tried disabling P2P, but that still doesn't help. Any suggestions?
Disabling CUDA graphs (--enforce-eager) seems to sidestep the problem for me, like NCCL_P2P_DISABLE=1, at the cost of reduced performance. There have been lots of changes to PyTorch, NCCL, etc. since 0.4.0.post1; I hope to build the current code this week and test with CUDA graphs enabled. [Edit: the apparent fix was a one-time fluke; see the next comment.]
@youkaichao Thank you for #4079 - throwing that into my 0.4.0.post1 container, I found that the 3 non-lead workers were all stuck within CustomAllreduce._gather_ipc_meta(), despite the code's intention to disable custom all-reduce because "it's not supported on more than two PCIe-only GPUs."
Last call logged by each process before the stall (full logs in vllm_trace_frame_for_process.tgz):
==> vllm_trace_frame_for_process_1_thread_139973314331072_at_2024-04-15_17:58:24.978956.log <==
2024-04-15 17:59:04.732274 Return from init_device in /workspace/vllm/worker/worker.py:104
==> vllm_trace_frame_for_process_1004_thread_139786598171072_at_2024-04-15_17:58:33.965163.log <==
2024-04-15 17:58:43.840296 Return from get_node_and_gpu_ids in /workspace/vllm/engine/ray_utils.py:53
==> vllm_trace_frame_for_process_1113_thread_140163469054400_at_2024-04-15_17:58:36.739553.log <==
2024-04-15 17:59:04.981153 Call to _gather_ipc_meta in /workspace/vllm/model_executor/parallel_utils/custom_all_reduce.py:222
==> vllm_trace_frame_for_process_1228_thread_139805130449344_at_2024-04-15_17:58:39.506931.log <==
2024-04-15 17:59:05.511878 Call to _gather_ipc_meta in /workspace/vllm/model_executor/parallel_utils/custom_all_reduce.py:222
==> vllm_trace_frame_for_process_1333_thread_140525308137920_at_2024-04-15_17:58:42.339446.log <==
2024-04-15 17:59:04.952552 Call to _gather_ipc_meta in /workspace/vllm/model_executor/parallel_utils/custom_all_reduce.py:222
Launching with --disable-custom-all-reduce has been solid across dozens of restarts in various configurations, so for me the mystery is now why the workers end up in that code path.
@venki-lfc @nidhishs would you mind checking whether adding that flag fixes things for you? (I believe disabling custom all-reduce should impact performance less than disabling P2P altogether.)
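For anyone trying this, a sketch of the narrower workaround via the Python API rather than the CLI flag; the disable_custom_all_reduce kwarg is the vLLM 0.4.x counterpart of --disable-custom-all-reduce, and the model name here is only an example:

```python
# Sketch: disable only custom all-reduce, keeping NCCL P2P enabled.
def make_llm_kwargs(tp_size: int = 4) -> dict:
    return dict(
        model="meta-llama/Llama-2-70b-hf",  # example model, substitute your own
        tensor_parallel_size=tp_size,
        disable_custom_all_reduce=True,     # skip the CustomAllreduce IPC handshake
    )

def build_llm(**overrides):
    """Deferred import: actually running this requires GPUs and an installed vLLM."""
    from vllm import LLM
    return LLM(**{**make_llm_kwargs(), **overrides})
```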
Hello @agt, --disable-custom-all-reduce did not work for me. So far I can only run the model on multiple GPUs via export NCCL_P2P_DISABLE=1 and by setting --enforce-eager.
--disable-custom-all-reduce did not work for me. I can only run the model on multiple GPUs via export NCCL_P2P_DISABLE=1 and by setting --enforce-eager so far.
Hi @venki-lfc, sorry to hear that didn't work! 0.4.1 will include an option to log all function calls, perhaps doing so will identify the culprit as it did for me. I'd be happy to review if you post that info in a new bug.
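The call-logging option mentioned above appears to be the switch that produced the vllm_trace_frame_for_process_*.log files quoted earlier; a minimal sketch of enabling it, assuming the 0.4.1-era environment variable name (check your version's docs for the exact spelling):

```python
import os

# Assumed 0.4.1 debugging switch: log every function call per worker process.
# Must be set before the vLLM engine starts.
os.environ["VLLM_TRACE_FUNCTION"] = "1"
```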
I am using microsoft/Phi-3-vision-128k-instruct and it gives an out-of-memory error. But if I use facebook/opt-13b, it works fine even though it is a much bigger model.
Command with output
$ vllm serve microsoft/Phi-3-vision-128k-instruct --tensor-parallel-size 4 --trust-remote-code --dtype=half
INFO 07-30 18:05:10 api_server.py:286] vLLM API server version 0.5.3.post1
INFO 07-30 18:05:10 api_server.py:287] args: Namespace(model_tag='microsoft/Phi-3-vision-128k-instruct', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, model='microsoft/Phi-3-vision-128k-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, 
speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7950f1021750>)
WARNING 07-30 18:05:11 config.py:1433] Casting torch.bfloat16 to torch.float16.
INFO 07-30 18:05:15 config.py:723] Defaulting to use mp for distributed inference
WARNING 07-30 18:05:15 arg_utils.py:776] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 07-30 18:05:15 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='microsoft/Phi-3-vision-128k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-vision-128k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/Phi-3-vision-128k-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 07-30 18:05:15 multiproc_gpu_executor.py:60] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-30 18:05:15 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 07-30 18:05:15 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-30 18:05:15 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:15 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:15 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:15 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:15 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:15 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:15 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:18 utils.py:774] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:18 utils.py:774] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:18 utils.py:774] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:18 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-30 18:05:18 utils.py:774] Found nccl from library libnccl.so.2
INFO 07-30 18:05:18 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 07-30 18:05:19 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=690776) WARNING 07-30 18:05:19 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=690777) WARNING 07-30 18:05:19 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=690775) WARNING 07-30 18:05:19 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 07-30 18:05:19 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x79519cefdc90>, local_subscribe_port=45175, remote_subscribe_port=None)
INFO 07-30 18:05:19 model_runner.py:720] Starting to load model microsoft/Phi-3-vision-128k-instruct...
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:19 model_runner.py:720] Starting to load model microsoft/Phi-3-vision-128k-instruct...
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:19 model_runner.py:720] Starting to load model microsoft/Phi-3-vision-128k-instruct...
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:19 model_runner.py:720] Starting to load model microsoft/Phi-3-vision-128k-instruct...
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:19 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:19 selector.py:54] Using XFormers backend.
INFO 07-30 18:05:19 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-30 18:05:19 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:19 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:19 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:19 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:19 selector.py:54] Using XFormers backend.
INFO 07-30 18:05:19 weight_utils.py:224] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:19 weight_utils.py:224] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:19 weight_utils.py:224] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:19 weight_utils.py:224] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.12it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.33it/s]
INFO 07-30 18:05:21 model_runner.py:732] Loading model weights took 2.1571 GB
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:21 model_runner.py:732] Loading model weights took 2.1571 GB
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:22 model_runner.py:732] Loading model weights took 2.1571 GB
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:22 model_runner.py:732] Loading model weights took 2.1571 GB
lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
warnings.warn(
(VllmWorkerProcess pid=690775) lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(VllmWorkerProcess pid=690775) warnings.warn(
(VllmWorkerProcess pid=690777) lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(VllmWorkerProcess pid=690777) warnings.warn(
(VllmWorkerProcess pid=690776) lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(VllmWorkerProcess pid=690776) warnings.warn(
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: CUDA out of memory. Tried to allocate 8.27 GiB. GPU has a total capacity of 15.56 GiB of which 6.98 GiB is free. Including non-PyTorch memory, this process has 8.58 GiB memory in use. Of the allocated memory 7.80 GiB is allocated by PyTorch, and 590.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), Traceback (most recent call last):
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/worker/worker.py", line 179, in determine_num_available_blocks
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] self.model_runner.profile_run()
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/worker/model_runner.py", line 935, in profile_run
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/worker/model_runner.py", line 1354, in execute_model
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/phi3v.py", line 529, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] vision_embeddings = self.vision_embed_tokens(
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/phi3v.py", line 163, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] img_features = self.get_img_features(pixel_values)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/phi3v.py", line 87, in get_img_features
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] img_feature = self.img_processor(img_embeds)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/clip.py", line 289, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self.vision_model(pixel_values=pixel_values)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/clip.py", line 267, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] hidden_states = self.encoder(inputs_embeds=hidden_states)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/clip.py", line 235, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] hidden_states = encoder_layer(hidden_states)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/clip.py", line 195, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] hidden_states, _ = self.self_attn(hidden_states=hidden_states)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 280, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.27 GiB. GPU has a total capacity of 15.56 GiB of which 6.98 GiB is free. Including non-PyTorch memory, this process has 8.58 GiB memory in use. Of the allocated memory 7.80 GiB is allocated by PyTorch, and 590.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: CUDA out of memory. Tried to allocate 8.27 GiB. GPU has a total capacity of 15.56 GiB of which 6.98 GiB is free. Including non-PyTorch memory, this process has 8.58 GiB memory in use. Of the allocated memory 7.80 GiB is allocated by PyTorch, and 590.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), Traceback (most recent call last):
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/worker/worker.py", line 179, in determine_num_available_blocks
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] self.model_runner.profile_run()
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/worker/model_runner.py", line 935, in profile_run
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/worker/model_runner.py", line 1354, in execute_model
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/phi3v.py", line 529, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] vision_embeddings = self.vision_embed_tokens(
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/phi3v.py", line 163, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] img_features = self.get_img_features(pixel_values)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/phi3v.py", line 87, in get_img_features
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] img_feature = self.img_processor(img_embeds)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/clip.py", line 289, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self.vision_model(pixel_values=pixel_values)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/clip.py", line 267, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] hidden_states = self.encoder(inputs_embeds=hidden_states)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/clip.py", line 235, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] hidden_states = encoder_layer(hidden_states)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "/vllm/vllm/model_executor/models/clip.py", line 195, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] hidden_states, _ = self.self_attn(hidden_states=hidden_states)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] File "lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 280, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.27 GiB. GPU has a total capacity of 15.56 GiB of which 6.98 GiB is free. Including non-PyTorch memory, this process has 8.58 GiB memory in use. Of the allocated memory 7.80 GiB is allocated by PyTorch, and 590.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]: File "/bin/vllm", line 8, in <module>
[rank0]: sys.exit(main())
[rank0]: File "/vllm/vllm/scripts.py", line 149, in main
[rank0]: args.dispatch_function(args)
[rank0]: File "/vllm/vllm/scripts.py", line 29, in serve
[rank0]: asyncio.run(run_server(args))
[rank0]: File "/lib/python3.10/asyncio/runners.py", line 44, in run
[rank0]: return loop.run_until_complete(main)
[rank0]: File "/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank0]: return future.result()
[rank0]: File "/vllm/vllm/entrypoints/openai/api_server.py", line 289, in run_server
[rank0]: app = await init_app(args, llm_engine)
[rank0]: File "/vllm/vllm/entrypoints/openai/api_server.py", line 229, in init_app
[rank0]: if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]: File "/vllm/vllm/engine/async_llm_engine.py", line 470, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/vllm/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/vllm/vllm/engine/async_llm_engine.py", line 551, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/vllm/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/vllm/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/vllm/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]: num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]: File "/vllm/vllm/executor/multiproc_gpu_executor.py", line 195, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/vllm/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/vllm/vllm/worker/model_runner.py", line 935, in profile_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/vllm/vllm/worker/model_runner.py", line 1354, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/vllm/vllm/model_executor/models/phi3v.py", line 529, in forward
[rank0]: vision_embeddings = self.vision_embed_tokens(
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/vllm/vllm/model_executor/models/phi3v.py", line 163, in forward
[rank0]: img_features = self.get_img_features(pixel_values)
[rank0]: File "/vllm/vllm/model_executor/models/phi3v.py", line 87, in get_img_features
[rank0]: img_feature = self.img_processor(img_embeds)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/vllm/vllm/model_executor/models/clip.py", line 289, in forward
[rank0]: return self.vision_model(pixel_values=pixel_values)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/vllm/vllm/model_executor/models/clip.py", line 267, in forward
[rank0]: hidden_states = self.encoder(inputs_embeds=hidden_states)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/vllm/vllm/model_executor/models/clip.py", line 235, in forward
[rank0]: hidden_states = encoder_layer(hidden_states)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/vllm/vllm/model_executor/models/clip.py", line 195, in forward
[rank0]: hidden_states, _ = self.self_attn(hidden_states=hidden_states)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 280, in forward
[rank0]: attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.27 GiB. GPU
ERROR 07-30 18:05:25 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 690777 died, exit code: -15
INFO 07-30 18:05:25 multiproc_worker_utils.py:123] Killing local vLLM worker processes
/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
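The "Tried to allocate 8.27 GiB" failure happens inside `torch.bmm` in the CLIP vision tower, which materializes the full attention-weights tensor of shape `[batch * num_heads, seq_len, seq_len]`. A back-of-envelope sketch of why `profile_run` (which simulates a worst-case batch) can blow past available memory — assuming fp16 activations and ViT-L/14 at 336px (`seq_len = (336/14)**2 + 1 = 577`, 16 heads); these dimensions and the crop counts below are illustrative assumptions, not taken from the logs:

```python
# Cost of the attn_weights tensor allocated by torch.bmm in CLIP attention,
# shape [num_crops * num_heads, seq_len, seq_len].
# Assumed dims: ViT-L/14 @ 336px -> seq_len = (336 // 14) ** 2 + 1 = 577,
# 16 attention heads, fp16 (2 bytes per element).
def attn_weights_bytes(num_crops: int, num_heads: int = 16,
                       seq_len: int = 577, dtype_bytes: int = 2) -> int:
    return num_crops * num_heads * seq_len * seq_len * dtype_bytes

per_crop = attn_weights_bytes(1)  # ~10.2 MiB per image crop
print(f"{per_crop / 2**20:.1f} MiB per crop")

# profile_run batches a worst-case number of multimodal inputs; with
# hundreds of image crops in flight, the attention weights alone reach
# multiple GiB, matching the multi-GiB allocation in the traceback:
print(f"{attn_weights_bytes(800) / 2**30:.2f} GiB for 800 crops")
```

This is only arithmetic on one intermediate tensor; real usage includes weights, KV cache, and allocator overhead on top.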
My environment
PyTorch version: 2.3.1
OS: Ubuntu 22.04.4 LTS
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
CMake version: version 3.30.1
Libc version: (Ubuntu GLIBC 2.35-0ubuntu3.8)
Python version: 3.10.12
Is CUDA available: True
CUDA runtime version: 12.4
Nvidia driver version: 550.54.14
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
Neuron SDK Version: N/A
vLLM Version: -e git+https://github.com/vllm-project/vllm.git@c66c7f86aca956014d9ec6cc7a3e6001037e4655#egg=vllm
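For the OOM during `determine_num_available_blocks` in the logs above, the error message itself suggests `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A minimal sketch of how one might combine that with smaller engine limits — the model name and specific values here are hypothetical examples, not known-good settings:

```python
import os

# The allocator hint from the OOM message must be set before CUDA is
# initialized, i.e. before the engine is imported/constructed.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Hypothetical engine arguments: lowering gpu_memory_utilization leaves
# headroom for non-PyTorch memory (NCCL buffers, CUDA context), and a
# smaller max_num_seqs shrinks the worst-case batch that profile_run
# simulates.
engine_args = dict(
    model="microsoft/Phi-3-vision-128k-instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.85,  # default is 0.90
    max_num_seqs=64,              # reduces the profiling batch size
)
```

The same knobs exist as CLI flags (`--gpu-memory-utilization`, `--max-num-seqs`) when launching via `vllm serve`.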
🐛 Describe the bug
When I try to load the model using the following command
the model never loads. I get the following output on the CLI and then nothing; loading never finishes.
I can see that the two GPU devices are occupied while the above message is displayed, but nothing else happens. The line of code never finishes executing.
When I try to load the model using only one GPU, the loading process is smooth.
Below is a screenshot of the successful loading message:
The LLM inference is quite fast and everything works as expected.
So the problem clearly lies with multiple GPUs. This issue happens with all models; it is not particular to just one organisation. Can someone please help me with this? What am I doing wrong? Is it something due to NCCL, or is something missing? Any help is appreciated, thanks :)
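Since the stall only appears with tensor parallelism, a first step is to make NCCL tell us where it hangs. A hedged sketch of common diagnostic settings (these are standard NCCL environment variables; whether `NCCL_P2P_DISABLE` actually fixes this particular PCIe-topology deadlock is an assumption to be tested):

```python
import os

# NCCL_DEBUG=INFO makes NCCL log its topology and ring/tree setup at init,
# which usually reveals which collective or transport the workers are stuck
# in. Must be set before the engine (and therefore NCCL) initializes.
os.environ["NCCL_DEBUG"] = "INFO"

# Common workaround when peer-to-peer transfers deadlock on certain PCIe
# layouts: disable P2P so NCCL falls back to shared-memory copies (slower,
# but it distinguishes a P2P problem from a general NCCL problem).
os.environ["NCCL_P2P_DISABLE"] = "1"

# Then launch as usual, e.g.:
#   vllm serve <model> --tensor-parallel-size 2
```

The same variables can be exported in the shell before `vllm serve`; if the stall disappears with `NCCL_P2P_DISABLE=1`, that points at a P2P/topology issue like the PCIe-hop differences described earlier in this thread.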