vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: LLM is not getting loaded on multiple GPUs but works fine on a single GPU #3974

Closed venki-lfc closed 5 months ago

venki-lfc commented 5 months ago

Your current environment

PyTorch version: 2.1.2
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.1
Libc version: glibc-2.31

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA RTX 6000 Ada Generation
GPU 1: NVIDIA RTX 6000 Ada Generation
GPU 2: NVIDIA RTX 6000 Ada Generation
GPU 3: NVIDIA RTX 6000 Ada Generation

Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 57 bits virtual
CPU(s):                             96
On-line CPU(s) list:                0-95
Thread(s) per core:                 2
Core(s) per socket:                 24
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              143
Model name:                         Intel(R) Xeon(R) Gold 5418Y
Stepping:                           8
Frequency boost:                    enabled
CPU MHz:                            989.502
CPU max MHz:                        2001.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4000.00
Virtualization:                     VT-x
L1d cache:                          2.3 MiB
L1i cache:                          1.5 MiB
L2 cache:                           96 MiB
L3 cache:                           90 MiB
NUMA node0 CPU(s):                  0-23,48-71
NUMA node1 CPU(s):                  24-47,72-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.2
[pip3] onnxruntime==1.17.1
[pip3] torch==2.1.2
[pip3] torchaudio==2.1.2
[pip3] torchelastic==0.2.2
[pip3] torchvision==0.16.2
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
[conda] mkl                       2023.1.0         h213fc3f_46344  
[conda] mkl-service               2.4.0           py310h5eee18b_1  
[conda] mkl_fft                   1.3.8           py310h5eee18b_0  
[conda] mkl_random                1.2.4           py310hdb19cb5_0  
[conda] numpy                     1.26.2          py310h5f9d8c6_0  
[conda] numpy-base                1.26.2          py310hb5e798b_0  
[conda] pytorch                   2.1.2           py3.10_cuda12.1_cudnn8.9.2_0    pytorch
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                2.1.2               py310_cu121    pytorch
[conda] torchelastic              0.2.2                    pypi_0    pypi
[conda] torchtriton               2.1.0                     py310    pytorch
[conda] torchvision               0.16.2              py310_cu121    pytorch
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     SYS     SYS     24-47,72-95     1               N/A
GPU1    PIX      X      SYS     SYS     24-47,72-95     1               N/A
GPU2    SYS     SYS      X      PIX     24-47,72-95     1               N/A
GPU3    SYS     SYS     PIX      X      24-47,72-95     1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

When I try to load the model by using the following command

from vllm import LLM
llm = LLM(model="./mistral/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/", tensor_parallel_size=2)

The model never loads; I get the following information on the CLI and nothing more. The loading never finishes.

[screenshot: CLI output while the load stalls]

I can see that the two GPU devices are occupied while the above message is displayed, but nothing else happens; the line of code never finishes executing.

When I try to load the model using only one GPU, the loading process is smooth.

from vllm import LLM
llm = LLM(model="./mistral/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/", tensor_parallel_size=1)

Below is a screenshot of the successful loading message: [screenshot: successful single-GPU load]

The LLM inference is quite fast and everything works as expected.

So the problem clearly lies with multiple GPUs. This issue happens with all models, not just those from one organisation. Can someone please help me in this regard? What am I doing wrong? Is it something due to nccl, or is something missing? Any help is appreciated, thanks :)

agt commented 5 months ago

@venki-lfc when stalled, what does nvidia-smi report for GPU %load and memory usage?

venki-lfc commented 5 months ago

This is how the GPUs look during the stall: [screenshot: nvidia-smi while stalled]

This image shows the process names: [screenshot: nvidia-smi process list]

alexanderfrey commented 5 months ago

Same here, on both 0.3.3 and 0.4.0.

agt commented 5 months ago

Thanks @venki-lfc, this matches my experience with 0.4.0.post1 on only 4 specific GPUs of an 8x H100-PCIe system:

[screenshot]

I only see this behavior for these 4 specific GPUs on the system; other configurations (e.g. 1, 2, or 8 GPU) appear unaffected even when they utilize the same hardware.

I suspect there's some sort of NCCL race/deadlock occurring, triggered by the differing PCIe bus layouts of GPUs 0/1/2/3 (with 0/1 and 2/3 separated by multiple PCIe hops) and GPUs 4/5/6/7 (three of which share a common PCIe switch).


    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV12    SYS SYS SYS SYS SYS SYS 0-95    0       N/A
GPU1    NV12     X  SYS SYS SYS SYS SYS SYS 0-95    0       N/A
GPU2    SYS SYS  X  NV12    SYS SYS SYS SYS 0-95    0       N/A
GPU3    SYS SYS NV12     X  SYS SYS SYS SYS 0-95    0       N/A
GPU4    SYS SYS SYS SYS  X  SYS SYS NV12    96-191  1       N/A
GPU5    SYS SYS SYS SYS SYS  X  NV12    PIX 96-191  1       N/A
GPU6    SYS SYS SYS SYS SYS NV12     X  PIX 96-191  1       N/A
GPU7    SYS SYS SYS SYS NV12    PIX PIX  X  96-191  1       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
-+-[0000:e0]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14a4
 +-[0000:c0]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14a4
 |           +-01.1-[c1-c8]----00.0-[c2-c8]--+-00.0-[c3]----00.0  NVIDIA Corporation GH100 [H100 PCIe]  GPU5 NVLINK 4
 |           |                               +-01.0-[c4]----00.0  NVIDIA Corporation GH100 [H100 PCIe]  GPU6 NVLINK 4
 |           |                               +-02.0-[c5]----00.0  NVIDIA Corporation GH100 [H100 PCIe]  GPU7 NVLINK 3
 |           |                               +-03.0-[c6]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
 |           |                               +-04.0-[c7]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
 |           |                               \-1f.0-[c8]----00.0  Broadcom / LSI PCIe Switch management endpoint
 +-[0000:a0]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14a4
 +-[0000:80]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14a4
 |           +-01.1-[81-87]----00.0-[82-87]--+-00.0-[83]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
 |           |                               +-01.0-[84]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
 |           |                               +-02.0-[85]----00.0  Broadcom / LSI Virtual PCIe Placeholder Endpoint
 |           |                               +-03.0-[86]--+-00.0  Intel Corporation Ethernet Controller E810-C for QSFP
 |           |                               |            \-00.1  Intel Corporation Ethernet Controller E810-C for QSFP
 |           |                               \-04.0-[87]----00.0  NVIDIA Corporation GH100 [H100 PCIe]  GPU4 NVLINK 3
 |           \-07.1-[88]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14ac
 |                        \-00.5  Advanced Micro Devices, Inc. [AMD] Genoa CCP/PSP 4.0 Device
 +-[0000:60]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14a4
 +-[0000:40]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14a4
 |           +-01.1-[41-48]----00.0-[42-48]--+-00.0-[43]----00.0  Broadcom / LSI Virtual PCIe Placeholder Endpoint
 |           |                               +-01.0-[44]----00.0  Broadcom / LSI Virtual PCIe Placeholder Endpoint
 |           |                               +-02.0-[45]--+-00.0  Intel Corporation Ethernet Controller E810-C for QSFP
 |           |                               |            \-00.1  Intel Corporation Ethernet Controller E810-C for QSFP
 |           |                               +-03.0-[46]----00.0  NVIDIA Corporation GH100 [H100 PCIe]   GPU2 NVLINK 2
 |           |                               +-04.0-[47]----00.0  NVIDIA Corporation GH100 [H100 PCIe]   GPU3 NVLINK 2
 |           |                               \-1f.0-[48]----00.0  Broadcom / LSI PCIe Switch management endpoint
 \-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14a4
             +-01.1-[01-07]----00.0-[02-07]--+-00.0-[03]----00.0  Broadcom / LSI Virtual PCIe Placeholder Endpoint
             |                               +-01.0-[04]----00.0  Broadcom / LSI Virtual PCIe Placeholder Endpoint
             |                               +-02.0-[05]----00.0  Broadcom / LSI Virtual PCIe Placeholder Endpoint
             |                               +-03.0-[06]----00.0  NVIDIA Corporation GH100 [H100 PCIe]   GPU0 NVLINK 1
             |                               \-04.0-[07]----00.0  NVIDIA Corporation GH100 [H100 PCIe]   GPU1 NVLINK 1
             +-05.1-[08]--+-00.0  Intel Corporation Ethernet Controller X710 for 10GBASE-T
             |            \-00.1  Intel Corporation Ethernet Controller X710 for 10GBASE-T
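For what it's worth, NCCL's standard debug switches (not something tried in this thread) can give more visibility into where the collective is stalling; a rough sketch of enabling them before the engine starts, with the model name only as an example:

import os

# Standard NCCL debugging knobs: print initialization and topology decisions
# from every rank. These must be set before NCCL is initialized, or exported
# in the shell before launching.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"

from vllm import LLM

# Example model from the original report; any multi-GPU load will do.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=2)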
agt commented 5 months ago

So far we're seeing this on AMD and Intel CPUs with Ada/Hopper GPUs. (My collect_env output is at #3892.) @alexanderfrey what hardware are you using?

Testing various releases of the stock Docker containers with Llama2-70B, I see:

v0.4.0: Hangs (nvidia-nccl-cu12 2.18.1; libnccl2 2.17.1-1+cuda12.1)
v0.3.3: OK (nvidia-nccl-cu12 2.18.1; libnccl2 2.17.1-1+cuda12.1)
v0.3.2: OK
v0.2.7: OK

It was straightforward to test by swapping containers, but I won't have time to perform a full bisect/rebuild 0.3.3->0.4.0 for a few weeks.

youkaichao commented 5 months ago

@agt did you change the nccl version via VLLM_NCCL_SO_PATH? Normally people don't use nccl 2.17.1.

agt commented 5 months ago

@youkaichao That's the version shipped in https://hub.docker.com/r/vllm/vllm-openai - happy to swap in a new version via VLLM_NCCL_SO_PATH; which would you suggest?

venki-lfc commented 5 months ago

@agt did you change the nccl version via VLLM_NCCL_SO_PATH? Normally people don't use nccl 2.17.1.

I just did a pip install vllm on the pytorch/pytorch:2.1.2-cuda12.1-cudnn8-devel image, so nccl 2.17.1 came along with it.

venki-lfc commented 5 months ago

I just found a solution that works! I set the environment variable export NCCL_P2P_DISABLE=1 and load the model like this:

from langchain_community.llms import VLLM

llm = VLLM(model="./mixtral-8x7B-instruct-v0.1/snapshots/1e637f2d7cb0a9d6fb1922f305cb784995190a83", tensor_parallel_size=4, trust_remote_code=True, enforce_eager=True)

This works for me now :)
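For reference, the same workaround expressed with the plain vllm API looks roughly like this (a sketch; the model path is just the snapshot path from the snippet above):

# The key part is the environment variable, e.g. launched as:
#   NCCL_P2P_DISABLE=1 python load_model.py
import os

# Setting it in-process before the engine starts usually works too,
# as long as nothing has initialized NCCL yet.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")

from vllm import LLM

llm = LLM(
    model="./mixtral-8x7B-instruct-v0.1/snapshots/1e637f2d7cb0a9d6fb1922f305cb784995190a83",
    tensor_parallel_size=4,
    trust_remote_code=True,
    enforce_eager=True,  # also skips CUDA graph capture, matching the snippet above
)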

agt commented 5 months ago

@venki-lfc glad to hear! Disabling P2P will hurt performance, so I'd like to keep pursuing this - want to keep this issue open, or should I create a new one?

venki-lfc commented 5 months ago

@venki-lfc glad to hear! Disabling P2P will hurt performance, so I'd like to keep pursuing this - want to keep this issue open, or should I create a new one?

I guess we can keep the issue open :) Obviously mine is just a workaround and doesn't address the root cause of the issue.

agt commented 5 months ago

@agt did you change the nccl version via VLLM_NCCL_SO_PATH? Normally people don't use nccl 2.17.1.

Ahh - 2.17.1 was the system NCCL installed under /usr/lib; the PyPI version was hiding under 'libnccl.so.2' and is indeed NCCL 2.18.1+cuda12.1. That's consistent with the PyTorch 2.1.2 requirements.
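For anyone double-checking which NCCL their PyTorch wheel actually bundles, a quick sanity check (assuming the same Python environment) is:

import torch

# Reports the NCCL version PyTorch itself was built with,
# e.g. (2, 18, 1) for the torch 2.1.2 cu121 wheels.
print(torch.cuda.nccl.version())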

youkaichao commented 5 months ago

I just found a solution that works! I set the environment variable export NCCL_P2P_DISABLE=1

Good job! nccl is quite a black box, and we have a hard time with it :(

nidhishs commented 5 months ago

Thanks @venki-lfc, this matches my experience with 0.4.0.post1 on only 4 specific GPUs of an 8x H100-PCIe system: I only see this behavior for these 4 specific GPUs on the system; other configurations (e.g. 1, 2, or 8 GPU) appear unaffected even when they utilize the same hardware.

I suspect there's some sort of NCCL race/deadlock occurring, triggered by the differing PCIe bus layouts of GPUs 0/1/2/3 (with 0/1 and 2/3 separated by multiple PCIe hops) and GPUs 4/5/6/7 (three of which share a common PCIe switch).

I've tested on both 8xH100 and 8xA100-40GB and cannot load a model even with tensor-parallel-size=1. I've also tried disabling P2P, but that still doesn't help. Any suggestions? Here's my env:

PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31

Python version: 3.11.8 (main, Feb 25 2024, 16:41:26) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1048-oracle-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             256
On-line CPU(s) list:                0-254
Off-line CPU(s) list:               255
Thread(s) per core:                 1
Core(s) per socket:                 64
Socket(s):                          2
NUMA node(s):                       8
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              1
Model name:                         AMD EPYC 7J13 64-Core Processor
Stepping:                           1
Frequency boost:                    enabled
CPU MHz:                            2550.000
CPU max MHz:                        3673.0950
CPU min MHz:                        1500.0000
BogoMIPS:                           4900.16
Virtualization:                     AMD-V
L1d cache:                          2 MiB
L1i cache:                          2 MiB
L2 cache:                           32 MiB
L3 cache:                           256 MiB
NUMA node0 CPU(s):                  0-15,128-143
NUMA node1 CPU(s):                  16-31,144-159
NUMA node2 CPU(s):                  32-47,160-175
NUMA node3 CPU(s):                  48-63,176-191
NUMA node4 CPU(s):                  64-79,192-207
NUMA node5 CPU(s):                  80-95,208-223
NUMA node6 CPU(s):                  96-111,224-239
NUMA node7 CPU(s):                  112-127,240-254
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:           Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] torchvision==0.17.1+cu121
[pip3] triton==2.1.0
[conda] Could not collect
agt commented 5 months ago

I suspect there's some sort of NCCL race/deadlock occurring, triggered by the differing PCIe bus layouts of GPUs 0/1/2/3 (with 0/1 and 2/3 separated by multiple PCIe hops) and GPUs 4/5/6/7 (three of which share a common PCIe switch). I've tested on both 8xH100 and 8xA100-40GB and cannot load a model even with tensor-parallel-size=1. I've also tried disabling P2P, but that still doesn't help. Any suggestions?

Disabling CUDA graphs (--enforce-eager) seems to sidestep the problem for me, like NCCL_P2P_DISABLE=1, at the cost of reduced performance. There have been lots of changes to PyTorch, NCCL, etc. since 0.4.0.post1; I hope to build the current code this week and test with CUDA graphs enabled. Update: the apparent fix was a one-time fluke; see the next comment.

agt commented 5 months ago

@youkaichao Thank you for #4079 - dropping that into my 0.4.0.post1 container, I found that the 3 non-lead workers were all stuck inside CustomAllreduce._gather_ipc_meta(), despite the code's intention to disable custom all-reduce because "it's not supported on more than two PCIe-only GPUs".

Last call logged by each process before the stall (full logs @ vllm_trace_frame_for_process.tgz):

==> vllm_trace_frame_for_process_1_thread_139973314331072_at_2024-04-15_17:58:24.978956.log <==
2024-04-15 17:59:04.732274 Return from init_device in /workspace/vllm/worker/worker.py:104

==> vllm_trace_frame_for_process_1004_thread_139786598171072_at_2024-04-15_17:58:33.965163.log <==
2024-04-15 17:58:43.840296 Return from get_node_and_gpu_ids in /workspace/vllm/engine/ray_utils.py:53

==> vllm_trace_frame_for_process_1113_thread_140163469054400_at_2024-04-15_17:58:36.739553.log <==
2024-04-15 17:59:04.981153 Call to _gather_ipc_meta in /workspace/vllm/model_executor/parallel_utils/custom_all_reduce.py:222

==> vllm_trace_frame_for_process_1228_thread_139805130449344_at_2024-04-15_17:58:39.506931.log <==
2024-04-15 17:59:05.511878 Call to _gather_ipc_meta in /workspace/vllm/model_executor/parallel_utils/custom_all_reduce.py:222

==> vllm_trace_frame_for_process_1333_thread_140525308137920_at_2024-04-15_17:58:42.339446.log <==
2024-04-15 17:59:04.952552 Call to _gather_ipc_meta in /workspace/vllm/model_executor/parallel_utils/custom_all_reduce.py:222

Launching with --disable-custom-all-reduce has been solid across dozens of restarts in various configurations, so for me the mystery is now why the workers end up in that code path at all.

@venki-lfc @nidhishs would you mind checking whether adding that flag fixes things for you? (I believe disabling custom all-reduce should impact performance less than disabling P2P altogether.)
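If it's easier to test from Python rather than the server CLI, the flag maps to the disable_custom_all_reduce constructor argument, roughly (a sketch; the model name is only an example):

from vllm import LLM

# Equivalent of --disable-custom-all-reduce: skip the custom all-reduce kernel
# and fall back to NCCL for tensor-parallel collectives.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,
)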

venki-lfc commented 5 months ago

Hello @agt, --disable-custom-all-reduce did not work for me. So far I can only run the model on multiple GPUs via export NCCL_P2P_DISABLE=1 together with --enforce-eager.

agt commented 5 months ago

--disable-custom-all-reduce did not work for me. So far I can only run the model on multiple GPUs via export NCCL_P2P_DISABLE=1 together with --enforce-eager.

Hi @venki-lfc, sorry to hear that didn't work! 0.4.1 will include an option to log all function calls; perhaps doing so will identify the culprit, as it did for me. I'd be happy to review if you post that info in a new bug.

paniabhisek commented 1 month ago

I am using microsoft/Phi-3-vision-128k-instruct and it gives an out-of-memory error. But if I use facebook/opt-13b, it works fine even though that is a much bigger model.

Command with output

$ vllm serve microsoft/Phi-3-vision-128k-instruct     --tensor-parallel-size 4 --trust-remote-code --dtype=half
INFO 07-30 18:05:10 api_server.py:286] vLLM API server version 0.5.3.post1
INFO 07-30 18:05:10 api_server.py:287] args: Namespace(model_tag='microsoft/Phi-3-vision-128k-instruct', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, model='microsoft/Phi-3-vision-128k-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7950f1021750>)
WARNING 07-30 18:05:11 config.py:1433] Casting torch.bfloat16 to torch.float16.
INFO 07-30 18:05:15 config.py:723] Defaulting to use mp for distributed inference
WARNING 07-30 18:05:15 arg_utils.py:776] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 07-30 18:05:15 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='microsoft/Phi-3-vision-128k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-vision-128k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/Phi-3-vision-128k-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 07-30 18:05:15 multiproc_gpu_executor.py:60] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-30 18:05:15 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 07-30 18:05:15 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-30 18:05:15 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:15 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:15 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:15 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:15 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:15 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:15 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:18 utils.py:774] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:18 utils.py:774] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:18 utils.py:774] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:18 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-30 18:05:18 utils.py:774] Found nccl from library libnccl.so.2
INFO 07-30 18:05:18 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 07-30 18:05:19 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=690776) WARNING 07-30 18:05:19 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=690777) WARNING 07-30 18:05:19 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=690775) WARNING 07-30 18:05:19 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 07-30 18:05:19 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x79519cefdc90>, local_subscribe_port=45175, remote_subscribe_port=None)
INFO 07-30 18:05:19 model_runner.py:720] Starting to load model microsoft/Phi-3-vision-128k-instruct...
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:19 model_runner.py:720] Starting to load model microsoft/Phi-3-vision-128k-instruct...
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:19 model_runner.py:720] Starting to load model microsoft/Phi-3-vision-128k-instruct...
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:19 model_runner.py:720] Starting to load model microsoft/Phi-3-vision-128k-instruct...
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:19 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:19 selector.py:54] Using XFormers backend.
INFO 07-30 18:05:19 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-30 18:05:19 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:19 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:19 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:19 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:19 selector.py:54] Using XFormers backend.
INFO 07-30 18:05:19 weight_utils.py:224] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:19 weight_utils.py:224] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:19 weight_utils.py:224] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:19 weight_utils.py:224] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.12it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.33it/s]

INFO 07-30 18:05:21 model_runner.py:732] Loading model weights took 2.1571 GB
(VllmWorkerProcess pid=690775) INFO 07-30 18:05:21 model_runner.py:732] Loading model weights took 2.1571 GB
(VllmWorkerProcess pid=690777) INFO 07-30 18:05:22 model_runner.py:732] Loading model weights took 2.1571 GB
(VllmWorkerProcess pid=690776) INFO 07-30 18:05:22 model_runner.py:732] Loading model weights took 2.1571 GB
lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
(VllmWorkerProcess pid=690775) lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(VllmWorkerProcess pid=690775)   warnings.warn(
(VllmWorkerProcess pid=690777) lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(VllmWorkerProcess pid=690777)   warnings.warn(
(VllmWorkerProcess pid=690776) lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(VllmWorkerProcess pid=690776)   warnings.warn(
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: CUDA out of memory. Tried to allocate 8.27 GiB. GPU  has a total capacity of 15.56 GiB of which 6.98 GiB is free. Including non-PyTorch memory, this process has 8.58 GiB memory in use. Of the allocated memory 7.80 GiB is allocated by PyTorch, and 590.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), Traceback (most recent call last):
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/worker/worker.py", line 179, in determine_num_available_blocks
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     self.model_runner.profile_run()
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/worker/model_runner.py", line 935, in profile_run
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/worker/model_runner.py", line 1354, in execute_model
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/phi3v.py", line 529, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     vision_embeddings = self.vision_embed_tokens(
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/phi3v.py", line 163, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     img_features = self.get_img_features(pixel_values)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/phi3v.py", line 87, in get_img_features
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     img_feature = self.img_processor(img_embeds)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/clip.py", line 289, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self.vision_model(pixel_values=pixel_values)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/clip.py", line 267, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     hidden_states = self.encoder(inputs_embeds=hidden_states)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/clip.py", line 235, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     hidden_states = encoder_layer(hidden_states)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/clip.py", line 195, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     hidden_states, _ = self.self_attn(hidden_states=hidden_states)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 280, in forward
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.27 GiB. GPU  has a total capacity of 15.56 GiB of which 6.98 GiB is free. Including non-PyTorch memory, this process has 8.58 GiB memory in use. Of the allocated memory 7.80 GiB is allocated by PyTorch, and 590.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorkerProcess pid=690777) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: CUDA out of memory. Tried to allocate 8.27 GiB. GPU  has a total capacity of 15.56 GiB of which 6.98 GiB is free. Including non-PyTorch memory, this process has 8.58 GiB memory in use. Of the allocated memory 7.80 GiB is allocated by PyTorch, and 590.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), Traceback (most recent call last):
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/worker/worker.py", line 179, in determine_num_available_blocks
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     self.model_runner.profile_run()
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/worker/model_runner.py", line 935, in profile_run
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/worker/model_runner.py", line 1354, in execute_model
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/phi3v.py", line 529, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     vision_embeddings = self.vision_embed_tokens(
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/phi3v.py", line 163, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     img_features = self.get_img_features(pixel_values)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/phi3v.py", line 87, in get_img_features
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     img_feature = self.img_processor(img_embeds)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/clip.py", line 289, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self.vision_model(pixel_values=pixel_values)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/clip.py", line 267, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     hidden_states = self.encoder(inputs_embeds=hidden_states)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/clip.py", line 235, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     hidden_states = encoder_layer(hidden_states)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "/vllm/vllm/model_executor/models/clip.py", line 195, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     hidden_states, _ = self.self_attn(hidden_states=hidden_states)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]   File "lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 280, in forward
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]     attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.27 GiB. GPU  has a total capacity of 15.56 GiB of which 6.98 GiB is free. Including non-PyTorch memory, this process has 8.58 GiB memory in use. Of the allocated memory 7.80 GiB is allocated by PyTorch, and 590.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorkerProcess pid=690775) ERROR 07-30 18:05:25 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/bin/vllm", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:   File "/vllm/vllm/scripts.py", line 149, in main
[rank0]:     args.dispatch_function(args)
[rank0]:   File "/vllm/vllm/scripts.py", line 29, in serve
[rank0]:     asyncio.run(run_server(args))
[rank0]:   File "/lib/python3.10/asyncio/runners.py", line 44, in run
[rank0]:     return loop.run_until_complete(main)
[rank0]:   File "/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank0]:     return future.result()
[rank0]:   File "/vllm/vllm/entrypoints/openai/api_server.py", line 289, in run_server
[rank0]:     app = await init_app(args, llm_engine)
[rank0]:   File "/vllm/vllm/entrypoints/openai/api_server.py", line 229, in init_app
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/vllm/vllm/engine/async_llm_engine.py", line 470, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/vllm/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/vllm/vllm/engine/async_llm_engine.py", line 551, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/vllm/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/vllm/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/vllm/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]:     num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]:   File "/vllm/vllm/executor/multiproc_gpu_executor.py", line 195, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/vllm/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/vllm/vllm/worker/model_runner.py", line 935, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/vllm/vllm/worker/model_runner.py", line 1354, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/vllm/vllm/model_executor/models/phi3v.py", line 529, in forward
[rank0]:     vision_embeddings = self.vision_embed_tokens(
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/vllm/vllm/model_executor/models/phi3v.py", line 163, in forward
[rank0]:     img_features = self.get_img_features(pixel_values)
[rank0]:   File "/vllm/vllm/model_executor/models/phi3v.py", line 87, in get_img_features
[rank0]:     img_feature = self.img_processor(img_embeds)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/vllm/vllm/model_executor/models/clip.py", line 289, in forward
[rank0]:     return self.vision_model(pixel_values=pixel_values)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/vllm/vllm/model_executor/models/clip.py", line 267, in forward
[rank0]:     hidden_states = self.encoder(inputs_embeds=hidden_states)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/vllm/vllm/model_executor/models/clip.py", line 235, in forward
[rank0]:     hidden_states = encoder_layer(hidden_states)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/vllm/vllm/model_executor/models/clip.py", line 195, in forward
[rank0]:     hidden_states, _ = self.self_attn(hidden_states=hidden_states)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 280, in forward
[rank0]:     attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.27 GiB. GPU
ERROR 07-30 18:05:25 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 690777 died, exit code: -15
INFO 07-30 18:05:25 multiproc_worker_utils.py:123] Killing local vLLM worker processes
/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

My environment

PyTorch version: 2.3.1

OS: Ubuntu 22.04.4 LTS
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
CMake version: version 3.30.1
Libc version: (Ubuntu GLIBC 2.35-0ubuntu3.8)

Python version: 3.10.12
Is CUDA available: True
CUDA runtime version: 12.4

Nvidia driver version: 550.54.14

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
Neuron SDK Version: N/A
vLLM Version: -e git+https://github.com/vllm-project/vllm.git@c66c7f86aca956014d9ec6cc7a3e6001037e4655#egg=vllm
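For anyone hitting the same OOM: the failure above happens inside `profile_run`, in the CLIP vision tower of Phi-3-vision, before the KV cache is even allocated, so the only real levers are the ones the error message itself points at (fragmentation via `PYTORCH_CUDA_ALLOC_CONF`) plus shrinking what the profiling pass has to hold. The snippet below is only a sketch of that idea, not an official fix; the checkpoint name, GPU count, and all flag values are illustrative assumptions, not taken from this issue.

```python
# Sketch only: reduce memory pressure during the profiling pass that OOMs above.
# Assumptions (not from this issue): microsoft/Phi-3-vision-128k-instruct as the
# checkpoint, 4 GPUs, and illustrative values for the engine arguments.
import os

# Suggested by the OOM message itself to reduce fragmentation; set it before
# the first CUDA allocation, i.e. before importing vllm/torch does any work.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",  # assumed checkpoint
    trust_remote_code=True,
    tensor_parallel_size=4,          # shard across 4 GPUs
    gpu_memory_utilization=0.90,     # leave headroom for the vision tower
    max_model_len=8192,              # illustrative: far below the 128k default
    max_num_seqs=8,                  # fewer dummy sequences during profiling
)
```

The same knobs should be available on the server command line as `--tensor-parallel-size`, `--gpu-memory-utilization`, `--max-model-len`, and `--max-num-seqs` when launching with `vllm serve`.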