vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0
26.75k stars 3.91k forks source link

[Bug]: distributed inference for vl model crashed(so slow that the connection closed) #6894

Open JerryLu991223 opened 1 month ago

JerryLu991223 commented 1 month ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.31

Python version: 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.14-weride-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1660 SUPER
Nvidia driver version: 525.147.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           167
Model name:                      Intel(R) Xeon(R) W-1350P @ 4.00GHz
Stepping:                        1
CPU MHz:                         4450.086
CPU max MHz:                     5100.0000
CPU min MHz:                     800.0000
BogoMIPS:                        8016.00
Virtualization:                  VT-x
L1d cache:                       288 KiB
L1i cache:                       192 KiB
L2 cache:                        3 MiB
L3 cache:                        12 MiB
NUMA node0 CPU(s):               0-11
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap avx512ifma clflushopt intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.4
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==2.3.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] torchvision               0.18.0                   pypi_0    pypi
[conda] transformers              4.42.4                   pypi_0    pypi
[conda] transformers-stream-generator 0.0.4                    pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity
GPU0     X      0-11            N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

import os
import torch
import time
from io import BytesIO
from PIL import Image
import requests
import gradio as gr
from vllm import LLM, SamplingParams

model_path = "/workspace/models/phi-3"
llm = None

def load_model():
    start = time.time()
    global llm
    llm = LLM(
        model=model_path, max_model_len=4096, tensor_parallel_size=4, enforce_eager=True, disable_custom_all_reduce=True,
        gpu_memory_utilization=0.7, dtype=torch.float16, trust_remote_code=True)
    print("Successfully load model!")
    print(f"Load model cost: {time.time() - start} sec")

def predict(prompt, image):

    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.95,
        max_tokens=100)

    outputs = llm.generate(
        {
            "prompt": prompt,
            "multi_modal_data": {
                "image": image
            },
        },
        sampling_params=sampling_params)

    generated_text = ""
    for o in outputs:
        generated_text += o.outputs[0].text
    return generated_text

if __name__ == "__main__":
    load_model()
    prompt="[INST] <image>\nWhat is shown in this image? [/INST]"
    prompt = "How are you today?"
    image_url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
    image = Image.open(BytesIO(requests.get(image_url).content))
    output = predict(prompt, image)
    print(output)

When I run the script, I get normal log which indicates that I have loaded the model and begin to inference with GPU 100% utilization.


(RayWorkerWrapper pid=1392, ip=10.31.100.187) INFO 07-29 09:13:12 model_runner.py:255] Loading model weights took 2.1688 GB [repeated 2x across cluster]
administrator-ThinkStation-P350:5018:5173 [0] NCCL INFO Channel 00/1 : 1[0] -> 0[0] [receive] via NET/Socket/0/Shared
administrator-ThinkStation-P350:5018:5173 [0] NCCL INFO Channel 01/1 : 1[0] -> 0[0] [receive] via NET/Socket/0/Shared
INFO 07-29 09:22:30 distributed_gpu_executor.py:56] # GPU blocks: 584, # CPU blocks: 2730
Successfully load model!
Load model cost: 451.7983124256134 sec
Processed prompts:   0%|                                                                                            | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

But after that, I have waited a long period of time and finally get the error below

[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=176, OpType=ALLREDUCE, NumelIn=3072, NumelOut=3072, Timeout(ms)=600000) ran for 600056 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 0] Timeout at NCCL work: 176, last enqueued NCCL work: 176, last completed NCCL work: 175.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=176, OpType=ALLREDUCE, NumelIn=3072, NumelOut=3072, Timeout(ms)=600000) ran for 600056 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f50d8591897 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f50d986c1b2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f50d9870fd0 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f50d987231c in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f5125329bf4 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f51264f2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f51262bd353 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-07-29 08:05:22,129 E 4784 4918] logging.cc:108: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 2 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=176, OpType=ALLREDUCE, NumelIn=3072, NumelOut=3072, Timeout(ms)=600000) ran for 600056 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f50d8591897 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f50d986c1b2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f50d9870fd0 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f50d987231c in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f5125329bf4 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f51264f2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f51262bd353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f50d8591897 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7f50d94f4e33 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdbbf4 (0x7f5125329bf4 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f51264f2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f51262bd353 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-07-29 08:05:22,154 E 4784 4918] logging.cc:115: Stack trace: 
 /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x109aaea) [0x7f505135caea] ray::operator<<()
/opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0x109dd72) [0x7f505135fd72] ray::TerminateHandler()
/opt/conda/bin/../lib/libstdc++.so.6(+0xb135a) [0x7f51252ff35a] __cxxabiv1::__terminate()
/opt/conda/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7f51252ff3c5]
/opt/conda/bin/../lib/libstdc++.so.6(+0xb134f) [0x7f51252ff34f]
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xe32ee4) [0x7f50d94f4ee4] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/opt/conda/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7f5125329bf4] execute_native_thread_routine
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f51264f2609] start_thread
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f51262bd353] __clone

*** SIGABRT received at time=1722240322 on cpu 0 ***
PC: @     0x7f51261e100b  (unknown)  raise
    @     0x7f51264fe420       5072  (unknown)
    @     0x7f51252ff35a  (unknown)  __cxxabiv1::__terminate()
    @     0x7f51252ff070  (unknown)  (unknown)
[2024-07-29 08:05:22,154 E 4784 4918] logging.cc:440: *** SIGABRT received at time=1722240322 on cpu 0 ***
[2024-07-29 08:05:22,154 E 4784 4918] logging.cc:440: PC: @     0x7f51261e100b  (unknown)  raise
[2024-07-29 08:05:22,154 E 4784 4918] logging.cc:440:     @     0x7f51264fe420       5072  (unknown)
[2024-07-29 08:05:22,154 E 4784 4918] logging.cc:440:     @     0x7f51252ff35a  (unknown)  __cxxabiv1::__terminate()
[2024-07-29 08:05:22,155 E 4784 4918] logging.cc:440:     @     0x7f51252ff070  (unknown)  (unknown)
Fatal Python error: Aborted

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, PIL._imaging, _cffi_backend, zstandard.backend_c, yaml._yaml, websockets.speedups, markupsafe._speedups, PIL._imagingmath, PIL._webp, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, pyarrow.lib, pyarrow._json, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.optimize._direct, PIL._imagingft (total: 96)

I can normally do the inference with language model, but when it comes to vision language model, then I meet with this error. May I get any help and suggestions?

DarkLight1337 commented 1 month ago

Which HF repo did you originally get the model from?

JerryLu991223 commented 1 month ago

Which HF repo did you originally get the model from?

microsoft/Phi-3-vision-128k-instruct, https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/tree/main. Also, I meet with similar situation on llava-v1.6, so I suppose it is not the model that leads to the error.

DarkLight1337 commented 1 month ago

Can you try it on a smaller model like LLaVA-1.5 and see if you get the same problem?

JerryLu991223 commented 1 month ago

Can you try it on a smaller model like LLaVA-1.5 and see if you get the same problem?

But I find that in LLava-v1.5 collection, the smallest model is 7B and this one is 4B. Do you mean https://huggingface.co/llava-hf/llava-interleave-qwen-0.5b-hf?

DarkLight1337 commented 1 month ago

Sorry, I mean lighter weight, not necessarily in terms of number of parameters. I suggested LLaVA-1.5 because it doesn't crop the image into patches before inputting them into the model, so it should take less resources to run it.

JerryLu991223 commented 1 month ago

Sorry, I mean lighter weight, not necessarily in terms of number of parameters. I suggested LLaVA-1.5 because it doesn't crop the image into patches before inputting them into the model, so it should take less resources to run it.

It seems I meet with the same problem. Also, for phi-3(4B), it cost 450 sec to load model while it cost only 10sec to load model on a single node. Thus, I think there should be some errors that greatly slow down the loading and inference.

DarkLight1337 commented 1 month ago

I have updated the distributed broadcasting logic in #6836. See if it helps.

DarkLight1337 commented 1 month ago

Regarding slow model loading, please follow this guide.