vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Llama-3.2-11B-Vision-Instruct server crashes when asked guided generation #8952

Closed miridih-jhkim11 closed 3 weeks ago

miridih-jhkim11 commented 1 month ago

Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jul 29 2024, 17:02:10) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 550.90.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 1500.000
CPU max MHz: 3529.0520
CPU min MHz: 1500.0000
BogoMIPS: 4899.71
Virtualization: AMD-V
L1d cache: 4 MiB
L1i cache: 4 MiB
L2 cache: 64 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.1
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev28+gb0298aa8
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     0-63,128-191  0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Model Input Dumps

No response

πŸ› Describe the bug

I am serving Llama-3.2-11B-Vision-Instruct on a single A100 80GB GPU with the following command.

nohup vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --port 8000 --api-key qwen2-4e1fbc5e56f7fbE1 --gpu-memory-utilization 0.9 --download_dir /workspace/vllm_models/ --cpu-offload-gb 5000 --swap-space 50 --max-model-len 4096 --max_num_seqs=32 --enforce_eager > llama_vision-output_240930.log 2>&1 &

The server crashes whenever I add the response_format or guided_json parameter to my client.chat.completions.create() call.

D0 Inference

Without structured output the batch completes normally; when structured output is requested, the server crashes after roughly 40 seconds.

import asyncio
import nest_asyncio
from openai import AsyncOpenAI

# Allow nested event loops
nest_asyncio.apply()

async def run_batch_requests(img_urls, image_info_prompt):
    client = AsyncOpenAI(
        base_url=llama_v_url,
        api_key=llama_v_api_key,
    )
    model = llama_v_model

    async def single_request(img_url, image_info_prompt):
        messages = [
            # {"role": "system", "content": d0_system_prompt},  # Error: Prompting with images is incompatible with system messages.
            {"role": "user", "content": [
                {"type": "text", "text": image_info_prompt},
                {"type": "image_url", "image_url": {"url": img_url}},
            ]},
        ]

        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000,
                # response_format=dict(type="json_object") # Crashes
                extra_body=dict(guided_json=d0_schema) # Crashes when applied to llama model
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Error:{str(e)}"

    tasks = [single_request(url, image_info_prompt) for url in img_urls]
    results = await asyncio.gather(*tasks)

    return results

# Example usage
img_urls = images

# Run the asynchronous function synchronously
results = asyncio.get_event_loop().run_until_complete(run_batch_requests(img_urls, image_info_prompt))


heheda12345 commented 1 month ago

Can you provide a runnable client-side script? This code is missing a lot of variables.

miridih-jhkim11 commented 1 month ago

@heheda12345 Here is a runnable script. If I uncomment "response_format", the server crashes. (A guided_json variant is sketched after the script.)

import asyncio
import nest_asyncio
from openai import AsyncOpenAI

nest_asyncio.apply()

async def run_batch_requests(img_urls, user_prompt_template):
    client = AsyncOpenAI(
        base_url=[url],
        api_key=[api_key],
    )
    model = "meta-llama/Llama-3.2-11B-Vision-Instruct"

    async def single_request(img_url):
        messages = [

            {"role": "user", "content": [
                {"type": "text", "text": user_prompt_template},
                {"type": "image_url", "image_url": {"url": img_url}}
            ]}
        ]

        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000,
                temperature=0.1,
                # response_format=dict(type="json_object")
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"이미지 처리 쀑 였λ₯˜ λ°œμƒ {img_url}: {str(e)}"

    tasks = [single_request(url) for url in img_urls]
    results = await asyncio.gather(*tasks)

    return results

# Example usage
img_urls = [
    "https://unsplash.com/photos/8manzosDSGM/download?force=true",
    "https://unsplash.com/photos/yC-Yzbqy7PY/download?force=true",
    "https://unsplash.com/photos/82TpEld0_e4/download?force=true",
    "https://unsplash.com/photos/wawEfYdpkag/download?force=true",
    "https://unsplash.com/photos/xMSxY4WWQkE/download?force=true",
    "https://unsplash.com/photos/hpjSkU2UYSU/download?force=true",
    "https://unsplash.com/photos/TkXJoA_sn1w/download?force=true",
    "https://unsplash.com/photos/q54Oxq44MZs/download?force=true",
    "https://unsplash.com/photos/8mikJ83LmSQ/download?force=true",
    "https://unsplash.com/photos/CAm0Ht0rBMw/download?force=true"
]
user_prompt_template ="""
[INSTRUCTIONS]

1. **Describe the given image in detail.**

   - Provide a comprehensive description of the image, including all relevant details such as objects, scenes, actions, colors, textures, and any other notable elements.
   - Be as specific as possible to capture the essence of the image.

2. **Indicate the types and counts of objects appearing in the given image in JSON format.**

   - Use the following schema for the JSON output:

     ```json
     {
       "image_info": "<Detailed description of the image>",
       "object": {
         "object_1": count,
         "object_2": count,
         ...
       }
     }
     ```

   - Replace `<Detailed description of the image>` with the description from step 1.
   - List each object type as a key in the `"object"` dictionary, with the corresponding count as the value.
   - Ensure that object names are clear and consistent (e.g., "tree", "person", "car").

[NOTE]

Example Output:

{
  "image_info": "A serene beach scene at sunset with two palm trees silhouetted against the orange sky, gentle waves lapping at the shore, and a small boat anchored near the horizon.",
  "object": {
    "palm tree": 2,
    "boat": 1,
    "wave": 5
  }
}
"""

results = asyncio.get_event_loop().run_until_complete(run_batch_requests(img_urls, user_prompt_template))

results
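
For completeness, the guided_json variant that triggers the same crash looks roughly like this. This is only a sketch: object_count_schema is a hypothetical schema matching the prompt above, and [url]/[api_key] are the same placeholders used in the script.

```python
import asyncio
from openai import AsyncOpenAI

# Hypothetical JSON schema matching the prompt's expected output.
object_count_schema = {
    "type": "object",
    "properties": {
        "image_info": {"type": "string"},
        "object": {"type": "object", "additionalProperties": {"type": "integer"}},
    },
    "required": ["image_info", "object"],
}

async def guided_request(img_url, prompt):
    client = AsyncOpenAI(base_url=[url], api_key=[api_key])  # same placeholders as above
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": img_url}},
        ]}],
        max_tokens=1000,
        temperature=0.1,
        # Either of these crashes the unpatched server:
        # response_format={"type": "json_object"},
        extra_body={"guided_json": object_count_schema},
    )
    return response.choices[0].message.content

print(asyncio.get_event_loop().run_until_complete(
    guided_request(img_urls[0], user_prompt_template)))
```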

pavlo-ruban commented 1 month ago

So the issue is that the llama-3.2-vision models have an extra token, <|image|>, with id 128256 (0-indexed), while the scores are only generated for 128256 tokens (one token short of the tokenizer vocabulary). The actual error is an index error here:

https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/model_executor/guided_decoding/outlines_logits_processors.py#L82

You have a tensor shaped (128256,), but your allowed tokens may include this last, illegal token (which I don't believe is intended for generation). I worked around it on my side by adding the token to the disallowed tokens, since I host my own model. But there is definitely an inconsistency between the llama config and the actual vocab size, and/or broken behaviour for this special token.
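
To see this failure mode in isolation, here is a minimal CPU-only sketch in plain PyTorch. The 128256 sizes come from this thread, the other ids in allowed_tokens are made up, and the mask update only mirrors the pattern of the linked logits processor rather than reproducing the actual vLLM code.

```python
import torch

NUM_LOGITS = 128256       # width of the logits row produced by the model
IMAGE_TOKEN_ID = 128256   # <|image|> id reported in this thread (one past the last valid index)

scores = torch.zeros(NUM_LOGITS)                    # stand-in for one row of logits
allowed_tokens = [128000, 128001, IMAGE_TOKEN_ID]   # hypothetical FSM-allowed ids

mask = torch.full_like(scores, float("-inf"))
try:
    mask[allowed_tokens] = 0   # IndexError on CPU; device-side assert and engine crash on CUDA
except IndexError as err:
    print("crash reproduced:", err)

# The guard discussed in the replies below avoids the out-of-bounds index:
allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]]
mask[allowed_tokens] = 0
print("tokens still allowed:", int(torch.isfinite(mask).sum()))
```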

miridih-jhkim11 commented 1 month ago

@pavlo-ruban

You have a tensor shaped (128256,), but your allowed tokens may include this last, illegal token (which I don't believe is intended for generation). I worked around it on my side by adding the token to the disallowed tokens, since I host my own model. But there is definitely an inconsistency between the llama config and the actual vocab size, and/or broken behaviour for this special token.

Thanks to your advice, I added these lines of code:

        allowed_tokens = [token for token in allowed_tokens if token != 128256]
        mask[allowed_tokens] = 0

Is this the correct way to do it? The server no longer crashes, but I still couldn't get structured output.

pavlo-ruban commented 1 month ago

@miridih-jhkim11 I went with allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]] instead. When I compared against the specific token id the way you are doing, I ran into a CUDA graph problem: something failed around self.cuda_graph.capture_end(). Do you mean you are getting a response that is just not structured, or a structured response with empty values?

Jason-CKY commented 1 month ago

So the issue is that the llama-3.2-vision models have an extra token, <|image|>, with id 128256 (0-indexed), while the scores are only generated for 128256 tokens (one token short of the tokenizer vocabulary). The actual error is an index error here:

https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/model_executor/guided_decoding/outlines_logits_processors.py#L82

You have a tensor shaped (128256,), but your allowed tokens may include this last, illegal token (which I don't believe is intended for generation). I worked around it on my side by adding the token to the disallowed tokens, since I host my own model. But there is definitely an inconsistency between the llama config and the actual vocab size, and/or broken behaviour for this special token.

Thanks for this. It worked with the following patch:

allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]]
mask[allowed_tokens] = 0

I think you need to run with --enforce-eager so that CUDA graphs are not captured.

mil-ad commented 1 month ago

I'm having a similar(ish) problem with JSON output. When I add "response_format": {"type": "json_object"} to my request, vLLM crashes with this stack trace:

INFO 10-16 13:24:32 engine.py:288] Added request chat-bc566e62dd194175af7fe161cad40248.
Compiling FSM index for all state transitions: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 10.86it/s]
INFO 10-16 13:24:37 metrics.py:351] Avg prompt throughput: 240.2 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%.
Compiling FSM index for all state transitions: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [00:00<00:00, 12.84it/s]
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [7,0,0], thread: [123,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
CRITICAL 10-16 13:24:39 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO:     100.81.49.94:64400 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-16 13:24:39 engine.py:157] RuntimeError('CUDA error: device-side assert triggered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')
ERROR 10-16 13:24:39 engine.py:157] Traceback (most recent call last):
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 10-16 13:24:39 engine.py:157]     self.run_engine_loop()
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 10-16 13:24:39 engine.py:157]     request_outputs = self.engine_step()
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 10-16 13:24:39 engine.py:157]     raise e
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 10-16 13:24:39 engine.py:157]     return self.engine.step()
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1387, in step
ERROR 10-16 13:24:39 engine.py:157]     outputs = self.model_executor.execute_model(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 82, in execute_model
ERROR 10-16 13:24:39 engine.py:157]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 155, in _driver_execute_model
ERROR 10-16 13:24:39 engine.py:157]     return self.driver_worker.execute_model(execute_model_req)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-16 13:24:39 engine.py:157]     output = self.model_runner.execute_model(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-16 13:24:39 engine.py:157]     return func(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/worker/enc_dec_model_runner.py", line 225, in execute_model
ERROR 10-16 13:24:39 engine.py:157]     output: SamplerOutput = self.model.sample(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/models/mllama.py", line 940, in sample
ERROR 10-16 13:24:39 engine.py:157]     next_tokens = self.sampler(logits, sampling_metadata)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-16 13:24:39 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-16 13:24:39 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 274, in forward
ERROR 10-16 13:24:39 engine.py:157]     maybe_deferred_sample_results, maybe_sampled_tokens_tensor = _sample(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 879, in _sample
ERROR 10-16 13:24:39 engine.py:157]     return _sample_with_torch(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 848, in _sample_with_torch
ERROR 10-16 13:24:39 engine.py:157]     return get_pythonized_sample_results(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 713, in get_pythonized_sample_results
ERROR 10-16 13:24:39 engine.py:157]     sample_results = _random_sample(seq_groups,
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 512, in _random_sample
ERROR 10-16 13:24:39 engine.py:157]     random_samples = random_samples.cpu()
ERROR 10-16 13:24:39 engine.py:157] RuntimeError: CUDA error: device-side assert triggered
ERROR 10-16 13:24:39 engine.py:157] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 10-16 13:24:39 engine.py:157] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 10-16 13:24:39 engine.py:157] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 10-16 13:24:39 engine.py:157] 
[rank0]:[E1016 13:24:39.286342262 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14b5e66fad10 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14b5e6826f08 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14b5e7a433e6 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x14b5e7a48600 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x14b5e7a4f2ba in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14b5e7a516fc in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14b5e66fad10 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14b5e6826f08 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14b5e7a433e6 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x14b5e7a48600 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x14b5e7a4f2ba in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14b5e7a516fc in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x14b5e76daa84 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)

INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [2860211]

Does this ring any bells?

I'm using vllm-0.6.3.dev155+gf3a507f1.d20241010-cp38-abi3-manylinux1_x86_64.whl btw

tjohnson31415 commented 3 weeks ago

Thanks @pavlo-ruban for finding the root cause and providing a fix! I just created a quick PR with your fix so that we can get this working without manual patching: https://github.com/vllm-project/vllm/pull/9631.