vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Llama-3.2-11B-Vision-Instruct server crashes when asked guided generation #8952

Closed miridih-jhkim11 closed 3 weeks ago

miridih-jhkim11 commented 1 month ago

Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jul 29 2024, 17:02:10) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 550.90.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 1500.000
CPU max MHz: 3529.0520
CPU min MHz: 1500.0000
BogoMIPS: 4899.71
Virtualization: AMD-V
L1d cache: 4 MiB
L1i cache: 4 MiB
L2 cache: 64 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.1
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev28+gb0298aa8
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     0-63,128-191  0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Model Input Dumps

No response

πŸ› Describe the bug

I am serving Llama-3.2-11B-Vision-Instruct on a single A100 80GB GPU with the following command.

nohup vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --port 8000 --api-key qwen2-4e1fbc5e56f7fbE1 --gpu-memory-utilization 0.9 --download_dir /workspace/vllm_models/ --cpu-offload-gb 5000 --swap-space 50 --max-model-len 4096 --max_num_seqs=32 --enforce_eager > llama_vision-output_240930.log 2>&1 &

The server crashes whenever I add the response_format or guided_json parameter to my client.chat.completions.create() call.

D0 Inference

Without structured output the batch completes normally; when structured output is requested, the server crashes after roughly 40 seconds.

import asyncio
import nest_asyncio
from openai import AsyncOpenAI

# Allow nested event loops
nest_asyncio.apply()

async def run_batch_requests(img_urls, image_info_prompt):
    client = AsyncOpenAI(
        base_url=llama_v_url,
        api_key=llama_v_api_key,
    )
    model = llama_v_model

    async def single_request(img_url, image_info_prompt):
        messages = [
            # {"role": "system", "content": d0_system_prompt},  # Error: Prompting with images is incompatible with system messages.
            {"role": "user", "content": [
                {"type": "text", "text": image_info_prompt},
                {"type": "image_url", "image_url": {"url": img_url}},
            ]},
        ]

        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000,
                # response_format=dict(type="json_object") # Crashes
                extra_body=dict(guided_json=d0_schema) # Crashes when applied to llama model
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Error:{str(e)}"

    tasks = [single_request(url, image_info_prompt) for url in img_urls]
    results = await asyncio.gather(*tasks)

    return results

# Example usage
img_urls = images

# Run the asynchronous function synchronously
results = asyncio.get_event_loop().run_until_complete(run_batch_requests(img_urls, image_info_prompt))


heheda12345 commented 1 month ago

Can you provide a runnable client-side script? This code is missing a lot of variables.

miridih-jhkim11 commented 1 month ago

@heheda12345 Here is a runnable script. If I uncomment "response_format", the server crashes. (A guided_json variant is sketched after the script.)

import asyncio
import nest_asyncio
from openai import AsyncOpenAI

nest_asyncio.apply()

async def run_batch_requests(img_urls, user_prompt_template):
    client = AsyncOpenAI(
        base_url=[url],
        api_key=[api_key],
    )
    model = "meta-llama/Llama-3.2-11B-Vision-Instruct"

    async def single_request(img_url):
        messages = [

            {"role": "user", "content": [
                {"type": "text", "text": user_prompt_template},
                {"type": "image_url", "image_url": {"url": img_url}}
            ]}
        ]

        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000,
                temperature=0.1,
                # response_format=dict(type="json_object")
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"이미지 처리 쀑 였λ₯˜ λ°œμƒ {img_url}: {str(e)}"

    tasks = [single_request(url) for url in img_urls]
    results = await asyncio.gather(*tasks)

    return results

# Example usage
img_urls = [
    "https://unsplash.com/photos/8manzosDSGM/download?force=true",
    "https://unsplash.com/photos/yC-Yzbqy7PY/download?force=true",
    "https://unsplash.com/photos/82TpEld0_e4/download?force=true",
    "https://unsplash.com/photos/wawEfYdpkag/download?force=true",
    "https://unsplash.com/photos/xMSxY4WWQkE/download?force=true",
    "https://unsplash.com/photos/hpjSkU2UYSU/download?force=true",
    "https://unsplash.com/photos/TkXJoA_sn1w/download?force=true",
    "https://unsplash.com/photos/q54Oxq44MZs/download?force=true",
    "https://unsplash.com/photos/8mikJ83LmSQ/download?force=true",
    "https://unsplash.com/photos/CAm0Ht0rBMw/download?force=true"
]
user_prompt_template ="""
[INSTRUCTIONS]

1. **Describe the given image in detail.**

   - Provide a comprehensive description of the image, including all relevant details such as objects, scenes, actions, colors, textures, and any other notable elements.
   - Be as specific as possible to capture the essence of the image.

2. **Indicate the types and counts of objects appearing in the given image in JSON format.**

   - Use the following schema for the JSON output:

     ```json
     {
       "image_info": "<Detailed description of the image>",
       "object": {
         "object_1": count,
         "object_2": count,
         ...
       }
     }
     ```

   - Replace `<Detailed description of the image>` with the description from step 1.
   - List each object type as a key in the `"object"` dictionary, with the corresponding count as the value.
   - Ensure that object names are clear and consistent (e.g., "tree", "person", "car").

[NOTE]

Example Output:

{
  "image_info": "A serene beach scene at sunset with two palm trees silhouetted against the orange sky, gentle waves lapping at the shore, and a small boat anchored near the horizon.",
  "object": {
    "palm tree": 2,
    "boat": 1,
    "wave": 5
  }
}
"""

results = asyncio.get_event_loop().run_until_complete(run_batch_requests(img_urls, user_prompt_template))

results
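
For completeness, the guided_json variant that triggers the same crash looks roughly like this. This is only a sketch: object_count_schema is a hypothetical schema matching the prompt above, and [url]/[api_key] are the same placeholders used in the script.

```python
import asyncio
from openai import AsyncOpenAI

# Hypothetical JSON schema matching the prompt's expected output.
object_count_schema = {
    "type": "object",
    "properties": {
        "image_info": {"type": "string"},
        "object": {"type": "object", "additionalProperties": {"type": "integer"}},
    },
    "required": ["image_info", "object"],
}

async def guided_request(img_url, prompt):
    client = AsyncOpenAI(base_url=[url], api_key=[api_key])  # same placeholders as above
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": img_url}},
        ]}],
        max_tokens=1000,
        temperature=0.1,
        # Either of these crashes the unpatched server:
        # response_format={"type": "json_object"},
        extra_body={"guided_json": object_count_schema},
    )
    return response.choices[0].message.content

print(asyncio.get_event_loop().run_until_complete(
    guided_request(img_urls[0], user_prompt_template)))
```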

pavlo-ruban commented 1 month ago

So the issue is that the llama-3.2-vision models have an extra token, <|image|>, with id 128256 (0-indexed), while the scores are only generated for 128256 tokens (one token short of the tokenizer vocabulary). The actual error is an index error here:

https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/model_executor/guided_decoding/outlines_logits_processors.py#L82

You have a tensor shaped (128256,), but your allowed tokens may include this last, illegal token (which I don't believe is intended for generation). I worked around it on my side by adding the token to the disallowed tokens, since I host my own model. But there is definitely an inconsistency between the llama config and the actual vocab size, and/or broken behaviour for this special token.
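
To see this failure mode in isolation, here is a minimal CPU-only sketch in plain PyTorch. The 128256 sizes come from this thread, the other ids in allowed_tokens are made up, and the mask update only mirrors the pattern of the linked logits processor rather than reproducing the actual vLLM code.

```python
import torch

NUM_LOGITS = 128256       # width of the logits row produced by the model
IMAGE_TOKEN_ID = 128256   # <|image|> id reported in this thread (one past the last valid index)

scores = torch.zeros(NUM_LOGITS)                    # stand-in for one row of logits
allowed_tokens = [128000, 128001, IMAGE_TOKEN_ID]   # hypothetical FSM-allowed ids

mask = torch.full_like(scores, float("-inf"))
try:
    mask[allowed_tokens] = 0   # IndexError on CPU; device-side assert and engine crash on CUDA
except IndexError as err:
    print("crash reproduced:", err)

# The guard discussed in the replies below avoids the out-of-bounds index:
allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]]
mask[allowed_tokens] = 0
print("tokens still allowed:", int(torch.isfinite(mask).sum()))
```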

miridih-jhkim11 commented 1 month ago

@pavlo-ruban

You have a tensor shaped (128256,), but your allowed tokens may include this last, illegal token (which I don't believe is intended for generation). I worked around it on my side by adding the token to the disallowed tokens, since I host my own model. But there is definitely an inconsistency between the llama config and the actual vocab size, and/or broken behaviour for this special token.

Thanks to your advice, I added these lines of code:

        allowed_tokens = [token for token in allowed_tokens if token != 128256]
        mask[allowed_tokens] = 0

Is this the correct way to do it? The server no longer crashes, but I still couldn't get structured output.

pavlo-ruban commented 1 month ago

@miridih-jhkim11 I went with allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]] instead. When I compared against the specific token id the way you are doing, I ran into a CUDA graph problem: something failed around self.cuda_graph.capture_end(). Do you mean you are getting a response that is just not structured, or a structured response with empty values?

Jason-CKY commented 1 month ago

So the issue is that the llama-3.2-vision models have an extra token, <|image|>, with id 128256 (0-indexed), while the scores are only generated for 128256 tokens (one token short of the tokenizer vocabulary). The actual error is an index error here:

https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/model_executor/guided_decoding/outlines_logits_processors.py#L82

You have a tensor shaped (128256,), but your allowed tokens may include this last, illegal token (which I don't believe is intended for generation). I worked around it on my side by adding the token to the disallowed tokens, since I host my own model. But there is definitely an inconsistency between the llama config and the actual vocab size, and/or broken behaviour for this special token.

Thanks for this. It worked with the following patch:

allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]]
mask[allowed_tokens] = 0

I think you need to run with --enforce-eager so that CUDA graphs are not captured.

mil-ad commented 1 month ago

I'm having a similar(ish) problem with JSON output. When I add "response_format": {"type": "json_object"} to my request, vLLM crashes with this stack trace:

INFO 10-16 13:24:32 engine.py:288] Added request chat-bc566e62dd194175af7fe161cad40248.
Compiling FSM index for all state transitions: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 10.86it/s]
INFO 10-16 13:24:37 metrics.py:351] Avg prompt throughput: 240.2 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%.
Compiling FSM index for all state transitions: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [00:00<00:00, 12.84it/s]
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [7,0,0], thread: [123,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
CRITICAL 10-16 13:24:39 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO:     100.81.49.94:64400 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-16 13:24:39 engine.py:157] RuntimeError('CUDA error: device-side assert triggered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')
ERROR 10-16 13:24:39 engine.py:157] Traceback (most recent call last):
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 10-16 13:24:39 engine.py:157]     self.run_engine_loop()
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 10-16 13:24:39 engine.py:157]     request_outputs = self.engine_step()
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 10-16 13:24:39 engine.py:157]     raise e
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 10-16 13:24:39 engine.py:157]     return self.engine.step()
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1387, in step
ERROR 10-16 13:24:39 engine.py:157]     outputs = self.model_executor.execute_model(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 82, in execute_model
ERROR 10-16 13:24:39 engine.py:157]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 155, in _driver_execute_model
ERROR 10-16 13:24:39 engine.py:157]     return self.driver_worker.execute_model(execute_model_req)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-16 13:24:39 engine.py:157]     output = self.model_runner.execute_model(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-16 13:24:39 engine.py:157]     return func(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/worker/enc_dec_model_runner.py", line 225, in execute_model
ERROR 10-16 13:24:39 engine.py:157]     output: SamplerOutput = self.model.sample(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/models/mllama.py", line 940, in sample
ERROR 10-16 13:24:39 engine.py:157]     next_tokens = self.sampler(logits, sampling_metadata)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-16 13:24:39 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-16 13:24:39 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 274, in forward
ERROR 10-16 13:24:39 engine.py:157]     maybe_deferred_sample_results, maybe_sampled_tokens_tensor = _sample(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 879, in _sample
ERROR 10-16 13:24:39 engine.py:157]     return _sample_with_torch(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 848, in _sample_with_torch
ERROR 10-16 13:24:39 engine.py:157]     return get_pythonized_sample_results(
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 713, in get_pythonized_sample_results
ERROR 10-16 13:24:39 engine.py:157]     sample_results = _random_sample(seq_groups,
ERROR 10-16 13:24:39 engine.py:157]   File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 512, in _random_sample
ERROR 10-16 13:24:39 engine.py:157]     random_samples = random_samples.cpu()
ERROR 10-16 13:24:39 engine.py:157] RuntimeError: CUDA error: device-side assert triggered
ERROR 10-16 13:24:39 engine.py:157] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 10-16 13:24:39 engine.py:157] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 10-16 13:24:39 engine.py:157] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 10-16 13:24:39 engine.py:157] 
[rank0]:[E1016 13:24:39.286342262 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14b5e66fad10 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14b5e6826f08 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14b5e7a433e6 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x14b5e7a48600 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x14b5e7a4f2ba in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14b5e7a516fc in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14b5e66fad10 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14b5e6826f08 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14b5e7a433e6 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x14b5e7a48600 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x14b5e7a4f2ba in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14b5e7a516fc in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x14b5e76daa84 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)

INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [2860211]

Does this ring any bells?

I'm using vllm-0.6.3.dev155+gf3a507f1.d20241010-cp38-abi3-manylinux1_x86_64.whl btw

tjohnson31415 commented 3 weeks ago

Thanks @pavlo-ruban for finding the root cause and providing a fix! I just created a quick PR with your fix so that we can get this working without manual patching: https://github.com/vllm-project/vllm/pull/9631.