Closed: miridih-jhkim11 closed this issue 3 weeks ago
Can you provide a runnable client-side script? This code is missing a lot of variables.
@heheda12345 The following is the script I am running. If I uncomment "response_format", the server crashes.
import asyncio

import nest_asyncio
from openai import AsyncOpenAI

nest_asyncio.apply()


async def run_batch_requests(img_urls, user_prompt_template):
    client = AsyncOpenAI(
        base_url=[url],
        api_key=[api_key],
    )
    model = "meta-llama/Llama-3.2-11B-Vision-Instruct"

    async def single_request(img_url):
        messages = [
            {"role": "user", "content": [
                {"type": "text", "text": user_prompt_template},
                {"type": "image_url", "image_url": {"url": img_url}}
            ]}
        ]
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000,
                temperature=0.1,
                # response_format=dict(type="json_object")
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Error while processing image {img_url}: {str(e)}"

    tasks = [single_request(url) for url in img_urls]
    results = await asyncio.gather(*tasks)
    return results


# Usage example
img_urls = [
    "https://unsplash.com/photos/8manzosDSGM/download?force=true",
    "https://unsplash.com/photos/yC-Yzbqy7PY/download?force=true",
    "https://unsplash.com/photos/82TpEld0_e4/download?force=true",
    "https://unsplash.com/photos/wawEfYdpkag/download?force=true",
    "https://unsplash.com/photos/xMSxY4WWQkE/download?force=true",
    "https://unsplash.com/photos/hpjSkU2UYSU/download?force=true",
    "https://unsplash.com/photos/TkXJoA_sn1w/download?force=true",
    "https://unsplash.com/photos/q54Oxq44MZs/download?force=true",
    "https://unsplash.com/photos/8mikJ83LmSQ/download?force=true",
    "https://unsplash.com/photos/CAm0Ht0rBMw/download?force=true"
]

user_prompt_template = """
[INSTRUCTIONS]
1. **Describe the given image in detail.**
   - Provide a comprehensive description of the image, including all relevant details such as objects, scenes, actions, colors, textures, and any other notable elements.
   - Be as specific as possible to capture the essence of the image.
2. **Indicate the types and counts of objects appearing in the given image in JSON format.**
   - Use the following schema for the JSON output:
   ```json
   {
       "image_info": "<Detailed description of the image>",
       "object": {
           "object_1": count,
           "object_2": count,
           ...
       }
   }
   ```
   - Replace `<Detailed description of the image>` with the description from step 1.
   - List each object type as a key in the `"object"` dictionary, with the corresponding count as the value.
   - Ensure that object names are clear and consistent (e.g., "tree", "person", "car").

[NOTE]
Example Output:
{
    "image_info": "A serene beach scene at sunset with two palm trees silhouetted against the orange sky, gentle waves lapping at the shore, and a small boat anchored near the horizon.",
    "object": {
        "palm tree": 2,
        "boat": 1,
        "wave": 5
    }
}
"""

results = asyncio.get_event_loop().run_until_complete(
    run_batch_requests(img_urls, user_prompt_template))
results
So the issue is that the Llama-3.2-Vision models have an extra token <|image|> with id 128256 (0-indexed), while the scores are only generated for 128256 tokens (one token short of covering it). The actual error is an index error here: https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/model_executor/guided_decoding/outlines_logits_processors.py#L82
You have a tensor shaped (128256,), but your allowed tokens may include this last, illegal token (I don't believe it is intended for generation).
I went ahead and added it to the disallowed tokens on my side to rectify it, since I host my own model. But there is definitely an inconsistency between the Llama config and the actual vocab size, and/or broken behaviour for this special token.
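To make the off-by-one concrete, here is a minimal standalone sketch (plain PyTorch, not vLLM code; the token ids are illustrative) of why indexing a (128256,)-shaped mask with id 128256 blows up:

```python
import torch

# The logits row covers token ids 0..128255, but the Llama-3.2-Vision tokenizer
# also defines <|image|> with id 128256, one past the end.
scores = torch.randn(128256)
mask = torch.full((scores.shape[-1],), float("-inf"))

allowed_tokens = [11, 198, 128256]  # illustrative FSM output that includes <|image|>

try:
    mask[allowed_tokens] = 0  # IndexError on CPU; on CUDA the same indexing
                              # surfaces as a device-side assert (see the trace below)
except IndexError as e:
    print("out of bounds:", e)
```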
@pavlo-ruban
> You have a tensor shaped (128256,), but your allowed tokens may include this last illegal token (since I don't believe it is intended for generation). I went on and added disallowed tokens on my side to rectify it, hosting my own model. But there's definitely inconsistency between the llama config and the actual vocab size and/or broken behaviour for this special token.
Thanks to your advice, I added these lines of code:
allowed_tokens = [token for token in allowed_tokens if token != 128256]
mask[allowed_tokens] = 0
Is this the correct way to do it? The server doesn't crash anymore, but I couldn't get structured output.
@miridih-jhkim11 I went with allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]]. I ran into a CUDA graph problem when trying to compare against the token like you are doing; something failed around self.cuda_graph.capture_end(). Do you mean you are getting a response, but not a structured one, or getting a structured response with empty values?
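For anyone reading along, here is a hedged sketch of where this filter sits relative to the masking step in outlines_logits_processors.py. The helper below is a standalone stand-in, not the actual vLLM method, and the surrounding code in the real file differs:

```python
import math
import torch

def apply_allowed_token_mask(scores, allowed_tokens):
    # Workaround from this thread: drop any id that the logits row cannot address,
    # e.g. <|image|> (128256) on Llama-3.2-Vision, whose logits cover only 0..128255.
    allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]]

    # Masking step as in the outlines logits processor: every token the FSM does
    # not allow gets -inf added to its logit.
    mask = torch.full((scores.shape[-1],), -math.inf, device=scores.device)
    mask[allowed_tokens] = 0
    return scores.add_(mask)

# Example: a (128256,)-shaped logits row and an FSM state that "allows" 128256.
scores = torch.randn(128256)
masked = apply_allowed_token_mask(scores, [11, 198, 128256])
print(masked.isfinite().sum())  # tensor(2): only ids 11 and 198 remain sampleable
```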
> So the issue is that llama-3.2-vision models have this extra token <|image|> with idx 128256 (0-indexed). The scores are generated for 128256 (1 token short). The actual error is index error here: You have a tensor shaped (128256,), but your allowed tokens may include this last illegal token (since I don't believe it is intended for generation). I went on and added disallowed tokens on my side to rectify it, hosting my own model. But there's definitely inconsistency between the llama config and the actual vocab size and/or broken behaviour for this special token.
Thanks for this, it worked with the following patch:
allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]]
mask[allowed_tokens] = 0
I think you also need to run with --enforce-eager so that CUDA graphs are not compiled.
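If you hit the CUDA graph failure mentioned above, eager mode can be forced with the --enforce-eager flag on the OpenAI-compatible server; for the offline API the equivalent engine argument is enforce_eager. A rough sketch (model and sampling settings are placeholders, and the 11B vision model will need GPU memory settings suited to your hardware):

```python
from vllm import LLM, SamplingParams

# enforce_eager=True disables CUDA graph capture, the offline equivalent of
# passing --enforce-eager to the OpenAI-compatible server.
llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",  # placeholder; adjust to your setup
    enforce_eager=True,
    max_model_len=4096,
)

params = SamplingParams(max_tokens=64, temperature=0.1)
outputs = llm.generate(["Describe a beach at sunset."], params)
print(outputs[0].outputs[0].text)
```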
I'm having a similar(ish) problem with JSON output. When I add "response_format": {"type": "json_object"}
to my request, vLLM crashes with this stack trace:
INFO 10-16 13:24:32 engine.py:288] Added request chat-bc566e62dd194175af7fe161cad40248.
Compiling FSM index for all state transitions: 100%|██████████| 3/3 [00:00<00:00, 10.86it/s]
INFO 10-16 13:24:37 metrics.py:351] Avg prompt throughput: 240.2 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%.
Compiling FSM index for all state transitions: 100%|██████████| 7/7 [00:00<00:00, 12.84it/s]
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [7,0,0], thread: [123,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
CRITICAL 10-16 13:24:39 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO: 100.81.49.94:64400 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-16 13:24:39 engine.py:157] RuntimeError('CUDA error: device-side assert triggered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')
ERROR 10-16 13:24:39 engine.py:157] Traceback (most recent call last):
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 10-16 13:24:39 engine.py:157] self.run_engine_loop()
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 10-16 13:24:39 engine.py:157] request_outputs = self.engine_step()
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 10-16 13:24:39 engine.py:157] raise e
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 10-16 13:24:39 engine.py:157] return self.engine.step()
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1387, in step
ERROR 10-16 13:24:39 engine.py:157] outputs = self.model_executor.execute_model(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 82, in execute_model
ERROR 10-16 13:24:39 engine.py:157] driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 155, in _driver_execute_model
ERROR 10-16 13:24:39 engine.py:157] return self.driver_worker.execute_model(execute_model_req)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-16 13:24:39 engine.py:157] output = self.model_runner.execute_model(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-16 13:24:39 engine.py:157] return func(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/worker/enc_dec_model_runner.py", line 225, in execute_model
ERROR 10-16 13:24:39 engine.py:157] output: SamplerOutput = self.model.sample(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/models/mllama.py", line 940, in sample
ERROR 10-16 13:24:39 engine.py:157] next_tokens = self.sampler(logits, sampling_metadata)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-16 13:24:39 engine.py:157] return self._call_impl(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-16 13:24:39 engine.py:157] return forward_call(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 274, in forward
ERROR 10-16 13:24:39 engine.py:157] maybe_deferred_sample_results, maybe_sampled_tokens_tensor = _sample(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 879, in _sample
ERROR 10-16 13:24:39 engine.py:157] return _sample_with_torch(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 848, in _sample_with_torch
ERROR 10-16 13:24:39 engine.py:157] return get_pythonized_sample_results(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 713, in get_pythonized_sample_results
ERROR 10-16 13:24:39 engine.py:157] sample_results = _random_sample(seq_groups,
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 512, in _random_sample
ERROR 10-16 13:24:39 engine.py:157] random_samples = random_samples.cpu()
ERROR 10-16 13:24:39 engine.py:157] RuntimeError: CUDA error: device-side assert triggered
ERROR 10-16 13:24:39 engine.py:157] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 10-16 13:24:39 engine.py:157] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 10-16 13:24:39 engine.py:157] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 10-16 13:24:39 engine.py:157]
[rank0]:[E1016 13:24:39.286342262 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14b5e66fad10 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14b5e6826f08 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14b5e7a433e6 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x14b5e7a48600 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x14b5e7a4f2ba in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14b5e7a516fc in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14b5e66fad10 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14b5e6826f08 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14b5e7a433e6 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x14b5e7a48600 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x14b5e7a4f2ba in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14b5e7a516fc in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x14b5e76daa84 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [2860211]
Does this ring any bells?
I'm using vllm-0.6.3.dev155+gf3a507f1.d20241010-cp38-abi3-manylinux1_x86_64.whl, by the way.
Thanks @pavlo-ruban for finding the root cause and providing a fix! I just created a quick PR with your fix so that we can get this working without manual patching: https://github.com/vllm-project/vllm/pull/9631.
Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jul 29 2024, 17:02:10) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 550.90.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 1500.000
CPU max MHz: 3529.0520
CPU min MHz: 1500.0000
BogoMIPS: 4899.71
Virtualization: AMD-V
L1d cache: 4 MiB
L1i cache: 4 MiB
L2 cache: 64 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.1
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev28+gb0298aa8
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-63,128-191    0       N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Model Input Dumps

No response
🐛 Describe the bug
I am serving Llama-3.2-11B-Vision-Instruct on a single A100 80GB GPU with the following instruction.
The server crashes whenever I add a response_format or guided_json parameter to my client.chat.completions.create() call.
Inference itself works when structured output is not used; the crash happens when structured output is requested, roughly 40 seconds in.
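For reference, the two request shapes that trigger the crash look roughly like this (endpoint, schema, and prompt are toy placeholders; guided_json goes through extra_body because it is a vLLM extension to the OpenAI API):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint
model = "meta-llama/Llama-3.2-11B-Vision-Instruct"
messages = [{"role": "user", "content": "Describe the image in JSON."}]

# Variant 1: JSON mode via response_format -> crashes the server
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    response_format={"type": "json_object"},
)

# Variant 2: schema-constrained output via vLLM's guided_json extension -> also crashes
schema = {
    "type": "object",
    "properties": {"image_info": {"type": "string"}},
    "required": ["image_info"],
}
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={"guided_json": schema},
)
```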