Closed: pseudotensor closed this issue 2 months ago.
Tried this:
docker run -d --restart=always \
--runtime=nvidia \
--gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
--shm-size=10.24gb \
-p 5000:5000 \
-e NCCL_IGNORE_DISABLED_P2P=1 \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:$HOME/.cache/ \
-v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
-v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
-v "${HOME}"/.config:$HOME/.config/ -v "${HOME}"/.triton:$HOME/.triton/ \
--network host \
--name pixtral \
vllm/vllm-openai:latest \
--port=5000 \
--host=0.0.0.0 \
--model=mistralai/Pixtral-12B-2409 \
--seed 1234 \
--tensor-parallel-size=1 \
--max-model-len=32768 \
--max-num-batched-tokens=512 \
--gpu-memory-utilization 0.98 \
--enforce-eager \
--tokenizer_mode mistral \
--limit_mm_per_prompt 'image=4' \
--max_num_batched_tokens 32768 \
--max-log-len=100 \
--download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt
that gives:
INFO 09-11 22:59:32 model_runner.py:1008] Loading model weights took 23.6259 GB
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
self.engine = AsyncLLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
engine = cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 338, in __init__
self._initialize_kv_caches()
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 467, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1216, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1543, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 178, in forward
inputs_embeds = merge_multimodal_embeddings(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 117, in merge_multimodal_embeddings
assert (seq_len == N_txt +
^^^^^^^^^^^^^^^^^^
AssertionError: seq_len 33280 should be equal to N_txt + N_img (512, 16384, 32768)
If I don't pass any max model len, it says my KV cache can handle 75552 tokens, but then I try this:
docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
--runtime=nvidia \
--gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
--shm-size=10.24gb \
-p 5000:5000 \
-e NCCL_IGNORE_DISABLED_P2P=1 \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:$HOME/.cache/ \
-v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
-v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
-v "${HOME}"/.config:$HOME/.config/ -v "${HOME}"/.triton:$HOME/.triton/ \
--network host \
--name pixtral \
vllm/vllm-openai:latest \
--port=5000 \
--host=0.0.0.0 \
--model=mistralai/Pixtral-12B-2409 \
--seed 1234 \
--tensor-parallel-size=1 \
--max-num-batched-tokens=512 \
--gpu-memory-utilization 0.98 \
--enforce-eager \
--tokenizer_mode mistral \
--limit_mm_per_prompt 'image=4' \
--max-model-len=75552 \
--max_num_batched_tokens 75552 \
--max-log-len=100 \
--download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt
and I get:
AssertionError: Expected at least 18888 dummy tokens for profiling, but found 16640 tokens instead.
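For what it's worth, those two numbers seem to line up with the profiling arithmetic hinted at by the warning further down; here is a guess at where they come from (a sketch under stated assumptions, not taken from the vLLM source):

# Guess at the profiling arithmetic behind "expected 18888 ... found 16640" (assumptions, not vLLM source).
budget = 75552                                    # both --max-model-len and --max_num_batched_tokens in this run
max_mm_tokens = 4 * 4096                          # 4 images allowed, up to 4096 placeholder tokens each
max_num_seqs = min(256, budget // max_mm_tokens)  # -> 4, same formula as printed in the warning below
expected_dummy = budget // max_num_seqs           # -> 18888 tokens per dummy profiling sequence
found_dummy = 4 * (4096 + 64)                     # -> 16640; the +64 per image is an assumed break/end-token overhead
print(max_num_seqs, expected_dummy, found_dummy)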
Ok so this doesn't crash on startup:
docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
--runtime=nvidia \
--gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
--shm-size=10.24gb \
-p 5001:5001 \
-e NCCL_IGNORE_DISABLED_P2P=1 \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:$HOME/.cache/ \
-v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
-v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
-v "${HOME}"/.config:$HOME/.config/ -v "${HOME}"/.triton:$HOME/.triton/ \
--network host \
--name pixtral \
vllm/vllm-openai:latest \
--port=5001 \
--host=0.0.0.0 \
--model=mistralai/Pixtral-12B-2409 \
--seed 1234 \
--tensor-parallel-size=1 \
--max-num-batched-tokens=512 \
--gpu-memory-utilization 0.98 \
--enforce-eager \
--tokenizer_mode mistral \
--limit_mm_per_prompt 'image=4' \
--max-model-len=75552 \
--max-log-len=100 \
--download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt
but it complains that it can only run 1 sequence at a time:
WARNING 09-11 23:07:25 model_runner.py:1176] Computed max_num_seqs (min(256, 512 // 16384)) to be less than 1. Setting it to the minimum value of 1.
which I'm not sure how to interpret relative to the KV cache estimate.
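For what it's worth, spelling out the arithmetic in that warning (taking the formula it prints at face value):

# Reproduce the computation printed in the warning above.
max_num_batched_tokens = 512        # --max-num-batched-tokens=512
max_mm_tokens_per_seq = 4 * 4096    # 4 images * 4096 placeholder tokens each = 16384
max_num_seqs = min(256, max_num_batched_tokens // max_mm_tokens_per_seq)  # min(256, 0) = 0
max_num_seqs = max(max_num_seqs, 1)  # vLLM clamps it to the minimum value of 1, per the warning
print(max_num_seqs)                  # 1 -> only one sequence can be scheduled at a time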
@pseudotensor Can you try these settings?
--model=mistralai/Pixtral-12B-2409 \
--tensor-parallel-size=1 \
--max-num-batched-tokens=16384 \
--gpu-memory-utilization 0.98 \
--tokenizer-mode mistral \
--limit-mm-per-prompt 'image=4' \
The last version I shared worked, but yeah, I can probably reduce the max model len to get more sequences.
I spoke too soon, now there's a runtime failure when I send an image:
Future exception was never retrieved
future: <Future finished exception=AssertionError('seq_len 512 should be equal to N_txt + N_img (9, 4096, 503)')>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
return_value = task.result()
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
result = task.result()
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
outputs = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 185, in execute_model_async
output = await make_async(self.driver_worker.execute_model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
output = self.model_runner.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1543, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 178, in forward
inputs_embeds = merge_multimodal_embeddings(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 117, in merge_multimodal_embeddings
assert (seq_len == N_txt +
^^^^^^^^^^^^^^^^^^
AssertionError: seq_len 512 should be equal to N_txt + N_img (9, 4096, 503)
Yea - this is exactly why I asked if you could try the command I sent. Basically, we need to ensure each batch is bigger than the image feature size, since images cannot be partially "prefilled".
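In other words, roughly this check has to hold (a sketch of the constraint, not the exact internal code):

# Sketch: an image's placeholder tokens cannot be split across prefill batches,
# so the per-batch token budget must cover a whole image's worth of them.
image_feature_size = 4096            # max placeholder tokens per image for Pixtral
max_num_batched_tokens = 512         # value from the failing command above
assert max_num_batched_tokens >= image_feature_size, \
    "batch too small: an image cannot be partially prefilled"
# fails with 512, passes with the suggested 16384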
Ok. I'll try your version.
I was just trying this version and still hit the same problem, with a slightly different error:
docker run -d --restart=always \
--runtime=nvidia \
--gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
--shm-size=10.24gb \
-p 5001:5001 \
-e NCCL_IGNORE_DISABLED_P2P=1 \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:$HOME/.cache/ \
-v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
-v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
-v "${HOME}"/.config:$HOME/.config/ -v "${HOME}"/.triton:$HOME/.triton/ \
--network host \
--name pixtral \
vllm/vllm-openai:latest \
--port=5001 \
--host=0.0.0.0 \
--model=mistralai/Pixtral-12B-2409 \
--seed 1234 \
--tensor-parallel-size=1 \
--gpu-memory-utilization 0.98 \
--enforce-eager \
--tokenizer_mode mistral \
--limit_mm_per_prompt 'image=4' \
--max-model-len=75552 \
--max-log-len=100 \
--download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt
gives:
AssertionError: seq_len 4096 should be equal to N_txt + N_img (64, 4096, 4032)
If I don't specify the max model len, then I get the KV cache error, since the model otherwise tries to use 128k and that won't fit into 40GB.
Hmm, what if you set --max-num-seqs to 1? Would that fit?
This works for launch and runtime for 1 image at least:
docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
--runtime=nvidia \
--gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
--shm-size=10.24gb \
-p 5001:5001 \
-e NCCL_IGNORE_DISABLED_P2P=1 \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:$HOME/.cache/ \
-v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
-v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
-v "${HOME}"/.config:$HOME/.config/ -v "${HOME}"/.triton:$HOME/.triton/ \
--network host \
--name pixtral \
vllm/vllm-openai:latest \
--port=5001 \
--host=0.0.0.0 \
--model=mistralai/Pixtral-12B-2409 \
--seed 1234 \
--tensor-parallel-size=1 \
--gpu-memory-utilization 0.98 \
--enforce-eager \
--tokenizer_mode mistral \
--limit_mm_per_prompt 'image=4' \
--max-model-len=16384 \
--max-num-batched-tokens=16384 \
--max-log-len=100 \
--download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt
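For reference, the single-image request I'm testing with looks roughly like this (the prompt and image URL here are placeholders, not my actual inputs):

# Minimal single-image request against the OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5001/v1", api_key="EMPTY")  # no --api-key set, so any key works
response = client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/some_image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)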
Do I have to choose special values like 16384? Is the KV-cache-limit value of 75552 not valid? I'll try 32k.
Replacing 16k above with 32k failed with this:
AssertionError: seq_len 33280 should be equal to N_txt + N_img (512, 16384, 32768)
So how do I utilize more tokens, up to the KV cache limit of 75552?
If I scale up to 32k while also increasing the max images to 8, then it seems to work again for launch and runtime (with 1 image so far):
docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
--runtime=nvidia \
--gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
--shm-size=10.24gb \
-p 5001:5001 \
-e NCCL_IGNORE_DISABLED_P2P=1 \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:$HOME/.cache/ \
-v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
-v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
-v "${HOME}"/.config:$HOME/.config/ -v "${HOME}"/.triton:$HOME/.triton/ \
--network host \
--name pixtral \
vllm/vllm-openai:latest \
--port=5001 \
--host=0.0.0.0 \
--model=mistralai/Pixtral-12B-2409 \
--seed 1234 \
--tensor-parallel-size=1 \
--gpu-memory-utilization 0.98 \
--enforce-eager \
--tokenizer_mode mistral \
--limit_mm_per_prompt 'image=8' \
--max-model-len=32768 \
--max-num-batched-tokens=32768 \
--max-log-len=100 \
--download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt
For 64k and 16 images, I get a KV cache issue (9424 being the max) and a failure.
So I guess one has to increase the image limit and token limit together; there's no way to decouple them and (say) have a lot of text tokens and not so many images.
No, still happening. Even with what seems like a stable setup, it fails like this:
File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
output = self.model_runner.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1543, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 178, in forward
inputs_embeds = merge_multimodal_embeddings(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 117, in merge_multimodal_embeddings
assert (seq_len == N_txt +
^^^^^^^^^^^^^^^^^^
AssertionError: seq_len 25254 should be equal to N_txt + N_img (806, 12224, 24448)
We found out this is due to the fact that chunked prefill currently does not work well with VLMs. Chunked prefill was previously turned on by default for long context lengths, but it is now disabled by default as of #8425.
Could you try upgrading to the latest vLLM version? With the new version, you won’t need to specify max_num_batched_tokens, but make sure max_model_len > num_images * 4096.
Did you mean max_model_len >= num_images * 4096?
Basically, in the worst case a request can have the max number of images allowed by the engine (which is specified by --limit-mm-per-prompt), and each image requires the maximum number of placeholder tokens (4096), thus the whole final sequence will have a length of (# of text prompt tokens + # of all image placeholder tokens). If the final sequence is longer than max_model_len, then this prompt will never be accepted by the engine.
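So concretely, the sizing check looks something like this (a sketch; the text budget is just an illustrative number):

# Worst case: a prompt with the full image limit, each image expanding to 4096 placeholder tokens.
images_per_prompt = 8                # --limit_mm_per_prompt 'image=8'
image_placeholder_tokens = 4096      # max placeholder tokens per image
text_budget = 1024                   # hypothetical text-token budget, pick to taste

worst_case_prompt = images_per_prompt * image_placeholder_tokens + text_budget  # 33792
max_model_len = 49152                # value used in the working command below

assert max_model_len >= worst_case_prompt, "a prompt with the max number of images would never be accepted"
print(worst_case_prompt, "<=", max_model_len)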
But it should work now that my max model length is longer than 8*4096, because I had issues with that before. Will try.
So far so good, let's see how it goes. Thanks!
Using this for half of an 80GB H100:
docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
--runtime=nvidia \
--gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
--shm-size=10.24gb \
-p 5001:5001 \
-e NCCL_IGNORE_DISABLED_P2P=1 \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:$HOME/.cache/ \
-v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
-v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
-v "${HOME}"/.config:$HOME/.config/ -v "${HOME}"/.triton:$HOME/.triton/ \
--network host \
--name pixtral \
vllm/vllm-openai:latest \
--port=5001 \
--host=0.0.0.0 \
--model=mistralai/Pixtral-12B-2409 \
--seed 1234 \
--tensor-parallel-size=1 \
--gpu-memory-utilization 0.98 \
--enforce-eager \
--tokenizer_mode mistral \
--limit_mm_per_prompt 'image=8' \
--max-model-len=49152 \
--max-log-len=100 \
--download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt
I'll close for now until I see issues.
Your current environment: H100 40GB