vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Pixtral leads to Expected at least 18286 dummy tokens for profiling, but found 16640 tokens instead or seq_len 25254 should be equal to N_txt + N_img (806, 12224, 24448) #8400

Closed: pseudotensor closed this issue 2 months ago

pseudotensor commented 2 months ago

Your current environment

H100 40GB

Model Input Dumps

No response

🐛 Describe the bug

docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
    --shm-size=10.24gb \
    -p 5000:5000 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
    -v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
    -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name pixtral \
    vllm/vllm-openai:latest \
        --port=5000 \
        --host=0.0.0.0 \
        --model=mistralai/Pixtral-12B-2409 \
        --seed 1234 \
        --tensor-parallel-size=1 \
        --max-model-len=128000 \
        --max-num-batched-tokens=512 \
        --gpu-memory-utilization 0.98 \
        --enable_chunked_prefill=True \
        --enable-chunked-prefill=True \
        --enforce-eager \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=4' \
        --max_num_batched_tokens 128000
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt

leads to:

ERROR 09-11 22:55:06 api_server.py:188] RPCServer process died before responding to readiness probe
INFO 09-11 22:55:10 api_server.py:495] vLLM API server version 0.6.1
INFO 09-11 22:55:10 api_server.py:496] args: Namespace(host='0.0.0.0', port=5000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='mistralai/Pixtral-12B-2409', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='mistral', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=128000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=1234, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.98, num_gpu_blocks_override=None, max_num_batched_tokens=128000, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 4}, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 09-11 22:55:10 config.py:1646] Downcasting torch.float32 to torch.float16.
INFO 09-11 22:55:10 api_server.py:162] Multiprocessing frontend to use ipc:///tmp/cdd08172-ad37-48ba-bfeb-9d71b6412105 for RPC Path.
INFO 09-11 22:55:10 api_server.py:178] Started engine process with PID 78
INFO 09-11 22:55:13 config.py:1646] Downcasting torch.float32 to torch.float16.
INFO 09-11 22:55:13 config.py:1006] Chunked prefill is enabled with max_num_batched_tokens=128000.
WARNING 09-11 22:55:13 config.py:383] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 09-11 22:55:13 llm_engine.py:232] Initializing an LLM engine (v0.6.1) with config: model='mistralai/Pixtral-12B-2409', speculative_config=None, tokenizer='mistralai/Pixtral-12B-2409', skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=128000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=mistralai/Pixtral-12B-2409, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
INFO 09-11 22:55:14 model_runner.py:997] Starting to load model mistralai/Pixtral-12B-2409...
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 09-11 22:55:15 weight_utils.py:242] Using model weights format ['*.safetensors']
INFO 09-11 22:55:15 weight_utils.py:287] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:06<00:00,  6.39s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:06<00:00,  6.39s/it]

INFO 09-11 22:55:22 model_runner.py:1008] Loading model weights took 23.6259 GB
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
             ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 338, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 467, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1188, in profile_run
    .dummy_data_for_profiling(self.model_config,
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/inputs/registry.py", line 195, in dummy_data_for_profiling
    assert len(num_tokens) >= seq_len, (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Expected at least 18286 dummy tokens for profiling, but found 16640 tokens instead.
ERROR 09-11 22:55:25 api_server.py:188] RPCServer process died before responding to readiness probe
(base) ubuntu@compute-permanent-node-171:~/vllm$ 


pseudotensor commented 2 months ago

Tried this:

docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
    --shm-size=10.24gb \
    -p 5000:5000 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
    -v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
    -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name pixtral \
    vllm/vllm-openai:latest \
        --port=5000 \
        --host=0.0.0.0 \
        --model=mistralai/Pixtral-12B-2409 \
        --seed 1234 \
        --tensor-parallel-size=1 \
        --max-model-len=32768 \
        --max-num-batched-tokens=512 \
        --gpu-memory-utilization 0.98 \
        --enforce-eager \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=4' \
        --max_num_batched_tokens 32768
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt

that gives:

INFO 09-11 22:59:32 model_runner.py:1008] Loading model weights took 23.6259 GB
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
             ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 338, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 467, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1216, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1543, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 178, in forward
    inputs_embeds = merge_multimodal_embeddings(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 117, in merge_multimodal_embeddings
    assert (seq_len == N_txt +
            ^^^^^^^^^^^^^^^^^^
AssertionError: seq_len 33280 should be equal to N_txt + N_img (512, 16384, 32768)
pseudotensor commented 2 months ago

If I don't pass any max model len, it says my KV cache can handle 75552 tokens, but then I try this:

docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
    --shm-size=10.24gb \
    -p 5000:5000 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
    -v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
    -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name pixtral \
    vllm/vllm-openai:latest \
        --port=5000 \
        --host=0.0.0.0 \
        --model=mistralai/Pixtral-12B-2409 \
        --seed 1234 \
        --tensor-parallel-size=1 \
        --max-num-batched-tokens=512 \
        --gpu-memory-utilization 0.98 \
        --enforce-eager \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=4' \
        --max-model-len=75552 \
        --max_num_batched_tokens 75552
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt

and I get:

AssertionError: Expected at least 18888 dummy tokens for profiling, but found 16640 tokens instead.
pseudotensor commented 2 months ago

Ok so this doesn't crash on startup:

docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
    -v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
    -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name pixtral \
    vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=mistralai/Pixtral-12B-2409 \
        --seed 1234 \
        --tensor-parallel-size=1 \
        --max-num-batched-tokens=512 \
        --gpu-memory-utilization 0.98 \
        --enforce-eager \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=4' \
        --max-model-len=75552 \
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt

but it complains that the computed max_num_seqs comes out to less than 1 and gets clamped to 1:

WARNING 09-11 23:07:25 model_runner.py:1176] Computed max_num_seqs (min(256, 512 // 16384)) to be less than 1. Setting it to the minimum value of 1.

which I'm not sure how to interpret vs. the kv cache estimate.
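
For reference, that warning looks like straightforward integer arithmetic over the engine arguments. A minimal sketch (assuming, as the log suggests, a 16384-token worst-case multimodal request, i.e. 4 images at 4096 placeholder tokens each; this is not vLLM's actual code):

# Rough reconstruction of the warning's arithmetic (an assumption, not vLLM's code)
DEFAULT_MAX_NUM_SEQS = 256
max_num_batched_tokens = 512        # from --max-num-batched-tokens=512
worst_case_mm_tokens = 4 * 4096     # assumed: 4 images x 4096 placeholder tokens each

computed = min(DEFAULT_MAX_NUM_SEQS, max_num_batched_tokens // worst_case_mm_tokens)
print(computed)                     # 0, i.e. "less than 1"
max_num_seqs = max(computed, 1)     # clamped to the minimum value of 1
print(max_num_seqs)                 # 1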

ywang96 commented 2 months ago

@pseudotensor Can you try these settings?

        --model=mistralai/Pixtral-12B-2409 \
        --tensor-parallel-size=1 \
        --max-num-batched-tokens=16384 \
        --gpu-memory-utilization 0.98 \
        --tokenizer-mode mistral \
        --limit-mm-per-prompt 'image=4' \
pseudotensor commented 2 months ago

The last version I shared worked, but yes, I can probably reduce the max model len to get more sequences.

pseudotensor commented 2 months ago

I spoke too soon, now there's a runtime failure when I send an image:

Future exception was never retrieved
future: <Future finished exception=AssertionError('seq_len 512 should be equal to N_txt + N_img (9, 4096, 503)')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 111, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
    outputs = await self.model_executor.execute_model_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 185, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1543, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 178, in forward
    inputs_embeds = merge_multimodal_embeddings(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 117, in merge_multimodal_embeddings
    assert (seq_len == N_txt +
            ^^^^^^^^^^^^^^^^^^
AssertionError: seq_len 512 should be equal to N_txt + N_img (9, 4096, 503)
ywang96 commented 2 months ago

I spoke too soon, now there's a runtime failure when I send an image:

Future exception was never retrieved
future: <Future finished exception=AssertionError('seq_len 512 should be equal to N_txt + N_img (9, 4096, 503)')>

Yeah, this is exactly why I asked if you could try the command I sent. Basically, we need to ensure each batch is bigger than the image feature size, since images cannot be partially "prefilled".
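
To restate that constraint as a quick check (a hedged sketch, not vLLM's code; the 4096 per-image placeholder count is taken from the assertion messages above, and the helper name is made up):

MAX_IMAGE_PLACEHOLDER_TOKENS = 4096   # per-image worst case, per the errors above

def batch_budget_is_safe(max_num_batched_tokens: int, images_per_prompt: int) -> bool:
    # An image's placeholder tokens cannot be split across prefill chunks,
    # so one batch must be able to hold the largest multimodal prompt whole.
    return max_num_batched_tokens >= images_per_prompt * MAX_IMAGE_PLACEHOLDER_TOKENS

print(batch_budget_is_safe(512, 4))     # False -> the seq_len assertion above trips
print(batch_budget_is_safe(16384, 4))   # True  -> matches the suggested settings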

pseudotensor commented 2 months ago

Ok. I'll try your version.

I was just trying this version and got the same problem with a slightly different error:

docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
    -v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
    -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name pixtral \
    vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=mistralai/Pixtral-12B-2409 \
        --seed 1234 \
        --tensor-parallel-size=1 \
        --gpu-memory-utilization 0.98 \
        --enforce-eager \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=4' \
        --max-model-len=75552 \
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt

gives:

    | AssertionError: seq_len 4096 should be equal to N_txt + N_img (64, 4096, 4032)
pseudotensor commented 2 months ago

If I don't specify the max model len, then I get the KV cache error, since the model otherwise tries to do 128k, which won't fit into 40GB.

ywang96 commented 2 months ago

If I don't specify the max model len, then I get the KV cache error, since the model otherwise tries to do 128k, which won't fit into 40GB.

Hmm what if you set --max-num-seqs to 1? Would that fit?

pseudotensor commented 2 months ago

This works for launch and runtime for 1 image at least:

docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
    -v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
    -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name pixtral \
    vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=mistralai/Pixtral-12B-2409 \
        --seed 1234 \
        --tensor-parallel-size=1 \
        --gpu-memory-utilization 0.98 \
        --enforce-eager \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=4' \
        --max-model-len=16384 \
        --max-num-batched-tokens=16384 \
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt

Do I have to choose special values like 16384? Is the KV-cache limit value of 75552 not valid? I'll try 32k.

pseudotensor commented 2 months ago

Replacing 16k above with 32k failed with this:

AssertionError: seq_len 33280 should be equal to N_txt + N_img (512, 16384, 32768)
pseudotensor commented 2 months ago

So how do I utilize more tokens up to the kv cache limit of 75552?

pseudotensor commented 2 months ago

If I scale up to 32k while increasing the max images to 8, it seems to work again for launch and runtime (with 1 image so far):

docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
    -v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
    -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name pixtral \
    vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=mistralai/Pixtral-12B-2409 \
        --seed 1234 \
        --tensor-parallel-size=1 \
        --gpu-memory-utilization 0.98 \
        --enforce-eager \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=8' \
        --max-model-len=32768 \
        --max-num-batched-tokens=32768 \
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt
pseudotensor commented 2 months ago

For 64k and 16 images, I get a KV cache issue (9424 being the max) and a failure.

So I guess one has to increase the image limit and the token limit together; there's no way to decouple them and (say) allow a lot of text tokens but not so many images.

pseudotensor commented 2 months ago

No, it's still happening. Even with what seems like a stable setup, it fails like this:

  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1543, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 178, in forward
    inputs_embeds = merge_multimodal_embeddings(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 117, in merge_multimodal_embeddings
    assert (seq_len == N_txt +
            ^^^^^^^^^^^^^^^^^^
AssertionError: seq_len 25254 should be equal to N_txt + N_img (806, 12224, 24448)
ywang96 commented 2 months ago

We found out this is due to the fact that chunked prefill currently does not work well with VLMs. Chunked prefill was previously turned on by default for long context lengths, but it is now disabled by default as of #8425.

Could you try upgrading to the latest vLLM version? With the new version, you won’t need to specify max_num_batched_tokens, but make sure max_model_len > num_images * 4096.

pseudotensor commented 2 months ago

Did you mean max_model_len >= num_images * 4096 ?

ywang96 commented 2 months ago

Did you mean max_model_len >= num_images * 4096 ?

Basically, in the worst case a request can have the max number of images allowed by the engine (which is specified by --limit-mm-per-prompt), and each image requires the maximum number of placeholder tokens (4096), so the whole final sequence will have a length of (# of text prompt tokens + # of all image placeholder tokens). If the final sequence is longer than max_model_len, then this prompt will never be accepted by the engine.
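
Put differently, the acceptance rule amounts to something like the following sketch (a hypothetical helper for illustration only; 4096 is the worst-case per-image placeholder count, and the example numbers are taken from earlier in this thread):

MAX_IMAGE_PLACEHOLDER_TOKENS = 4096   # worst-case placeholder tokens per image

def prompt_fits(num_text_tokens: int, num_images: int, max_model_len: int) -> bool:
    # Worst case: every image expands to a full placeholder block, so the final
    # sequence is the text tokens plus 4096 tokens per image. Anything longer
    # than max_model_len will never be accepted by the engine.
    worst_case = num_text_tokens + num_images * MAX_IMAGE_PLACEHOLDER_TOKENS
    return worst_case <= max_model_len

# e.g. 806 text tokens (from an error above), 8 images, --max-model-len=49152
print(prompt_fits(806, 8, 49152))     # True: 806 + 8 * 4096 = 33574 <= 49152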

pseudotensor commented 2 months ago

But it should work now that I have a longer max model length than 8*4096, because I had issues with that before. Will try.

pseudotensor commented 2 months ago

So far so good, let's see how it goes. Thanks!

Using this for half of an 80GB H100:

docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
    -v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
    -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name pixtral \
    vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=mistralai/Pixtral-12B-2409 \
        --seed 1234 \
        --tensor-parallel-size=1 \
        --gpu-memory-utilization 0.98 \
        --enforce-eager \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=8' \
        --max-model-len=49152 \
        --max-log-len=100 \
pseudotensor commented 2 months ago

I'll close this for now unless I see issues.