vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: New bug in last few days for phi-3-vision. The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944) #5976

Closed: pseudotensor closed this issue 4 months ago

pseudotensor commented 4 months ago

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

Same launch command as in https://github.com/vllm-project/vllm/issues/5969

The only difference is commit 2cd402e1692417b7645e4ece11bc2ab91072f47c (latest main as of earlier today).

The GPU is completely free, so this is a new bug in vLLM introduced between commits e9de9dd551ac595a9f3825fcd1507deceef4f332 and 2cd402e1692417b7645e4ece11bc2ab91072f47c

INFO 06-28 23:40:03 api_server.py:206] vLLM API server version 0.5.0.post1
INFO 06-28 23:40:03 api_server.py:207] args: Namespace(host='0.0.0.0', port=5063, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, respon>
INFO 06-28 23:40:03 llm_engine.py:164] Initializing an LLM engine (v0.5.0.post1) with config: model='microsoft/Phi-3-vision-128k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-vision-128k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto>
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-28 23:40:04 selector.py:171] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-28 23:40:04 selector.py:53] Using XFormers backend.
INFO 06-28 23:40:04 selector.py:171] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-28 23:40:04 selector.py:53] Using XFormers backend.
INFO 06-28 23:40:05 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 06-28 23:40:06 model_runner.py:220] Loading model weights took 7.7732 GB
/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast>
  warnings.warn(
INFO 06-28 23:40:14 gpu_executor.py:83] # GPU blocks: 3184, # CPU blocks: 682
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/home/ubuntu/vllm/vllm/entrypoints/openai/api_server.py", line 225, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 425, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 359, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 500, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 246, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 342, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/home/ubuntu/vllm/vllm/executor/gpu_executor.py", line 86, in initialize_cache
[rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/worker.py", line 207, in initialize_cache
[rank0]:     raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/worker.py", line 344, in raise_if_cache_size_invalid
[rank0]:     raise ValueError(
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
pseudotensor commented 4 months ago

Basically, something that worked before is now broken. I can't even run Phi-3-vision on an 80GB H100 anymore.

DarkLight1337 commented 4 months ago

Hi, thanks for the report!

Can you try reverting to 96354d6a2967a63eb5c0e56a2da2ead512ff1067 (right before 2061f0b8a7f1a01683c4045096a092eedf6387a4)? I believe #5888 may be causing the issue.

ywang96 commented 4 months ago

Hi @pseudotensor! This is in fact not a bug but a fix for a previous bug in the initial Phi-3 PR: during profiling, the image payload was always None instead of the actual pixel values, so the space available for KV blocks was over-estimated, which could cause an OOM when the server was under maximum load. (The fixed profiling is itself conservative, but we would rather keep it that way for now than leave open the possibility of crashing the server.)
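
For context, the 50944 in the error is simply the number of GPU KV-cache blocks times the block size: 3184 × 16 = 50944 tokens. Below is a rough sketch of the check that produces the error, assuming vLLM's default block size of 16; the variable names are illustrative, not the exact internals of worker.py.

```python
# Rough illustration of the KV-cache capacity check that raises the error above,
# assuming the default block size of 16 tokens per KV-cache block.
block_size = 16
num_gpu_blocks = 3184        # "# GPU blocks: 3184" reported by gpu_executor.py
max_model_len = 131072       # Phi-3-vision's full context length

max_kv_tokens = num_gpu_blocks * block_size  # 3184 * 16 = 50944
if max_model_len > max_kv_tokens:
    raise ValueError(
        f"The model's max seq len ({max_model_len}) is larger than the maximum "
        f"number of tokens that can be stored in KV cache ({max_kv_tokens})."
    )
```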

If you limit --max-num-seqs to a lower number (I've tested on an H100 that it can go up to 17), you should still be able to launch the server with the full context length.
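
As a minimal sketch of the same workaround through the offline Python API (assuming the `LLM` entry point accepts these engine arguments; any vision-specific arguments from the original launch command are omitted here):

```python
from vllm import LLM

# Cap the number of concurrent sequences so that profiling reserves enough
# memory for KV-cache blocks at the full 128k context length.
llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,    # Phi-3-vision ships custom modeling code
    max_model_len=131072,      # keep the full context length
    max_num_seqs=16,           # well below the default; ~17 reportedly fits on an H100
)
```

The equivalent for the OpenAI-compatible server is adding --max-num-seqs 16 to the launch command used above.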

ywang96 commented 4 months ago

I've also opened #5981 to avoid this confusion.

pseudotensor commented 4 months ago

OK, I misunderstood max_num_seqs then. I thought it was a maximum, not a hard requirement. I would have expected the context length to take precedence over the number of sequences, with the number of sequences automatically reduced to accommodate my chosen context length.