vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Number of available GPU blocks drop significantly for Phi3-vision #6124

Closed: CatherineSue closed this issue 2 months ago

CatherineSue commented 2 months ago

Your current environment

Two Docker containers based on images built from vLLM source commits 3de6e6a3 and 3f3b6b21

🐛 Describe the bug

I passed the same model, Phi-3-vision-128k-instruct, to each Docker container:

--tensor-parallel-size=1 \
--model=/models/Phi-3-vision-128k-instruct \

For the version that needs VLMConfig, here are the additional parameters:

--image-input-type="pixel_values" \
--image-feature-size=1921 \
--image-token-id=32044 \
--image-input-shape="1, 3, 1008, 1344" 
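
Putting it together, the full launch on the build that still needs VLMConfig looked roughly like this (a sketch; it assumes the OpenAI-compatible api_server entrypoint and --trust-remote-code, which Phi-3-vision generally requires):

python -m vllm.entrypoints.openai.api_server \
    --model=/models/Phi-3-vision-128k-instruct \
    --tensor-parallel-size=1 \
    --trust-remote-code \
    --image-input-type="pixel_values" \
    --image-feature-size=1921 \
    --image-token-id=32044 \
    --image-input-shape="1, 3, 1008, 1344"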

With the container based on the more recent commit 3de6e6a3, it raises an error:

INFO 07-04 01:04:14 gpu_executor.py:84] # GPU blocks: 5970, # CPU blocks: 682
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (95520). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

But the container based on 3f3b6b21 starts up normally:

INFO 07-04 01:40:03 gpu_executor.py:83] # GPU blocks: 8825, # CPU blocks: 682
INFO 07-04 01:40:05 model_runner.py:906] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
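
(For reference: with vLLM's default KV-cache block size of 16 tokens, 5970 GPU blocks hold 5970 × 16 = 95520 tokens, exactly the capacity quoted in the error, while 8825 blocks hold 8825 × 16 = 141200 tokens, which is enough for the 131072-token max seq len.)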
CatherineSue commented 2 months ago

@ywang96 Can you share some insight? Does it have something to do with the recent changes in VLM support?

DarkLight1337 commented 2 months ago

There used to be a bug in the model's memory profiling where it didn't actually pass in images, so memory usage was underestimated and could cause OOM during inference.

After the fix, the available block count drops significantly, which better reflects the model's true memory usage. Re: your problem, this is expected since the model has a 128k context length. If it can't fit on your GPU, try reducing the context length via max_model_len or the sequence count via max_num_seqs.
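
For example, something along these lines on the server CLI (a rough sketch; pick values that actually fit on your GPU):

--max-model-len=65536 \
--max-num-seqs=16 \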

CatherineSue commented 2 months ago

thanks for the explanation @DarkLight1337 !

ywang96 commented 2 months ago

Just for future reference - the bug was discovered and fixed in https://github.com/vllm-project/vllm/pull/5888 and https://github.com/vllm-project/vllm/pull/5214.

We have also updated examples/phi3v_example.py. The current profiling strategy is rather conservative, but improving it is definitely part of the next milestone!

2U1 commented 2 months ago

@ywang96 I get the same error even with max_num_seqs=1.

Is there some way to fix it?

ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (4544). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

DarkLight1337 commented 2 months ago

As stated in the error message, you may have to decrease max_model_len (e.g. 64k instead of 128k).
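
Roughly, on the CLI that would be something like the following (a sketch; 65536 is just one choice for "64k", and --gpu-memory-utilization can also be raised if there is headroom):

--max-model-len=65536 \
--gpu-memory-utilization=0.95 \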

2U1 commented 2 months ago

@DarkLight1337 Thanks, decreasing max_model_len solved the problem!