Closed: CatherineSue closed this issue 2 months ago
@ywang96 Can you share some insight? Does it have something to do with the recent changes in VLM support?
There used to be a bug in the model's memory profiling where it didn't actually pass in images. During inference, this underestimation might have caused OOM.
After the fix, the available block count is reduced significantly, which better reflects the model's true memory usage. Re: your problem, this is expected since the model has a 128k context length. If it can't fit on your GPU, try reducing the context length via `max_model_len` or the sequence count via `max_num_seqs`.
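To make the suggestion concrete, here is a minimal sketch of engine arguments that trade context length for KV-cache headroom. The keyword names mirror vLLM's `LLM(...)` constructor; the specific values (64k context, a single sequence) are illustrative, not prescriptive.

```python
# Hypothetical engine arguments for fitting Phi-3-vision on a smaller GPU.
# All values here are assumptions for illustration.
engine_args = {
    "model": "microsoft/Phi-3-vision-128k-instruct",
    "max_model_len": 65536,          # halve the 128k default context length
    "max_num_seqs": 1,               # cap concurrent sequences during profiling/serving
    "gpu_memory_utilization": 0.9,   # fraction of GPU memory vLLM may claim
    "trust_remote_code": True,
}
# Usage (requires vLLM installed):
#   from vllm import LLM
#   llm = LLM(**engine_args)
```

Lowering `max_model_len` shrinks the single-sequence KV-cache requirement, while lowering `max_num_seqs` shrinks the batch-level requirement; either (or both) can resolve the OOM depending on your GPU.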
thanks for the explanation @DarkLight1337 !
Just for future reference - the bug was discovered and fixed in https://github.com/vllm-project/vllm/pull/5888 and https://github.com/vllm-project/vllm/pull/5214.
We have also updated examples/phi3v_example.py. The current profiling strategy is rather conservative, but improving it is definitely part of the next milestone!
@ywang96 I get the same error even with `max_num_seqs=1`. Is there a way to fix it?
```
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (4544). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
```
As stated in the error message, you may have to decrease `max_model_len` (e.g. to 64k instead of 128k).
@DarkLight1337 Thanks decreasing the max_model_len solved the problem!
Your current environment
Two Docker containers based on images built from vLLM source at commits 3de6e6a3 and 3f3b6b21
🐛 Describe the bug
I passed the same model Phi-3-vision-128k-instruct to each docker container:
For the version that needs `VLMConfig`, here are the parameters:
With the more recent container, based on 3de6e6a3, it raises an error:
But with the container based on 3f3b6b21, it works: