CatYing opened this issue 4 months ago
same issue, marked.
Can you share your launch command so I can look?
There are N Workers, each with its own CacheEngine ...
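To make that concrete, here is a minimal sketch of the idea (illustrative only, not the actual cache_engine.py code, and all numbers are made up): with tensor parallelism each worker owns one GPU, sizes its KV-cache shard from whatever is left on that GPU after its share of the weights, and the usable context length is roughly bounded by the smallest per-worker budget.

```python
# Simplified sketch of per-worker KV-cache sizing under tensor parallelism.
# Not the real vLLM CacheEngine -- just the idea; overheads are ignored, so
# the result is an upper bound.
from dataclasses import dataclass

@dataclass
class CacheEngineSketch:
    gpu_mem_bytes: int          # total memory of this worker's GPU
    gpu_mem_utilization: float  # e.g. 0.9 (vLLM's default)
    weight_shard_bytes: int     # this worker's share of the model weights
    num_layers: int
    num_kv_heads: int           # total KV heads of the model
    head_dim: int
    tp_size: int
    dtype_bytes: int = 2        # float16

    def kv_bytes_per_token(self) -> int:
        # Each worker stores only num_kv_heads / tp_size heads,
        # with one key and one value vector per layer.
        heads_per_gpu = self.num_kv_heads // self.tp_size
        return 2 * self.num_layers * heads_per_gpu * self.head_dim * self.dtype_bytes

    def max_cache_tokens(self) -> int:
        # Memory left on this GPU after the weight shard (activations,
        # NCCL buffers, etc. are ignored here).
        budget = int(self.gpu_mem_bytes * self.gpu_mem_utilization) - self.weight_shard_bytes
        return max(budget, 0) // self.kv_bytes_per_token()

# N workers -> N cache engines; the usable context is limited by the
# smallest per-worker budget, not by the sum of all GPU memory.
workers = [CacheEngineSketch(16 * 2**30, 0.9, 7 * 2**30, 32, 32, 128, tp_size=2)
           for _ in range(2)]
print(min(w.max_cache_tokens() for w in workers))
```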
@robertgshaw2-neuralmagic sorry for the late response.
Here is the vllm version used:
For example, we want to deploy a certain LLM, deepseek-coder-6.7b-instruct,
on 2×T4 cards, which have 32 GB of memory in total (2×16 GB):
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --model="/cm/deepseek-coder-6.7b-instruct/" --tensor-parallel-size=2 --trust-remote-code --dtype="float16"
vLLM returns the following error log:
ValueError: The model's max seq len (65536) is larger than the maximum number of tokens that can be stored in KV cache (8112). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
This means that with the KV cache spread over the 2×T4 cards, only 8112 tokens of context are available for the LLM.
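As the error message itself suggests, a common workaround is to lower `max_model_len` so the configured context fits in the available KV cache, or to raise `gpu_memory_utilization`. On the server command line these are `--max-model-len` / `--gpu-memory-utilization`; a sketch with the Python API is below (the value 8000 is just an example chosen to stay under the reported 8112-token limit):

```python
# Possible workaround (illustrative): cap the context length so it fits in the
# reported KV-cache budget.
from vllm import LLM

llm = LLM(
    model="/cm/deepseek-coder-6.7b-instruct/",
    tensor_parallel_size=2,
    dtype="float16",
    trust_remote_code=True,
    max_model_len=8000,             # must be <= the capacity reported in the error
    # gpu_memory_utilization=0.95,  # alternatively, give vLLM more of each GPU
)
```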
But if we deploy the model on a single V100 (32 GB) card:
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.api_server --model="/cm/deepseek-coder-6.7b-instruct/" --tensor-parallel-size=1 --trust-remote-code --dtype="float16"
ValueError: The model's max seq len (65536) is larger than the maximum number of tokens that can be stored in KV cache (22496). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
The available KV cache sizes are very different between 2×T4 and 1×V100, although both setups have 32 GB of memory in total.
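A rough back-of-envelope (assuming the usual LLaMA-style 6.7B shape for deepseek-coder-6.7b-instruct: 32 layers, 32 KV heads, head dim 128, fp16) converts those reported token counts back into memory, which makes the gap concrete:

```python
# Back-of-envelope: how much memory the reported KV-cache capacities imply.
# Model shape is an assumption (LLaMA-style 6.7B): 32 layers, 32 KV heads, head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 32, 128, 2  # fp16

# One key and one value vector per layer, per token, summed over all KV heads.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # 524288 B = 0.5 MiB

for label, tokens in [("2xT4 (tp=2)", 8112), ("1xV100 (tp=1)", 22496)]:
    total_gib = tokens * kv_bytes_per_token / 2**30
    print(f"{label}: {tokens} tokens -> ~{total_gib:.1f} GiB of KV cache")
# 2xT4 (tp=2): 8112 tokens -> ~4.0 GiB of KV cache
# 1xV100 (tp=1): 22496 tokens -> ~11.0 GiB of KV cache
```

With tensor-parallel-size 2 the KV heads are split across the two cards, so that ~4 GiB is roughly 2 GiB per T4; each 16 GB card also has to hold its ~6.7 GB fp16 weight shard plus activation and framework overhead, whereas the single 32 GB V100 ends up with far more contiguous headroom for the cache.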
KVCache usage:

It seems like the KVCache is only allocated on the first device in CUDA_VISIBLE_DEVICES, even when using tensor-parallel-size > 1. Is there any plan to support the full KVCache being allocated across all devices? (see cache_engine.py)