vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: vllm ONLY allocates KVCache on the first device in CUDA_VISIBLE_DEVICES #5248

Open CatYing opened 4 months ago

CatYing commented 4 months ago

KVCache usage

It seems like the KVCache is only allocated on the first device in CUDA_VISIBLE_DEVICES, even when using tensor-parallel-size > 1. Is there any plan to support allocating the full KVCache across all devices?

cache_engine.py

        # Initialize the cache. # line 56
        self.gpu_cache = self._allocate_kv_cache(self.num_gpu_blocks, "cuda")
arlenkkk commented 4 months ago

same issue, marked.

robertgshaw2-neuralmagic commented 4 months ago

Can you share your launch command so I can look?

There are N Workers, each of which has its own CacheEngine...
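To illustrate that point, here is a minimal sketch (not vLLM's actual code; the `worker`/`rank` names, the shapes, and the use of `torch.multiprocessing` are illustrative assumptions) of how each tensor-parallel worker pins itself to its own GPU, so an allocation made with `device="cuda"` inside that worker lands on that worker's GPU rather than on GPU 0:

    import torch
    import torch.multiprocessing as mp

    def worker(rank: int, num_blocks: int, block_numel: int):
        # Each tensor-parallel worker pins itself to its own GPU first, so every
        # later allocation with device="cuda" lands on that GPU, not on GPU 0.
        torch.cuda.set_device(rank)
        # Stand-in for a CacheEngine: allocate this worker's KV-cache shard.
        kv_cache = torch.empty(num_blocks, block_numel,
                               dtype=torch.float16, device="cuda")
        gb = kv_cache.numel() * kv_cache.element_size() / 1e9
        print(f"rank {rank}: allocated {gb:.2f} GB on cuda:{torch.cuda.current_device()}")

    if __name__ == "__main__":
        # One process per visible GPU, mirroring --tensor-parallel-size=2
        # (requires two visible GPUs to run).
        mp.spawn(worker, args=(1024, 16 * 128 * 2 * 32), nprocs=2)

Running this and watching nvidia-smi shows memory allocated on both devices, which is the same pattern the per-worker CacheEngines follow.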

CatYing commented 3 months ago

@robertgshaw2-neuralmagic Sorry for the late response.

Here is the vLLM version used (see the attached screenshot).

For example, we want to deploy the LLM deepseek-coder-6.7b-instruct on 2*T4 cards, which have 32GB of memory in total (2*16GB):

CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --model="/cm/deepseek-coder-6.7b-instruct/" --tensor-parallel-size=2 --trust-remote-code --dtype="float16"

vLLM returns the following error:

ValueError: The model's max seq len (65536) is larger than the maximum number of tokens that can be stored in KV cache (8112). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

This means that with the KVCache spread over the 2*T4 cards, only 8112 tokens of context are available for the LLM.
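For a rough sense of scale, a back-of-the-envelope calculation (assuming a Llama-style config for deepseek-coder-6.7b with 32 layers, hidden size 4096, full multi-head attention, and fp16; these numbers are assumptions, not read from the checkpoint):

    # Approximate KV-cache footprint per token; the model shape below is an
    # assumption about deepseek-coder-6.7b, not taken from its config.
    layers, hidden, dtype_bytes = 32, 4096, 2                # fp16 = 2 bytes
    kv_bytes_per_token = 2 * layers * hidden * dtype_bytes   # K and V
    print(kv_bytes_per_token / 1024)            # ~512 KiB per token
    print(8112 * kv_bytes_per_token / 1024**3)  # ~4 GiB across both T4s

So 8112 tokens corresponds to roughly 4 GiB of KV cache summed over both cards, about 2 GiB per T4 once the weight shard and per-GPU overhead are subtracted.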

But if we deploy the model on a single V100 (32GB) card:

 CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.api_server --model="/cm/deepseek-coder-6.7b-instruct/" --tensor-parallel-size=1 --trust-remote-code --dtype="float16"
ValueError: The model's max seq len (65536) is larger than the maximum number of tokens that can be stored in KV cache (22496). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

The available KVCache sizes are very different between 2*T4 and 1*V100, even though both setups have 32GB of memory in total.
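One hedged explanation for the gap: the KV-cache budget is roughly per-GPU memory times gpu_memory_utilization, minus that GPU's share of the weights, minus a fixed per-GPU overhead (CUDA context, activation workspace from the profiling run, NCCL buffers), and that fixed overhead is paid once per GPU. A sketch with made-up numbers (the 13.5 GB weight size and the 4 GB overhead are assumptions, not measurements of vLLM):

    # Illustrative budget arithmetic only.
    def kv_budget_gb(gpu_mem_gb, n_gpus, weights_gb, util=0.9, overhead_gb=4.0):
        # Weights are sharded across GPUs, but the fixed overhead (CUDA context,
        # activation workspace, NCCL buffers) is charged once per GPU.
        per_gpu = gpu_mem_gb * util - weights_gb / n_gpus - overhead_gb
        return per_gpu * n_gpus   # total KV-cache budget across the group

    print(kv_budget_gb(16, 2, 13.5))  # 2*T4:   ~7.3 GB left for KV cache
    print(kv_budget_gb(32, 1, 13.5))  # 1*V100: ~11.3 GB left for KV cache

With two 16GB cards the fixed overhead is subtracted twice and each card has less headroom, so even though the total memory is the same 32GB, much less of it is left over for the KV cache.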

I