yunll opened this issue 1 month ago (status: Open)
Although you can use `--speculative-draft-tensor-parallel-size 4` to configure the TP of the draft model, vLLM currently doesn't support a draft TP other than 1, unfortunately. Besides the weights, vLLM also needs to allocate some GPU memory for the KV cache (for both the target model and the draft model), so the memory consumed is larger than you think. Also, you have set `--gpu_memory_utilization 0.9`. So I recommend you set `gpu_memory_utilization` to 0.95 and try a quantized draft model, such as `Qwen2.5-7B-Instruct-GPTQ-Int4`.
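For reference, a launch command along these lines is what I have in mind. This is only a sketch, assuming a vLLM release that still exposes the older speculative-decoding flags (`--speculative-model`, `--num-speculative-tokens`); newer versions may fold these into `--speculative-config`, so adjust to your installed version:

```bash
# Sketch, not a verified command: flag names assume an older vLLM CLI that accepts
# --speculative-model / --num-speculative-tokens directly.
vllm serve Qwen/Qwen2.5-32B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --num-speculative-tokens 5 \
    --gpu-memory-utilization 0.95
```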
Thank you for the answer.
Will the draft model be deployed on GPU0? And can the draft model be deployed on another GPU, such as GPU5?
Also, even if I set max-model-len to 100, it still results in an out-of-memory error. I believe that in this case the memory is sufficient for the KV cache.
> Will the draft model be deployed on GPU0? And can the draft model be deployed on another GPU, such as GPU5?
Yes, the draft model can currently only be deployed on the GPU with logical id == 0. If you use `CUDA_VISIBLE_DEVICES=1,2,3`, then that would be physical GPU1.
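So if you really want the draft model on, say, physical GPU5, the usual trick is to reorder `CUDA_VISIBLE_DEVICES` so that GPU5 becomes logical GPU0. This is just a sketch under the same flag assumptions as above, not an officially supported placement option:

```bash
# Sketch: listing physical GPU 5 first makes it logical GPU 0 inside vLLM, which is
# where the draft model (and the first target-model TP shard) will be placed.
CUDA_VISIBLE_DEVICES=5,1,2,3 vllm serve Qwen/Qwen2.5-32B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-model Qwen/Qwen2.5-7B-Instruct \
    --gpu-memory-utilization 0.9
```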
> And even if I set max-model-len to 100, it still results in an out-of-memory error. I believe that in this case the memory is sufficient for the KV cache.
The memory required for the KV cache is positively correlated with the value of max-model-len. If 100 still doesn't work in your case before switching to a smaller draft model, then the draft model weights must be the root of your problem.
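As a rough sanity check on that (assuming a bf16 KV cache and the published Qwen2.5-32B config of 64 layers, 8 KV heads, and head dim 128, so please verify against your config.json): each token needs roughly 2 × 64 × 8 × 128 × 2 bytes ≈ 256 KB of target-model KV cache, so max-model-len 100 is only about 25 MB. That is negligible next to the roughly 15 GB of bf16 Qwen2.5-7B draft weights landing on GPU0, which is why quantizing the draft model helps far more than shrinking max-model-len.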
Your current environment
I deploy Qwen2.5-32B-Instruct with speculative_model Qwen2.5-7B-Instruct, and my GPUs are NVIDIA A100-SXM4-40GB; I get an OutOfMemoryError. In my opinion, the 32B model will be deployed across 4 GPUs due to tensor parallelism and will allocate about 16 GB of memory on each. However, I am not sure how the speculative model is deployed. Will it be deployed on GPU0? But GPU0 should still have around 24 GB of memory free, which should be sufficient to deploy a 7B model.
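One hedged back-of-the-envelope check on why this can still run out of memory: with `--gpu_memory_utilization 0.9`, vLLM budgets about 0.9 × 40 ≈ 36 GB per GPU. On GPU0 that budget has to cover the ~16 GB target-model shard, roughly 15 GB of draft weights (assuming unquantized bf16 Qwen2.5-7B, running with TP 1 on logical GPU0), plus activation scratch and the KV caches of both models, so only a few GB remain and the allocation can fail even though ~24 GB looks free after the target shard alone.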