vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: speculative OutOfMemoryError: #8731

Open yunll opened 1 month ago

yunll commented 1 month ago

Your current environment

I deployed Qwen2.5-32B-Instruct with Qwen2.5-7B-Instruct as the speculative model on NVIDIA A100-SXM4-40GB GPUs and got an OutOfMemoryError. My understanding is that with tensor parallelism the 32B model is sharded across 4 GPUs, so each GPU holds about 16 GB of its weights. However, I am not sure how the speculative model is deployed. Will it be placed on GPU0? Even then, GPU0 should still have around 24 GB of memory left, which should be enough for a 7B model.

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 9000 --model Qwen2.5-32B-Instruct -tp 4 --max-model-len 1000 --speculative_model Qwen2.5-7B-Instruct --use-v2-block-manager --num_speculative_tokens 5  --speculative-max-model-len 1000 --gpu_memory_utilization 0.9
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 60.81 MiB is free. Process 3179876 has 1.62 GiB memory in use. Process 3179877 has 1.62 GiB memory in use. Including non-PyTorch memory, this process has 34.61 GiB memory in use. Process 3179875 has 1.62 GiB memory in use. Of the allocated memory 28.75 GiB is allocated by PyTorch, with 28.60 MiB allocated in private pools (e.g., CUDA Graphs), and 384.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


ShangmingCai commented 1 month ago

Although you can pass --speculative-draft-tensor-parallel-size 4 to configure the draft model's tensor parallelism, vllm unfortunately does not currently support a draft TP other than 1. Besides the weights, vllm also needs to allocate GPU memory for the KV cache (for both the target model and the draft model), so the total memory consumed is larger than you expected. You have also set --gpu_memory_utilization 0.9. So I recommend setting gpu_memory_utilization to 0.95 and trying a quantized draft model, such as Qwen2.5-7B-Instruct-GPTQ-Int4.
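To make the arithmetic concrete, here is a rough back-of-envelope sketch of the memory picture on GPU0 (assuming bf16 weights; these are approximations, not measurements):

```python
# Back-of-envelope memory budget for GPU0 (A100 40 GB), assuming bf16 weights (~2 bytes/param).
GPU_MEM_GB = 40
BUDGET_GB = 0.9 * GPU_MEM_GB                 # --gpu_memory_utilization 0.9 -> ~36 GB usable

target_weights_gb = 32 * 2                   # ~64 GB for Qwen2.5-32B-Instruct
target_per_gpu_gb = target_weights_gb / 4    # ~16 GB per GPU with -tp 4

draft_weights_gb = 7 * 2                     # ~14 GB for Qwen2.5-7B-Instruct; since draft TP
                                             # is 1, it all sits on the rank-0 GPU

weights_on_gpu0 = target_per_gpu_gb + draft_weights_gb   # ~30 GB of weights alone
print(f"weights on GPU0: ~{weights_on_gpu0:.0f} GB of {BUDGET_GB:.0f} GB budget")
# The remaining ~6 GB must cover activations, CUDA graphs, NCCL buffers and the
# KV cache for both models, which is why the run tips over into OOM.
```

With the suggestions above, the launch command would look roughly like this (using the quantized draft model suggested above; adjust the name to whatever checkpoint you actually use):

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 9000 --model Qwen2.5-32B-Instruct -tp 4 --max-model-len 1000 --speculative_model Qwen2.5-7B-Instruct-GPTQ-Int4 --use-v2-block-manager --num_speculative_tokens 5 --speculative-max-model-len 1000 --gpu_memory_utilization 0.95
```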

yunll commented 1 month ago


Thank you for the answer.

Will the draft model be deployed on GPU0? And can the draft model be deployed on another GPU, such as GPU5?

Also, even if I set max-model-len to 100, it still results in an out-of-memory error. I believe that in this case the memory should be sufficient for the KV cache.

ShangmingCai commented 1 month ago

> Will the draft model be deployed on GPU0? And can the draft model be deployed on another GPU, such as GPU5?

Yes, currently the draft model can only be deployed on the GPU with logical id 0. If you use CUDA_VISIBLE_DEVICES=1,2,3, then that would be physical GPU1.
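If you want logical GPU0 (and therefore the draft model) to land on a different physical card, one option, not specific to vllm, is to reorder CUDA_VISIBLE_DEVICES: CUDA numbers the visible devices in the order they are listed, so putting the card you want first makes it logical GPU0. For example (assuming a physical GPU5 exists on the node, and otherwise keeping your original flags):

```
CUDA_VISIBLE_DEVICES=5,1,2,3 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 9000 --model Qwen2.5-32B-Instruct -tp 4 --max-model-len 1000 --speculative_model Qwen2.5-7B-Instruct --use-v2-block-manager --num_speculative_tokens 5 --speculative-max-model-len 1000 --gpu_memory_utilization 0.9
```

Note that logical GPU0 still hosts both a 32B tensor-parallel shard and the full draft model, so this only moves the pressure to another card rather than reducing it.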

> Also, even if I set max-model-len to 100, it still results in an out-of-memory error. I believe that in this case the memory should be sufficient for the KV cache.

The memory required for the KV cache is positively correlated with max-model-len. If 100 still does not work before you switch to a smaller draft model, then the draft model weights must be the root of your problem.
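For intuition, here is a minimal sketch of the usual per-token KV cache estimate (one K and one V tensor per layer). The Qwen2.5 layer/head counts below are assumptions taken from the public configs and should be double-checked against each model's config.json:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # One K and one V tensor of shape [num_kv_heads, head_dim] per layer, bf16/fp16 by default.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed config values for illustration only:
target = kv_cache_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)   # Qwen2.5-32B
draft  = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)   # Qwen2.5-7B

for name, per_tok in [("32B target", target), ("7B draft", draft)]:
    print(f"{name}: {per_tok / 1024:.0f} KiB/token, "
          f"~{per_tok * 1000 / 2**20:.0f} MiB at max-model-len=1000 (one sequence)")
# With -tp 4 the target's KV heads are sharded, so its per-GPU share is ~1/4 of the figure above.
```

Even at max-model-len=1000 this comes to only a few hundred MiB per sequence, which is consistent with the weights, not the KV cache, being what pushes GPU0 over the edge.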