Open anilkumar0502 opened 4 days ago
By default, vLLM will use 90% of GPU VRAM regardless of model size (the extra memory is used for the KV cache). If you want to use less memory, please specify the `--gpu-memory-utilization` argument.
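For example (a minimal sketch: the model name is a placeholder and 0.5 is only an illustrative value to tune for your model and workload):

```bash
# Cap vLLM's memory preallocation at roughly 50% of each GPU's VRAM
# instead of the default 90%. Replace <model> with your model name.
vllm serve <model> --gpu-memory-utilization 0.5
```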
For me, it still gives a memory error. I am trying to split a 40 GB model across 4 GPUs, each with 40 GB of VRAM.
Can you show the command used and the stack trace?
Your current environment
The output of `python collect_env.py`
```text
vLLM version: 0.5.5
NCCL: 2.20.5
GPU: Tesla V100-SXM2-32GB
CUDA version: 12.6
Driver version: 560.28.03
```
Model Input Dumps
No response
🐛 Describe the bug
I have a node with 4 GPUs, each with 32 GB of memory. I'm loading the granite-8B-code-instruct model with float16 precision, whose weights require approximately 16 GB of memory (8B parameters × 2 bytes). Using a tensor parallel size of 2, I plan to split the model across 2 GPUs for more efficient and faster serving.
My expectation is that each GPU should load about 8 GB, but in practice each GPU is using about 27 GB.
The nvidia-smi command shows:

| GPU | Memory-Usage |
| --- | --- |
| 0 | 28333MiB / 32768MiB |
| 1 | 28222MiB / 32768MiB |
Code:
vllm serve granite-8B-code-instruct --tensor-parallel-size 2 --dtype float16
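For comparison, a sketch of the same command with the memory-related flags mentioned above (the 0.6 and 8192 values are only illustrative and would need tuning):

```bash
# Limit preallocation to ~60% of each 32 GB V100 instead of the default 90%,
# and optionally cap the context length to shrink the KV cache that must fit.
vllm serve granite-8B-code-instruct \
  --tensor-parallel-size 2 \
  --dtype float16 \
  --gpu-memory-utilization 0.6 \
  --max-model-len 8192
```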