vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Model does not split across multiple GPUs; instead it occupies the same amount of memory on each GPU #10516

Open anilkumar0502 opened 4 days ago

anilkumar0502 commented 4 days ago

Your current environment

The output of `python collect_env.py`:

```text
vLLM version: 0.5.5
NCCL: 2.20.5
GPU: Tesla V100-SXM2-32GB
CUDA version: 12.6
Driver version: 560.28.03
```

Model Input Dumps

No response

🐛 Describe the bug

I have a node with 4 GPUs, each having 32 GB of memory. I’m loading the granite8B-code-instruct model with float16 precision, which requires approximately 16 GB of memory (8B parameters × 2 bytes). Using a tensor parallel size of 2, I plan to split the model across 2 GPUs for more efficient and faster serving.
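For reference, here is a rough back-of-the-envelope estimate of the per-GPU weight footprint I was expecting (illustrative only; it ignores the KV cache, activations, and CUDA context overhead):

```python
# Rough per-GPU estimate for the model weights alone.
# Illustrative only: ignores KV cache, activations, and CUDA context overhead.
params = 8e9             # 8B parameters
bytes_per_param = 2      # float16
tensor_parallel_size = 2

per_gpu_bytes = params * bytes_per_param / tensor_parallel_size
print(f"~{per_gpu_bytes / 1024**3:.1f} GiB of weights per GPU")  # ~7.5 GiB
```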

My expectation is that each GPU would load roughly 8 GB of weights.

But in practice, each GPU is loading about 27 GB.

The nvidia-smi command shows:

```text
GPU  Memory-Usage
0    28333MiB / 32768MiB
1    28222MiB / 32768MiB
```

Command used:

```
vllm serve granite-8B-code-instruct --tensor-parallel-size 2 --dtype float16
```


DarkLight1337 commented 4 days ago

By default, vLLM will use 90% GPU VRAM regardless of model size (the extra memory is used for KV cache). If you want to use less memory, please specify the --gpu-memory-utilization argument.
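For example (the value here is illustrative, not a recommendation), capping vLLM at roughly half of each GPU's VRAM while keeping your tensor-parallel setup would look something like:

```
vllm serve granite-8B-code-instruct \
    --tensor-parallel-size 2 \
    --dtype float16 \
    --gpu-memory-utilization 0.5
```

Note that the value is a fraction of total GPU memory; lowering it also shrinks the KV cache, which reduces how many concurrent sequences/tokens the server can hold.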

EzeLLM commented 2 days ago

> By default, vLLM will use 90% GPU VRAM regardless of model size (the extra memory is used for KV cache). If you want to use less memory, please specify the --gpu-memory-utilization argument.

For me it still gives an out-of-memory error: I'm trying to split a 40 GB model across 4 GPUs, each with 40 GB of VRAM.

DarkLight1337 commented 2 days ago

Can you show the command used and the stack trace?