vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Model does not split across multiple GPUs; instead it occupies the same amount of memory on each GPU #10516

Open anilkumar0502 opened 4 days ago

anilkumar0502 commented 4 days ago

Your current environment

The output of `python collect_env.py`:

```text
vLLM version: 0.5.5
NCCL: 2.20.5
GPU: Tesla V100-SXM2-32GB
CUDA version: 12.6
Driver version: 560.28.03
```

Model Input Dumps

No response

🐛 Describe the bug

I have a node with 4 GPUs, each having 32 GB of memory. I’m loading the granite8B-code-instruct model with float16 precision, which requires approximately 16 GB of memory (8B parameters × 2 bytes). Using a tensor parallel size of 2, I plan to split the model across 2 GPUs for more efficient and faster serving.
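For reference, here is a rough back-of-the-envelope estimate of the per-GPU weight footprint I was expecting (illustrative only; it ignores the KV cache, activations, and CUDA context overhead):

```python
# Rough per-GPU estimate for the model weights alone.
# Illustrative only: ignores KV cache, activations, and CUDA context overhead.
params = 8e9             # 8B parameters
bytes_per_param = 2      # float16
tensor_parallel_size = 2

per_gpu_bytes = params * bytes_per_param / tensor_parallel_size
print(f"~{per_gpu_bytes / 1024**3:.1f} GiB of weights per GPU")  # ~7.5 GiB
```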

My expectation is that each GPU would load roughly 8 GB of weights.

But in practice, each GPU is loading about 27 GB.

The nvidia-smi command shows:

```text
GPU  Memory-Usage
0    28333MiB / 32768MiB
1    28222MiB / 32768MiB
```

Command used:

```
vllm serve granite-8B-code-instruct --tensor-parallel-size 2 --dtype float16
```


DarkLight1337 commented 4 days ago

By default, vLLM will use 90% GPU VRAM regardless of model size (the extra memory is used for KV cache). If you want to use less memory, please specify the --gpu-memory-utilization argument.
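For example (the value here is illustrative, not a recommendation), capping vLLM at roughly half of each GPU's VRAM while keeping your tensor-parallel setup would look something like:

```
vllm serve granite-8B-code-instruct \
    --tensor-parallel-size 2 \
    --dtype float16 \
    --gpu-memory-utilization 0.5
```

Note that the value is a fraction of total GPU memory; lowering it also shrinks the KV cache, which reduces how many concurrent sequences/tokens the server can hold.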

EzeLLM commented 2 days ago

> By default, vLLM will use 90% GPU VRAM regardless of model size (the extra memory is used for KV cache). If you want to use less memory, please specify the --gpu-memory-utilization argument.

For me it still gives an out-of-memory error: I'm trying to split a 40 GB model across 4 GPUs, each with 40 GB of VRAM.

DarkLight1337 commented 2 days ago

Can you show the command used and the stack trace?