vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Memory leak while using tensor_parallel_size>1 #694

Open haiasd opened 1 year ago

haiasd commented 1 year ago

[screenshot attachment]

zhuohan123 commented 1 year ago

Can you provide more details on which model you are using and how many GPUs? Any additional details would be helpful. Thank you!

haiasd commented 1 year ago

I'm running StarCoder on 2×A10. The command is as follows: python -m vllm.entrypoints.api_server --model /model/starchat/starcoder-codewovb-wlmhead-mg2hf41 --tensor-parallel-size 2 --gpu-memory-utilization 0.90 --host 0.0.0.0 --port 8081 --max-num-batched-tokens 5120
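
For anyone else trying to reproduce this, below is a minimal monitoring sketch (not part of the original report; the server PID and poll interval are placeholders you would substitute) that logs the api_server's host RSS via psutil and GPU memory via nvidia-smi, so you can see whether memory keeps growing across requests:

```python
import subprocess
import time

import psutil

SERVER_PID = 12345   # placeholder: PID of the vllm.entrypoints.api_server process
POLL_SECONDS = 30    # placeholder: how often to sample

proc = psutil.Process(SERVER_PID)

while True:
    # Host-side resident memory of the server process, in MiB.
    rss_mib = proc.memory_info().rss / (1024 ** 2)

    # Per-GPU used memory in MiB, as reported by nvidia-smi.
    gpu_used = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    ).split()

    print(f"{time.strftime('%H:%M:%S')} rss={rss_mib:.0f} MiB gpu_used={gpu_used} MiB")
    time.sleep(POLL_SECONDS)
```

Note that with tensor_parallel_size > 1 the engine spawns additional worker processes, so summing RSS over the server's child processes (e.g. via psutil's children(recursive=True)) may give a fuller picture of host memory growth.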

wonderseen commented 10 months ago

Same issue when loading Llama 2 70B on 4 GPUs.

ChristineSeven commented 8 months ago

Same issue when loading Llama 2 70B on 2 GPUs.

wangcho2k commented 7 months ago

Same issue with Mixtral 8x7B Instruct v0.1 (non-quantized).

PeterWang1986 commented 7 months ago

We hit the same issue with Mistral 7B: TP=4, GPUs = 4×A10, vllm 0.2.7.

[screenshot: memory usage]

austingg commented 3 months ago

Also seeing a memory leak here. TP=1, GPU = 1×A30, vllm 0.3.3.

yarinlaniado commented 3 months ago

Also seeing a memory leak with tensor_parallel_size. TP=2, GPUs = 2×V100, vllm 0.4.2.