vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Memory leak while using tensor_parallel_size>1 #694

Open haiasd opened 1 year ago

haiasd commented 1 year ago

[screenshot attachment]

zhuohan123 commented 1 year ago

Can you provide more details on which model you are using and how many GPUs? Any additional details would be helpful. Thank you!

haiasd commented 1 year ago

I'm running StarCoder on 2×A10. The command is as follows: python -m vllm.entrypoints.api_server --model /model/starchat/starcoder-codewovb-wlmhead-mg2hf41 --tensor-parallel-size 2 --gpu-memory-utilization 0.90 --host 0.0.0.0 --port 8081 --max-num-batched-tokens 5120
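
For anyone else trying to reproduce this, below is a minimal monitoring sketch (not part of the original report; the server PID and poll interval are placeholders you would substitute) that logs the api_server's host RSS via psutil and GPU memory via nvidia-smi, so you can see whether memory keeps growing across requests:

```python
import subprocess
import time

import psutil

SERVER_PID = 12345   # placeholder: PID of the vllm.entrypoints.api_server process
POLL_SECONDS = 30    # placeholder: how often to sample

proc = psutil.Process(SERVER_PID)

while True:
    # Host-side resident memory of the server process, in MiB.
    rss_mib = proc.memory_info().rss / (1024 ** 2)

    # Per-GPU used memory in MiB, as reported by nvidia-smi.
    gpu_used = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    ).split()

    print(f"{time.strftime('%H:%M:%S')} rss={rss_mib:.0f} MiB gpu_used={gpu_used} MiB")
    time.sleep(POLL_SECONDS)
```

Note that with tensor_parallel_size > 1 the engine spawns additional worker processes, so summing RSS over the server's child processes (e.g. via psutil's children(recursive=True)) may give a fuller picture of host memory growth.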

wonderseen commented 10 months ago

Same issue when loading Llama 2 70B on 4 GPUs.

ChristineSeven commented 8 months ago

Same issue when loading Llama 2 70B on 2 GPUs.

wangcho2k commented 7 months ago

Same issue with Mixtral 8x7B Instruct v0.1 (non-quantized).

PeterWang1986 commented 7 months ago

We hit the same issue with Mistral 7B: TP=4, GPUs = 4×A10, vllm 0.2.7.

[screenshot: memory usage]

austingg commented 3 months ago

Also seeing a memory leak here. TP=1, GPU = 1×A30, vllm 0.3.3.

yarinlaniado commented 3 months ago

Also seeing a memory leak with tensor_parallel_size. TP=2, GPUs = 2×V100, vllm 0.4.2.