haiasd opened this issue 1 year ago
Can you provide more details on which model you are using and how many GPUs you are using? Any additional details would be helpful. Thank you!
I'm running StarCoder on 2×A10. The command is as follows: `python -m vllm.entrypoints.api_server --model /model/starchat/starcoder-codewovb-wlmhead-mg2hf41 --tensor-parallel-size 2 --gpu-memory-utilization 0.90 --host 0.0.0.0 --port 8081 --max-num-batched-tokens 5120`
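In case it helps with reproducing: here is a minimal client loop to drive a server started that way while watching memory usage. The `/generate` endpoint and payload shape below match the demo `api_server`, but treat them as assumptions for your vLLM version, and adjust host/port to your launch command.

```python
# Hypothetical repro loop for the api_server launched above.
# Assumes the demo /generate endpoint; payload fields are SamplingParams kwargs.
import requests

URL = "http://0.0.0.0:8081/generate"  # host/port from the launch command above

payload = {
    "prompt": "def fibonacci(n):",
    "max_tokens": 128,
}

for i in range(1000):
    resp = requests.post(URL, json=payload, timeout=120)
    resp.raise_for_status()
    if i % 100 == 0:
        # Check `nvidia-smi` between batches; steadily growing usage
        # across iterations (after warmup) suggests the leak.
        print(f"request {i} ok")
```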
Same issue when loading Llama 2 70B models on 4 GPUs.
Same issue when loading Llama 2 70B models on 2 GPUs.
Same issue with Mixtral 8x7B Instruct v0.1 (non-quantized).
We hit the same issue with Mistral 7B: TP=4, GPUs=4×A10, vLLM 0.2.7.
Also hitting the memory leak with tensor_parallel_size: TP=1, GPU=1×A30, vLLM 0.3.3.
Also hitting the memory leak with tensor_parallel_size: TP=2, GPUs=2×V100, vLLM 0.4.2.
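For anyone trying to quantify the growth across these setups, a small sketch that samples per-GPU memory with NVML while the server handles traffic (assumes the `pynvml` / `nvidia-ml-py` package is installed; a monotonically rising `used` value over a long run points at the leak):

```python
# Hypothetical memory sampler; run alongside the server to log per-GPU usage.
import time
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]

try:
    while True:
        used = [
            pynvml.nvmlDeviceGetMemoryInfo(h).used / 1024**2  # bytes -> MiB
            for h in handles
        ]
        print(" | ".join(f"GPU{i}: {m:.0f} MiB" for i, m in enumerate(used)))
        time.sleep(10)
finally:
    pynvml.nvmlShutdown()
```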