vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: CUDA OOM when serving multiple tasks on the same server #10345

Open reneix opened 1 week ago

reneix commented 1 week ago

Your current environment

vllm 0.6.0
qwen2.5-14b
cuda 12.4

How would you like to use vllm

I would like to serve both the generate and embedding tasks on the same server, but I run into a CUDA OOM error. Can I serve the generate task on the GPU and the embedding task on the CPU? Please advise.
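Here is roughly what I am trying. This is only a sketch: the model names, port-less offline API, and memory settings are placeholders, and I am not sure `device="cpu"` works with the standard CUDA wheel (as I understand it, it needs a CPU build of vLLM). In practice each part would run as its own server process.

```python
# Rough sketch of the intended setup (placeholder model names and values).
from vllm import LLM, SamplingParams

# Generation model on the GPU, leaving some memory headroom for other workloads.
gen_llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    gpu_memory_utilization=0.85,  # lower than the 0.9 default
    max_model_len=8192,           # shorter context -> smaller KV cache
)

# Embedding model, ideally on the CPU so it does not compete for GPU memory.
# Assumption: device="cpu" requires a CPU build of vLLM rather than the CUDA wheel.
emb_llm = LLM(
    model="intfloat/e5-mistral-7b-instruct",  # stand-in embedding model
    device="cpu",
)

outputs = gen_llm.generate(["Hello!"], SamplingParams(max_tokens=32))
embeddings = emb_llm.encode(["Hello!"])
```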


zhur0ng commented 1 week ago

You can try using an INT4-quantized version of the qwen2.5-14b model to reduce GPU memory usage.
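Something along these lines (a sketch: the GPTQ-Int4 checkpoint name is an assumption, substitute whichever INT4 build you use; vLLM should pick up the quantization settings from the checkpoint config):

```python
# Minimal sketch: load an INT4 (GPTQ) build of Qwen2.5-14B instead of the fp16 weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4",  # assumed checkpoint; 4-bit weights are roughly 1/4 of fp16 weight memory
    gpu_memory_utilization=0.85,                  # keep headroom for the embedding workload
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```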

reneix commented 5 days ago

> You can try using an INT4-quantized version of the qwen2.5-14b model to reduce GPU memory usage.

Got it, I will give it a try.