vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: CUDA OOM when serving multiple tasks on the same server #10345

Open reneix opened 1 week ago

reneix commented 1 week ago

Your current environment

vllm 0.6.0
qwen2.5-14b
cuda 12.4

How would you like to use vllm

I would like to serve both the generate and embedding tasks on the same server, but I run into a CUDA OOM error. Can I serve the generate task on the GPU and the embedding task on the CPU? Please advise.
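Here is roughly what I am trying. This is only a sketch: the model names, port-less offline API, and memory settings are placeholders, and I am not sure `device="cpu"` works with the standard CUDA wheel (as I understand it, it needs a CPU build of vLLM). In practice each part would run as its own server process.

```python
# Rough sketch of the intended setup (placeholder model names and values).
from vllm import LLM, SamplingParams

# Generation model on the GPU, leaving some memory headroom for other workloads.
gen_llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    gpu_memory_utilization=0.85,  # lower than the 0.9 default
    max_model_len=8192,           # shorter context -> smaller KV cache
)

# Embedding model, ideally on the CPU so it does not compete for GPU memory.
# Assumption: device="cpu" requires a CPU build of vLLM rather than the CUDA wheel.
emb_llm = LLM(
    model="intfloat/e5-mistral-7b-instruct",  # stand-in embedding model
    device="cpu",
)

outputs = gen_llm.generate(["Hello!"], SamplingParams(max_tokens=32))
embeddings = emb_llm.encode(["Hello!"])
```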


zhur0ng commented 1 week ago

You can try using an INT4-quantized version of the qwen2.5-14b model to reduce GPU memory usage.
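Something along these lines (a sketch: the GPTQ-Int4 checkpoint name is an assumption, substitute whichever INT4 build you use; vLLM should pick up the quantization settings from the checkpoint config):

```python
# Minimal sketch: load an INT4 (GPTQ) build of Qwen2.5-14B instead of the fp16 weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4",  # assumed checkpoint; 4-bit weights are roughly 1/4 of fp16 weight memory
    gpu_memory_utilization=0.85,                  # keep headroom for the embedding workload
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```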

reneix commented 5 days ago

> You can try using an INT4-quantized version of the qwen2.5-14b model to reduce GPU memory usage.

Got it, I will give it a try.