vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Out of Memory w/ multiple models #4678

Closed · yudataguy closed this 1 month ago

yudataguy commented 6 months ago

Your current environment

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 

How would you like to use vllm

I'm running an eval framework that evaluates multiple models. vLLM doesn't seem to free the GPU memory after initializing the 2nd model (even when it's assigned to the same variable name). How can I free up GPU memory between each engine creation, i.e. each llm = LLM(new_model) call?
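A minimal sketch of the pattern described (the model names, prompt, and sampling parameters are placeholders, not taken from the original report):

```python
from vllm import LLM, SamplingParams

# Placeholder model list standing in for the eval framework's models.
models_to_eval = ["facebook/opt-125m", "facebook/opt-350m"]
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

for model_name in models_to_eval:
    # Re-binding the same variable does not guarantee the previous engine's
    # GPU memory is released before the new engine allocates its KV cache,
    # which is where the CUDA OOM above can occur.
    llm = LLM(model=model_name)
    outputs = llm.generate(["Hello, world"], sampling_params)
```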

yudataguy commented 6 months ago

Tried the methods from https://github.com/vllm-project/vllm/issues/1908 with no success.

russellb commented 1 month ago

The LLM engine internal to the LLM class should get destroyed when your LLM instance is garbage collected. You could try forcing that with del llm.
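A minimal sketch of that suggestion, combined with standard Python/PyTorch cleanup calls (gc.collect and torch.cuda.empty_cache are standard APIs; whether this fully releases the engine's memory depends on the vLLM version, and the model names are placeholders):

```python
import gc
import torch
from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # placeholder: first model in the eval loop
# ... run generation with llm ...

# Drop the last reference so the engine can be garbage collected,
# then release cached CUDA blocks back to the allocator.
del llm
gc.collect()
torch.cuda.empty_cache()

llm = LLM(model="facebook/opt-350m")  # placeholder: next model in the eval loop
```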

russellb commented 1 month ago

There's more detailed input on this in #3281.

russellb commented 1 month ago

Going to close this since it's a duplicate of #3281.