vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

unload the model #3281

Open osafaimal opened 5 months ago

osafaimal commented 5 months ago

Hi, I'm sorry, I can't find how to unload a model. I load a model, delete the object, and call the garbage collector, but it does nothing. How are we supposed to unload a model? I want to load a model, run a batch, then load another model and run a batch, and so on for multiple models in order to compare them. But for now I have to restart Python each time.

hmellor commented 5 months ago

Try calling torch.cuda.empty_cache() after you delete the LLM object

chenxu2048 commented 5 months ago

You can also call gc.collect() to collect garbage objects immediately after you delete them.
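
Combining the two suggestions above, a minimal sketch (assuming llm is the vllm.LLM instance you want to unload):

import gc

import torch

# Drop the last Python reference to the engine, force a collection so the
# tensors actually become unreachable, then release the cached CUDA blocks.
del llm
gc.collect()
torch.cuda.empty_cache()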

osafaimal commented 5 months ago

[screenshot] Neither of these works.

chenxu2048 commented 5 months ago

You should also clear the notebook output: https://stackoverflow.com/questions/24816237/ipython-notebook-clear-cell-output-in-code
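
For reference, the output can also be cleared from code rather than the GUI, using IPython's own helper:

from IPython.display import clear_output

clear_output(wait=False)  # clear the current cell's output area immediately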

osafaimal commented 5 months ago

I always do (in the GUI, not in my cells).

mnoukhov commented 5 months ago

this seems mostly solved by #1908 with

import gc

import torch
from vllm import LLM
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# Load the model via vLLM
llm = LLM(model=model_name, download_dir=saver_dir, tensor_parallel_size=num_gpus, gpu_memory_utilization=0.70)

# Delete the llm object and free the memory: tear down vLLM's model-parallel
# state, drop the worker and the engine, then release cached CUDA memory
destroy_model_parallel()
del llm.llm_engine.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()
print("Successfully deleted the llm pipeline and freed the GPU memory!")
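
Note that the import path above is from older vLLM releases; in newer versions destroy_model_parallel moved to vllm.distributed.parallel_state, so the snippet may need adjusting.
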
osafaimal commented 4 months ago

this seems mostly solved by #1908 with [the code quoted above]

I had already read that. My problem stays unsolved when I use the Vllm wrapper from LlamaIndex; otherwise it almost works. A little memory stays in use (~1 GB), but at least I can load and unload the models. The problem is that I can't find how to access the llm_engine member of the wrapped vllm.LLM.
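
One possible way to reach it, assuming the LlamaIndex wrapper stores the underlying vllm.LLM in a private attribute (the name _client below is a guess and may differ between versions; inspect vars(...) on the wrapper to find the real one):

import gc

import torch

# Hypothetical attribute name: LlamaIndex's Vllm wrapper is assumed here to
# keep the vllm.LLM instance in `_client`; check vars(vllm_wrapper) if not.
inner_llm = getattr(vllm_wrapper, "_client", None)
if inner_llm is not None:
    del inner_llm.llm_engine.driver_worker
del vllm_wrapper
gc.collect()
torch.cuda.empty_cache()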