run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: delete VLLM models #12492

Closed osafaimal closed 22 hours ago

osafaimal commented 3 months ago


Question

I didn't find how I can delete the model. With the default vLLM:

import gc
import torch
from vllm import LLM

# Load the model via vLLM
llm = LLM(model=model, download_dir="./models", max_model_len=27952)
!nvidia-smi

# Delete the llm object and free the memory
del llm.llm_engine.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
!nvidia-smi

It works, but when I try:

from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
import gc
import torch

# Delete the llm object and free the memory
del llm.llm_engine.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
!nvidia-smi

with the LlamaIndex one, there is no llm_engine member, and if I remove that line the memory is still held by Python. So how should we unload the model from memory with LlamaIndex?

dosubot[bot] commented 3 months ago

Hey @osafaimal, great to see you diving deeper into LlamaIndex challenges! Seems like you're wrestling with a tough one this time 🤖

Based on the information provided, it seems that the LlamaIndex model object does not have a direct equivalent to the llm_engine member found in the default VLLM model. This is likely why you're encountering issues when trying to free up memory in the same way.

However, you can still free up memory used by LlamaIndex models by following a more generic approach suitable for Python objects. This involves dereferencing the model object and then forcing garbage collection. Here's how you can do it:

import gc

# Assuming `model` is your loaded LLM model object
# (`load_llm(your_model_data)` is a placeholder for however you construct it)
model = load_llm(your_model_data)

# Your code that uses the model goes here

# Dereference the model
model = None

# Explicitly invoke garbage collection
gc.collect()

This approach works under the assumption that the Python garbage collector can manage the cleanup once all references to the object are removed. However, it's important to note that the effectiveness of this method can depend on the specific implementation of the LLM classes and the underlying resources they use (e.g., file handles, network connections, external processes). If those classes allocate resources that require explicit cleanup, you should ensure that any necessary cleanup methods provided by those classes are called before dereferencing the model objects.
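As a sketch of that pattern (the hook names below are hypothetical, not a documented LlamaIndex API; `model` is the object from the snippet above):

import gc

# Hypothetical pattern: call a cleanup hook if the object exposes one,
# then drop the last reference and force a collection.
for hook_name in ("close", "shutdown", "release"):
    hook = getattr(model, hook_name, None)
    if callable(hook):
        hook()  # let the object release file handles, subprocesses, etc.
        break

model = None
gc.collect()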

For more information, you can refer to the Python documentation on garbage collection.

Regarding the structure of the LlamaIndex model object, it is designed to run a vLLM model locally or connect to a vLLM server. It inherits from the LLM class and is tailored for integrating with various vLLM models, particularly those hosted on HuggingFace or similar platforms. The structure of the Vllm class includes several key attributes and methods that configure and control the behavior of the underlying language model. You can find more details in the source code.
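For reference, constructing the wrapper looks roughly like this (a minimal sketch; the model name and values are illustrative, not taken from this issue, and extra vLLM engine arguments pass through vllm_kwargs):

from llama_index.llms.vllm import Vllm

# Illustrative values only.
llm = Vllm(
    model="facebook/opt-125m",
    download_dir="./models",
    vllm_kwargs={"max_model_len": 27952},
)
print(llm.complete("Hello"))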

As for memory management or cleanup, the BaseMemory class, which serves as a base class for all memory types in the LlamaIndex model, does not explicitly define any specific methods or attributes dedicated to memory management or cleanup beyond the reset method. The reset method is abstract and intended to reset the chat history, which could be considered a form of cleanup. However, there are no methods or attributes explicitly named or described as being for memory management or cleanup in the traditional sense (e.g., freeing up memory resources, garbage collection facilitation). The focus of the provided methods (get, get_all, put, set, reset) is primarily on managing chat history data rather than managing memory resources directly. You can find more details in the source code.
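For instance, a short sketch with ChatMemoryBuffer, one concrete BaseMemory implementation (this manages chat history only and has no effect on GPU memory):

from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1000)
memory.put(ChatMessage(role="user", content="Hello"))
print(memory.get())  # chat history within the token limit
memory.reset()       # clears chat history; it does not free model weights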

I hope this helps! If you have any further questions or need more clarification, feel free to ask.


osafaimal commented 3 months ago

No, it doesn't work.

dosubot[bot] commented 3 months ago

To address the issue of unloading the LlamaIndex model from memory, it's crucial to ensure that all references to the model and its associated data structures are removed. This allows Python's garbage collector to reclaim the memory. Here's a revised approach to delete the model and free up memory:

from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
import gc
import torch

# Tear down vLLM's model-parallel state, then delete the llm object and free the memory
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()
!nvidia-smi

By tearing down the model-parallel state, deleting the llm object, invoking Python's garbage collector with gc.collect(), and then clearing the CUDA cache using torch.cuda.empty_cache(), you're effectively removing references to the model. This should free up the memory previously occupied by the model.

It's important to note that the success of this operation depends on ensuring there are no other references to the model or its components elsewhere in your code. In complex applications, especially those that might create circular references or use global variables, it's possible for memory to remain allocated if references to the objects you wish to delete still exist. Although Python's garbage collector is designed to handle circular references, it's best practice to manage references carefully and ensure objects are dereferenced when no longer needed.

If you continue to face issues with memory not being freed, it would be beneficial to examine the structure of your llm object and its usage within your code more closely. This could provide further insights into why memory is not being released as expected and allow for more targeted solutions.
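If the memory still is not released, a small diagnostic sketch (assuming `llm` is the object you expect to be freed) can show what is keeping it alive:

import gc

def show_referrers(obj):
    # Print the objects that still hold a reference to `obj`.
    for ref in gc.get_referrers(obj):
        print(type(ref), repr(ref)[:120])

# Call this before `del llm` to spot caches, globals, or closures keeping it alive.
# show_referrers(llm)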


logan-markewich commented 3 months ago

The code does this in the constructor:

self._client = VLLModel(
    model=model,
    tensor_parallel_size=tensor_parallel_size,
    trust_remote_code=trust_remote_code,
    dtype=dtype,
    download_dir=download_dir,
    **vllm_kwargs
)

So some operation on the client is probably needed. I'm kind of surprised that just

del llamaindex_llm
gc.collect()
torch.cuda.empty_cache()

doesn't work. GPU memory is hard to manage, in my experience.
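For what it's worth, here is an untested sketch that applies the same teardown through the wrapper's private _client attribute (the vllm.LLM instance built in the constructor above); private attribute names and the destroy_model_parallel import path may change between vLLM and LlamaIndex versions:

import gc
import torch
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# llamaindex_llm is a llama_index.llms.vllm.Vllm instance; its private _client
# attribute is the vllm.LLM object created in the constructor shown above.
destroy_model_parallel()                              # tear down vLLM's parallel state
del llamaindex_llm._client.llm_engine.driver_worker   # drop the worker holding GPU weights
del llamaindex_llm                                    # drops the wrapper and _client with it
gc.collect()
torch.cuda.empty_cache()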