vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Is there a way to terminate vllm.LLM and release the GPU memory #1908

Open sfc-gh-zhwang opened 7 months ago

sfc-gh-zhwang commented 7 months ago

After running the code below, is there an API (maybe something like llm.terminate) to kill the LLM and release the GPU memory?

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")  # example model
outputs = llm.generate(prompts, sampling_params)
SuperBruceJia commented 7 months ago

> After running the code below, is there an API (maybe something like llm.terminate) to kill the LLM and release the GPU memory?
>
> from vllm import LLM, SamplingParams
>
> prompts = [
>     "Hello, my name is",
>     "The president of the United States is",
>     "The capital of France is",
>     "The future of AI is",
> ]
> sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
> llm = LLM(model="facebook/opt-125m")  # example model
> outputs = llm.generate(prompts, sampling_params)

Please check the code below. It works.

import gc

import torch
from vllm import LLM, SamplingParams
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# Load the model via vLLM (model_name, saver_dir, and num_gpus are placeholders for your own values)
llm = LLM(model=model_name, download_dir=saver_dir, tensor_parallel_size=num_gpus, gpu_memory_utilization=0.70)

# Delete the llm object and free the memory
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()
print("Successfully delete the llm pipeline and free the GPU memory!")

Best regards,

Shuyue Dec. 3rd, 2023

hijkzzz commented 7 months ago

mark

deepbrain commented 4 months ago

Even after executing the code above, the GPU memory is not freed with the latest vllm built from source. Any recommendations?

huylenguyen commented 4 months ago

Are there any updates on this? The above code does not work for me either.

puddingfjz commented 4 months ago

+1

puddingfjz commented 4 months ago

I find that we need to explicitly run "del llm.llm_engine.driver_worker" to release the memory when using a single worker. Can anybody explain why this is the case?
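
(For context, a minimal sketch of the full sequence this line is part of, assuming an older vLLM layout where destroy_model_parallel still lives under vllm.model_executor.parallel_utils and the single worker hangs directly off llm.llm_engine.driver_worker; the model name is just an example.)

import gc

import torch
from vllm import LLM
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

llm = LLM(model="facebook/opt-125m")  # example model, single worker (tensor_parallel_size=1)
# ... run llm.generate(...) as usual ...

# Tear down parallel state, then drop the worker that holds the CUDA tensors.
destroy_model_parallel()
del llm.llm_engine.driver_worker  # the single-worker attribute discussed above
del llm
gc.collect()
torch.cuda.empty_cache()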

shyringo commented 2 months ago

+1

shyringo commented 2 months ago

> I find that we need to explicitly run "del llm.llm_engine.driver_worker" to release the memory when using a single worker. Can anybody explain why this is the case?

I tried the above code block and also this line "del llm.llm_engine.driver_worker". Both failed for me.


But I managed, with the following code, to terminate the vllm.LLM() instance, release the GPU memory, and shut down Ray for convenience when using vllm.LLM() for the next model. After this, I succeeded in using vllm.LLM() again for the next model.

        # llm is a vllm.LLM object
        import gc

        import ray
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

        destroy_model_parallel()
        # delete the vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        ray.shutdown()

Anyway, even though it works, it is just a temporary workaround and this issue still needs fixing.

shyringo commented 2 months ago

> I tried the above code block and also this line "del llm.llm_engine.driver_worker". Both failed for me.
>
> But I managed, with the following code, to terminate the vllm.LLM() instance, release the GPU memory, and shut down Ray for convenience when using vllm.LLM() for the next model. After this, I succeeded in using vllm.LLM() again for the next model.
>
>         # llm is a vllm.LLM object
>         import gc
>
>         import ray
>         import torch
>         from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
>
>         destroy_model_parallel()
>         # delete the vllm.executor.ray_gpu_executor.RayGPUExecutor object
>         del llm.llm_engine.model_executor
>         del llm
>         gc.collect()
>         torch.cuda.empty_cache()
>         ray.shutdown()
>
> Anyway, even though it works, it is just a temporary workaround and this issue still needs fixing.

Update: the following code works better, without the possible deadlock warning.

        # llm is a vllm.LLM object
        import gc
        import os

        import ray
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

        # avoid the huggingface/tokenizers deadlock warning
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        destroy_model_parallel()
        # delete the vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        ray.shutdown()
ticoneva commented 2 months ago

In the latest version of vLLM, destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:

from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()
rbao2018 commented 2 months ago

> In the latest version of vLLM, destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:
>
> from vllm.distributed.parallel_state import destroy_model_parallel
> ...
> destroy_model_parallel()
> del llm.llm_engine.model_executor.driver_worker
> del llm # Isn't necessary for releasing memory, but why not
> gc.collect()
> torch.cuda.empty_cache()

thx a lot

mmoskal commented 1 month ago

vLLM seems to hang on to the first allocated LLM() instance. It does not hang on to later instances. Maybe that helps with diagnosing the issue?

from vllm import LLM

def show_memory_usage():
    import torch.cuda
    import torch.distributed
    import gc

    print(f"cuda memory: {torch.cuda.memory_allocated()//1024//1024}MB")
    gc.collect()
    # torch.distributed.destroy_process_group()
    torch.cuda.empty_cache()
    print(f"  --> after gc: {torch.cuda.memory_allocated()//1024//1024}MB")

def gc_problem():
    show_memory_usage()
    print("loading llm0")
    llm0 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=180)
    del llm0
    show_memory_usage()

    print("loading llm1")
    llm1 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=500)
    del llm1
    show_memory_usage()

    print("loading llm2")
    llm2 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=600)
    del llm2
    show_memory_usage()

gc_problem()
root@c09a058c2d5b:/workspaces/aici/py/vllm# python tests/core/block/e2e/gc_problem.py |grep -v INFO
cuda memory: 0MB
  --> after gc: 0MB
loading llm0
cuda memory: 368MB
  --> after gc: 368MB
loading llm1
cuda memory: 912MB
  --> after gc: 368MB
loading llm2
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
cuda memory: 961MB
  --> after gc: 368MB
root@c09a058c2d5b:/workspaces/aici/py/vllm# 

llm1 consumes more memory than llm0, but you can see that the allocated memory stays at the llm0 level.

yudataguy commented 1 month ago

> In the latest version of vLLM, destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:
>
> from vllm.distributed.parallel_state import destroy_model_parallel
> ...
> destroy_model_parallel()
> del llm.llm_engine.model_executor.driver_worker
> del llm # Isn't necessary for releasing memory, but why not
> gc.collect()
> torch.cuda.empty_cache()

Tried this, including ray.shutdown(), but the memory is not released on my end. Any other suggestions?

shyringo commented 1 month ago

> Tried this, including ray.shutdown(), but the memory is not released on my end. Any other suggestions?

You could try the "del llm.llm_engine.model_executor" line in the following code instead:

> Update: the following code works better, without the possible deadlock warning.
>
>         # llm is a vllm.LLM object
>         import gc
>         import os
>
>         import ray
>         import torch
>         from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
>
>         # avoid the huggingface/tokenizers deadlock warning
>         os.environ["TOKENIZERS_PARALLELISM"] = "false"
>         destroy_model_parallel()
>         # delete the vllm.executor.ray_gpu_executor.RayGPUExecutor object
>         del llm.llm_engine.model_executor
>         del llm
>         gc.collect()
>         torch.cuda.empty_cache()
>         ray.shutdown()
yudataguy commented 1 month ago

> > Tried this, including ray.shutdown(), but the memory is not released on my end. Any other suggestions?
>
> You could try the "del llm.llm_engine.model_executor" line in the following code instead:
>
> Update: the following code works better, without the possible deadlock warning.
>
>         # llm is a vllm.LLM object
>         import gc
>         import os
>
>         import ray
>         import torch
>         from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
>
>         # avoid the huggingface/tokenizers deadlock warning
>         os.environ["TOKENIZERS_PARALLELISM"] = "false"
>         destroy_model_parallel()
>         # delete the vllm.executor.ray_gpu_executor.RayGPUExecutor object
>         del llm.llm_engine.model_executor
>         del llm
>         gc.collect()
>         torch.cuda.empty_cache()
>         ray.shutdown()

Did that as well, still no change in GPU memory allocation. Not sure how to go further.

zheyang0825 commented 1 month ago

> In the latest version of vLLM, destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:
>
> from vllm.distributed.parallel_state import destroy_model_parallel
> ...
> destroy_model_parallel()
> del llm.llm_engine.model_executor.driver_worker
> del llm # Isn't necessary for releasing memory, but why not
> gc.collect()
> torch.cuda.empty_cache()

We tried this in version 0.4.2, but the GPU memory was not released.

shyringo commented 1 month ago

> Did that as well, still no change in GPU memory allocation. Not sure how to go further.

Then I do not have a clue either. Meanwhile, I should add one piece of information: the vllm version with which the above code worked for me was 0.4.0.post1.

mnoukhov commented 1 month ago

@zheyang0825 does adding this line at the end make it work?

torch.distributed.destroy_process_group()         
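
For concreteness, a sketch of the full sequence with that call appended at the end (assuming a 0.4.x layout where destroy_model_parallel lives in vllm.distributed.parallel_state, as in the snippet quoted above, and llm is an existing vllm.LLM instance):

import gc

import torch
import torch.distributed
from vllm.distributed.parallel_state import destroy_model_parallel

# llm is an existing vllm.LLM instance
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
# tear down the default process group that vLLM initialized
if torch.distributed.is_initialized():
    torch.distributed.destroy_process_group()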
yudataguy commented 1 month ago

> > Did that as well, still no change in GPU memory allocation. Not sure how to go further.
>
> Then I do not have a clue either. Meanwhile, I should add one piece of information: the vllm version with which the above code worked for me was 0.4.0.post1.

Tried it on 0.4.0.post1 and the method worked. Not sure what changed in the latest version that keeps the memory from being released; possible bug?

GurvanR commented 1 month ago

Hello! So if I'm not wrong, no one has managed to release the memory on vllm 0.4.2 yet?

njhill commented 1 month ago

A new bug was introduced in 0.4.2 but fixed in https://github.com/vllm-project/vllm/pull/4737. Please try with that PR, or as a workaround you can also install tensorizer.

This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

GurvanR commented 1 month ago

> A new bug was introduced in 0.4.2 but fixed in #4737. Please try with that PR, or as a workaround you can also install tensorizer.
>
> This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

I updated vllm yesterday and still have the problem. I'm using these lines:

destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
Misterrendal commented 1 month ago

This code worked for me

vllm==0.4.0.post1

        import gc

        import ray
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

        # model is an existing vllm.LLM instance
        print('service stopping ..')
        print(f"cuda memory: {torch.cuda.memory_allocated() // 1024 // 1024}MB")

        destroy_model_parallel()

        del model.llm_engine.model_executor.driver_worker
        del model

        gc.collect()
        torch.cuda.empty_cache()
        ray.shutdown()

        print(f"cuda memory: {torch.cuda.memory_allocated() // 1024 // 1024}MB")

        print("service stopped")
cassanof commented 1 month ago

There should be a built-in way! We cannot keep writing code that breaks on the next minor release :(

youkaichao commented 1 month ago

In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and the cleanup might be prone to deadlocks.

I would say, the most stable way to terminate vLLM is to shut down the process.
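
For example, a minimal sketch of that approach: run the LLM in a spawned child process, so that all of its GPU memory is returned to the system when the process exits (the model name and prompt below are placeholders).

from multiprocessing import get_context

def run_inference(model_path, prompts, result_queue):
    # Everything vLLM allocates lives and dies with this child process.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    outputs = llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95))
    result_queue.put([o.outputs[0].text for o in outputs])

if __name__ == "__main__":
    ctx = get_context("spawn")  # spawn avoids inheriting CUDA state from the parent
    queue = ctx.Queue()
    p = ctx.Process(target=run_inference,
                    args=("facebook/opt-125m", ["Hello, my name is"], queue))
    p.start()
    texts = queue.get()   # fetch results before joining
    p.join()              # once the child exits, its GPU memory is released
    print(texts)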

Vincent-Li-9701 commented 1 month ago

> A new bug was introduced in 0.4.2 but fixed in #4737. Please try with that PR, or as a workaround you can also install tensorizer.
>
> This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

I encountered this issue when TP = 8. I'm doing this in an iterative manner, since I need to run the embedding model after the generative model, so there is some loading/offloading. The first iteration is fine, but in the second iteration the instantiation of the vLLM Ray server hangs.

cassanof commented 1 month ago

> In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and the cleanup might be prone to deadlocks.
>
> I would say, the most stable way to terminate vLLM is to shut down the process.

I understand your point. However, this feature is extremely useful in situations where you need to switch between models, for instance in reinforcement learning loops. I am writing an off-policy RL loop, which requires me to train one model (the target policy) while its previous version performs inference (the behavior policy). As a result, I frequently load and unload models. While I know vLLM is not intended for training, using transformers would be too slow, making my technique unviable.

Let me know if this is a feature that's wanted and the team would be interested in maintaining it. I can open a separate issue and start working on it.

DuZKai commented 1 month ago

I don't know whether anyone can currently clear the memory correctly, but in version 0.4.2 the code above failed to clear the memory for me. I can only use a slightly extreme method: create a new process for the call and close that process afterwards, which roughly solves the problem:

import torch
from multiprocessing import Process, set_start_method
from vllm import LLM, SamplingParams

set_start_method('spawn', force=True)

def vllm_texts(model_path):
    prompts = ""
    sampling_params = SamplingParams(max_tokens=512)
    llm = LLM(model=model_path)
    outputs = llm.generate(prompts, sampling_params)

...
print(torch.cuda.memory_summary())
p = Process(target=vllm_texts, args=(model_path,))  # args must be a tuple, hence the trailing comma
p.start()
p.join()
if p.is_alive():
    p.terminate()
p.close()
print(torch.cuda.memory_summary())
...

I still hope there will be a way in the future to correctly and completely clear the memory.

SuperBruceJia commented 3 weeks ago

While I am using multiple GPUs to serve an LLM (tensor_parallel_size > 1), the GPUs' memory is not released, except on the first GPU (cuda:0).

[screenshot of GPU memory usage]
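
For reference, a quick way to print per-GPU memory usage from Python, assuming the optional nvidia-ml-py (pynvml) package is installed; this is just a diagnostic check, not part of vLLM.

import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"cuda:{i} used {mem.used // 1024 // 1024}MB / {mem.total // 1024 // 1024}MB")
pynvml.nvmlShutdown()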
ywang96 commented 2 weeks ago

> > In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and the cleanup might be prone to deadlocks. I would say, the most stable way to terminate vLLM is to shut down the process.
>
> I understand your point. However, this feature is extremely useful in situations where you need to switch between models, for instance in reinforcement learning loops. I am writing an off-policy RL loop, which requires me to train one model (the target policy) while its previous version performs inference (the behavior policy). As a result, I frequently load and unload models. While I know vLLM is not intended for training, using transformers would be too slow, making my technique unviable.
>
> Let me know if this is a feature that's wanted and the team would be interested in maintaining it. I can open a separate issue and start working on it.

Glad to see you here @cassanof and to hear that you have been using vLLM in this kind of workflow!

Given how much this feature seems to be wanted, I will bring this back to the team to discuss! If multi-GPU instances are prone to deadlocks, then perhaps we can at least start with single-GPU instances. Everyone on the maintainer team has limited bandwidth and we have a lot of things to work on, so contributions are very welcome as always!