sfc-gh-zhwang opened 7 months ago
After running the code below, is there an API (maybe something like `llm.terminate`) to kill the llm and release the GPU memory?

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")  # the original snippet omits this line; any model works
outputs = llm.generate(prompts, sampling_params)
```
Please check the code below; it works.
```python
import gc

import torch
from vllm import LLM, SamplingParams
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# Load the model via vLLM
llm = LLM(model=model_name, download_dir=saver_dir, tensor_parallel_size=num_gpus,
          gpu_memory_utilization=0.70)

# Delete the llm object and free the memory
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()
print("Successfully deleted the llm pipeline and freed the GPU memory!")
```
Best regards,
Shuyue Dec. 3rd, 2023
Even after executing the code above, the GPU memory is not freed with the latest vllm built from source. Any recommendations?
Are there any updates on this? The above code does not work for me either.
+1
I find that we need to explicitly run `del llm.llm_engine.driver_worker` to release it when using a single worker. Can anybody explain why this is the case?
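For context, a minimal sketch of the cleanup this refers to (assuming a vLLM version of this era, where the engine holds the worker as `llm_engine.driver_worker`, and that `llm` is a vllm.LLM object):

```python
import gc

import torch

# llm is a vllm.LLM object
del llm.llm_engine.driver_worker  # drop the engine's reference to the worker
del llm
gc.collect()
torch.cuda.empty_cache()
```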
+1
I tried the above code block and also the line `del llm.llm_engine.driver_worker`. Both failed for me.
But with the following code I managed to terminate the vllm.LLM(), release the GPU memory, and shut down Ray, so that vllm.LLM() can be used for the next model. After this, I succeeded in using vllm.LLM() again for the next model:
```python
# llm is a vllm.LLM object
import gc

import torch
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

destroy_model_parallel()
# delete the vllm.executor.ray_gpu_executor.RayGPUExecutor object
del llm.llm_engine.model_executor
del llm
gc.collect()
torch.cuda.empty_cache()

import ray
ray.shutdown()
```
Anyway, even if it works, it is just a temporary solution and this issue still needs fixing.
Update: the following code works better, without the possible deadlock warning.
```python
# llm is a vllm.LLM object
import gc
import os

import torch
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# avoid a huggingface/tokenizers process deadlock
os.environ["TOKENIZERS_PARALLELISM"] = "false"

destroy_model_parallel()
# delete the vllm.executor.ray_gpu_executor.RayGPUExecutor object
del llm.llm_engine.model_executor
del llm
gc.collect()
torch.cuda.empty_cache()

import ray
ray.shutdown()
```
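As a quick sanity check (a small sketch, not part of the snippet above), torch's allocator counter should drop back to near zero after the cleanup, assuming nothing else holds CUDA tensors:

```python
import torch

# how much CUDA memory is still allocated after the cleanup?
print(f"allocated after cleanup: {torch.cuda.memory_allocated() // 1024 // 1024}MB")
```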
In the latest version of vLLM, `destroy_model_parallel` has moved to `vllm.distributed.parallel_state`. The objects you have to delete have also changed:
```python
import gc

import torch
from vllm.distributed.parallel_state import destroy_model_parallel

...

destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm  # isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()
```
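For tensor-parallel runs it may also help to tear down the distributed state. A hedged sketch, assuming a vLLM version whose `vllm.distributed.parallel_state` also exports `destroy_distributed_environment` (check your version before relying on it):

```python
import torch
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

destroy_model_parallel()
destroy_distributed_environment()  # assumption: exported by newer vLLM versions
if torch.distributed.is_initialized():
    torch.distributed.destroy_process_group()
```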
thx a lot
vLLM seems to hang on to the first allocated LLM() instance; it does not hang on to later instances. Maybe that helps with diagnosing the issue?
```python
from vllm import LLM

def show_memory_usage():
    import gc

    import torch.cuda
    import torch.distributed

    print(f"cuda memory: {torch.cuda.memory_allocated()//1024//1024}MB")
    gc.collect()
    # torch.distributed.destroy_process_group()
    torch.cuda.empty_cache()
    print(f" --> after gc: {torch.cuda.memory_allocated()//1024//1024}MB")

def gc_problem():
    show_memory_usage()

    print("loading llm0")
    llm0 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=180)
    del llm0
    show_memory_usage()

    print("loading llm1")
    llm1 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=500)
    del llm1
    show_memory_usage()

    print("loading llm2")
    llm2 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=600)
    del llm2
    show_memory_usage()

gc_problem()
```
```
root@c09a058c2d5b:/workspaces/aici/py/vllm# python tests/core/block/e2e/gc_problem.py | grep -v INFO
cuda memory: 0MB
 --> after gc: 0MB
loading llm0
cuda memory: 368MB
 --> after gc: 368MB
loading llm1
cuda memory: 912MB
 --> after gc: 368MB
loading llm2
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
cuda memory: 961MB
 --> after gc: 368MB
root@c09a058c2d5b:/workspaces/aici/py/vllm#
```
`llm1` consumes more than `llm0`, but you can see that the allocated memory stays at the `llm0` level.
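To see what is pinning those 368MB, one option is to walk the garbage collector for CUDA tensors that are still referenced after `del llm0` (a diagnostic sketch, not vLLM API):

```python
import gc

import torch

def live_cuda_tensors():
    """Yield (type, shape, MiB) for every CUDA tensor the GC still tracks."""
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                yield (type(obj).__name__, tuple(obj.shape),
                       obj.element_size() * obj.nelement() // 2**20)
        except Exception:
            continue  # some tracked objects raise on attribute access

for name, shape, mib in live_cuda_tensors():
    print(name, shape, f"{mib}MiB")
```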
Tried the updated snippet above (the one importing from `vllm.distributed.parallel_state`), including `ray.shutdown()`, but the memory is not released on my end. Any other suggestion?
You could try the `del llm.llm_engine.model_executor` from the updated snippet earlier in the thread (the one that sets `TOKENIZERS_PARALLELISM` and shuts down Ray) instead.
Did that as well; still no change in GPU memory allocation. Not sure how to go further.
We tried the `vllm.distributed.parallel_state` snippet above in version 0.4.2, but the GPU memory was not released.
Then I do not have a clue either. Meanwhile, I should add one piece of information: the vLLM version with which I succeeded with the above code was 0.4.0.post1.
@zheyang0825 does adding this line at the end make it work?

```python
torch.distributed.destroy_process_group()
```
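That is, the full sequence would look like this (a sketch assembled from the snippets above, using the new import path; `llm` is a vllm.LLM object):

```python
import gc

import torch
from vllm.distributed.parallel_state import destroy_model_parallel

# llm is a vllm.LLM object
destroy_model_parallel()
del llm.llm_engine.model_executor
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()  # the suggested addition at the end
```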
Tried on 0.4.0.post1 and the method worked; not sure what changed in the latest version that stops the memory from being released. Possible bug?
Hello! So if I'm not wrong, no one has managed to release memory on vllm 0.4.2 yet?
A new bug was introduced in 0.4.2 but fixed in https://github.com/vllm-project/vllm/pull/4737. Please try with that PR, or as a workaround you can also install `tensorizer`.
This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.
I updated vLLM yesterday and still have the problem. I'm using these lines:

```python
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
```
This code worked for me on vllm==0.4.0.post1:
```python
# model is a vllm.LLM object
import gc

import ray
import torch
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

print('service stopping ..')
print(f"cuda memory: {torch.cuda.memory_allocated() // 1024 // 1024}MB")

destroy_model_parallel()
del model.llm_engine.model_executor.driver_worker
del model
gc.collect()
torch.cuda.empty_cache()
ray.shutdown()

print(f"cuda memory: {torch.cuda.memory_allocated() // 1024 // 1024}MB")
print("service stopped")
```
There should be a built-in way! We cannot keep writing code that breaks on the next minor release :(
In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and the cleanup can be prone to deadlocks.
I would say the most stable way to terminate vLLM is to shut down the process.
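For what it's worth, a minimal sketch of that process-level approach, assuming the LLM/SamplingParams API used elsewhere in this thread. Each model lifecycle runs in a fresh spawned process, so the OS reclaims all GPU memory when the process exits:

```python
import multiprocessing as mp

def _generate(model_path, prompts, queue):
    # import inside the child so CUDA is initialized only in the child process
    from vllm import LLM, SamplingParams
    llm = LLM(model=model_path)
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    queue.put([o.outputs[0].text for o in outputs])

def generate_in_subprocess(model_path, prompts):
    ctx = mp.get_context("spawn")  # CUDA requires spawn, not fork
    queue = ctx.Queue()
    p = ctx.Process(target=_generate, args=(model_path, prompts, queue))
    p.start()
    result = queue.get()  # read before join to avoid blocking on a full pipe
    p.join()
    return result

# usage: texts = generate_in_subprocess("facebook/opt-125m", ["Hello, my name is"])
```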
I encountered the TP > 1 issue mentioned above with TP = 8. I'm doing this in an iterative manner, since I need to run the embedding model after the generative model, so there is a lot of loading/offloading. The first iteration is fine, but on the second iteration the instantiation of the vLLM Ray server hangs.
I understand your point. However, this feature is extremely useful for situations where you need to switch between models, for instance reinforcement learning loops. I am writing an off-policy RL loop, which requires me to train one model (the target policy) while its previous version performs inference (the behavior policy). As a result, I frequently load and unload models. While I know vLLM is not intended for training, using `transformers` would be too slow, making my technique unviable.
Let me know if this is a feature that's wanted and the team would be interested in maintaining. I can open a separate issue and start working on it.
I don't know if anyone can currently clear the memory correctly; in version 0.4.2 I applied the code above and it failed to clear the memory. I can only use a slightly extreme workaround: create a new process for the call and close the process afterwards, which roughly solves the problem:
```python
import torch
from multiprocessing import Process, set_start_method
from vllm import LLM, SamplingParams

set_start_method('spawn', force=True)

def vllm_texts(model_path):
    prompts = ["..."]  # your prompts here
    sampling_params = SamplingParams(max_tokens=512)
    llm = LLM(model=model_path)
    outputs = llm.generate(prompts, sampling_params)
    ...

print(torch.cuda.memory_summary())
# run the whole vLLM lifecycle in a child process; its exit releases the GPU memory
p = Process(target=vllm_texts, args=(model_path,))
p.start()
p.join()
if p.is_alive():
    p.terminate()
    p.close()
print(torch.cuda.memory_summary())
...
```
I still hope there will be a way in the future to correctly and completely clear the memory.
When I use multiple GPUs to serve an LLM (`tensor_parallel_size > 1`), the GPUs' memory is not released, except on the first GPU (`cuda:0`).
Glad to see you here @cassanof and to hear that you have been using vLLM in this kind of workflow!
Given how much this feature seems to be wanted, I will bring it back to the team to discuss! If multi-GPU instances are prone to deadlocks, then perhaps we can at least start with single-GPU instances. Everyone on the maintainer team has limited bandwidth and we have a lot of things to work on, so contributions are very welcome, as always!