Your current environment
🐛 Describe the bug
I am running a test Python script, `test_llm.py`. Its code is as follows:
```python
import argparse
import random
import time

import torch
from vllm import LLM, SamplingParams

random.seed(0)  # Set the random seed for reproducibility

_MB = 1 << 20
dummy_prompt = "hello " * 2000
prompts = [dummy_prompt for _ in range(512)]


def test_llm(model: str, n, max_tokens, tp_size):
    prompts_choose = prompts[:n]
    # print(prompts_choose)

    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.0,
                                     top_p=1.0,
                                     max_tokens=max_tokens,
                                     ignore_eos=True)

    # Create an LLM.
    llm = LLM(model=model,
              trust_remote_code=True,
              enforce_eager=True,
              disable_log_stats=False,
              max_num_seqs=n,
              tensor_parallel_size=tp_size,
              disable_custom_all_reduce=True,
              gpu_memory_utilization=1.0)

    # Generate texts from the prompts. The output is a list of RequestOutput
    # objects that contain the prompt, generated text, and other information.
    torch.cuda.synchronize()
    time1 = time.perf_counter()
    outputs = llm.generate(prompts_choose, sampling_params)
    torch.cuda.synchronize()
    time2 = time.perf_counter()

    free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
    print(f"use_gpu_memory: {(total_gpu_memory - free_gpu_memory)/_MB:.4f} MB, "
          f"free_gpu_memory: {free_gpu_memory/_MB:.4f} MB, "
          f"total_gpu_memory: {total_gpu_memory/_MB:.4f} MB")
    print(f"\nllm.generate over. All Generate Time: {time2 - time1:.5f} s\n")

    # # Print the outputs.
    # for output in outputs:
    #     prompt = output.prompt
    #     generated_text = output.outputs[0].text
    #     # print(f"Prompt: {prompt!r},\n")
    #     print(f"Generated text: {generated_text!r}\n")


def test():
    parser = argparse.ArgumentParser(description='Test LLM')
    parser.add_argument('-n', type=int, default=256, help='Number of prompts')
    parser.add_argument('-max_tokens', type=int, default=128,
                        help='Maximum number of tokens')
    parser.add_argument('-tp_size', type=int, default=1,
                        help='Tensor Parallel Size')
    parser.add_argument('-model', type=str, help='Model path')

    args = parser.parse_args()
    n = args.n
    max_tokens = args.max_tokens
    tp_size = args.tp_size
    model = args.model
    test_llm(model, n, max_tokens, tp_size)


test()
```

The run command is as follows:
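(For reference, based on the script's argparse flags an invocation would look something like `python test_llm.py -model <path-to-qwen-7b-chat> -n 256 -max_tokens 128 -tp_size 1`; the model path here is a placeholder and the exact command may differ.)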
When I use the model qwen-7b-chat with `gpu_memory_utilization=1.0`, it crashes inexplicably with the error `torch.cuda.OutOfMemoryError: CUDA out of memory`.
The error output is:
Like issue #7256, I modified the `determine_num_available_blocks` function code to:
OOM still occurs, but the run now reaches 23% progress before failing:
I then thought the OOM could also be due to a GPU memory leak, so I added `torch.cuda.empty_cache()` to every step of the `LLM._run_engine()` loop, like this:
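For reference, a minimal sketch of where the call goes, assuming the `_run_engine` loop in `vllm/entrypoints/llm.py` looks roughly like this (the real method has extra tqdm/bookkeeping code and differs between vLLM versions):

```python
# Sketch only: simplified from LLM._run_engine; not the exact vLLM source.
def _run_engine(self, use_tqdm: bool):
    outputs = []
    while self.llm_engine.has_unfinished_requests():
        step_outputs = self.llm_engine.step()
        for output in step_outputs:
            if output.finished:
                outputs.append(output)
        # Added line: return cached-but-unused blocks to the CUDA driver after
        # every engine step so later steps can allocate that memory again.
        torch.cuda.empty_cache()
    return sorted(outputs, key=lambda x: int(x.request_id))
```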
At this point, the program runs successfully to the end:
The fact that adding `torch.cuda.empty_cache()` is needed for a successful run may indicate that memory cached by PyTorch at the end of some earlier steps is not fully reused for allocations in subsequent steps, so those steps cannot allocate enough GPU memory and hit the OOM error. So I suspect there is some GPU memory leak (or unbounded cache growth) across multiple rounds of model forward passes, but I haven't figured out exactly what is wrong yet.
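To narrow this down, one thing that can be tried (a sketch using standard PyTorch allocator statistics, not something I have verified on this exact setup) is logging both allocated and reserved memory at every step: if `torch.cuda.memory_allocated()` keeps growing, some tensors are genuinely being leaked, whereas if only `torch.cuda.memory_reserved()` grows while allocated stays flat, the caching allocator is holding/fragmenting blocks rather than leaking them, which would match the `empty_cache()` observation.

```python
import torch

_MB = 1 << 20

def log_cuda_memory(tag: str) -> None:
    """Print allocator statistics to separate a real leak from cache growth."""
    allocated = torch.cuda.memory_allocated() / _MB  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved() / _MB    # bytes held by the caching allocator
    free, total = torch.cuda.mem_get_info()
    print(f"[{tag}] allocated={allocated:.1f} MB, reserved={reserved:.1f} MB, "
          f"free={free / _MB:.1f} MB / total={total / _MB:.1f} MB")

# e.g. call log_cuda_memory(f"step {i}") next to the empty_cache() call above.
```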