Your current environment
🐛 Describe the bug
I am running a test Python script `test_llm.py`; the code is as follows:
```python
import torch
from vllm import LLM, SamplingParams
import random
import argparse
import time

random.seed(0)  # Set the random seed for reproducibility

dummy_prompt = "hello " * 30
# print(dummy_prompt)

prompts = []
with open("./benchmarks/sonnet.txt", "r") as f:
    prompts = f.readlines()
    prompts = [prompt.strip() for prompt in prompts]
# random.shuffle(prompts)


def test_llm(model: str, n, max_tokens, tp_size):
    prompts_choose = prompts[:n]
    # print(prompts_choose)

    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0,
                                     max_tokens=max_tokens, ignore_eos=True)

    # Create an LLM.
    llm = LLM(model=model,
              trust_remote_code=True,
              enforce_eager=True,
              disable_log_stats=False,
              max_num_seqs=n,
              tensor_parallel_size=tp_size,
              disable_custom_all_reduce=True,
              gpu_memory_utilization=0.9)

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    torch.cuda.synchronize()
    time1 = time.perf_counter()
    outputs = llm.generate(prompts_choose, sampling_params)
    torch.cuda.synchronize()
    time2 = time.perf_counter()
    print(f"\nllm.generate over. All Generate Time: {time2 - time1:.5f} s\n")

    # # Print the outputs.
    # for output in outputs:
    #     prompt = output.prompt
    #     generated_text = output.outputs[0].text
    #     # print(f"Prompt: {prompt!r},\n")
    #     print(f"Generated text: {generated_text!r}\n")


def test():
    parser = argparse.ArgumentParser(description='Test LLM')
    parser.add_argument('-n', type=int, default=4, help='Number of prompts')
    parser.add_argument('-max_tokens', type=int, default=16, help='Maximum number of tokens')
    parser.add_argument('-tp_size', type=int, default=1, help='Tensor Parallel Size')
    parser.add_argument('-model', type=str, help='Model path')

    args = parser.parse_args()
    n = args.n
    max_tokens = args.max_tokens
    tp_size = args.tp_size
    model = args.model

    test_llm(model, n, max_tokens, tp_size)


test()
```

The run command is as follows:
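(Reconstructed from the script's argparse flags; the concrete flag values below are assumptions matching the failing configuration described next, not a verbatim copy of the original command:)

```bash
python test_llm.py -model qwen2-72b-instruct -n 128 -max_tokens 16 -tp_size 4
```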
When I use the model qwen2-72b-instruct with max_num_seqs=128, tensor_parallel_size=4, enforce_eager=True, and the prompts from vllm/benchmarks/sonnet.txt, it crashes inexplicably with `RuntimeError: CUDA error: an illegal memory access was encountered`.
It has been verified to occur with batch_size=128 (batch sizes 64 and 256 are normal) and with max_tokens >= 4 (max_tokens of 4, 8, 16, 32, 64, 128, 256, etc. all trigger the crash).
The error also occurs when `dummy_prompt = "hello " * 30` is used as the prompt instead of sonnet.txt.
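A minimal sketch of that dummy-prompt variant, assuming the same failing configuration described above (the model path and parameter values here are assumptions, not a verbatim repro):

```python
from vllm import LLM, SamplingParams

# Assumed failing configuration: 128 sequences, tensor_parallel_size=4, max_tokens=16.
prompts = ["hello " * 30] * 128
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=16, ignore_eos=True)

llm = LLM(model="qwen2-72b-instruct",  # hypothetical local model path
          trust_remote_code=True,
          enforce_eager=True,
          max_num_seqs=128,
          tensor_parallel_size=4,
          disable_custom_all_reduce=True,
          gpu_memory_utilization=0.9)

# With this setup, the generate call below is where the illegal memory access is hit.
outputs = llm.generate(prompts, sampling_params)
```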
The error output is:
It seems to only happen when batch_size=128, in the flash_attn backend at the `output[num_prefill_tokens:] = flash_attn_with_kvcache(...)` call; I haven't figured out the cause of the error yet.
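One way to narrow this down might be to force a different attention backend and check whether the illegal memory access still reproduces at batch_size=128; a diagnostic sketch using vLLM's VLLM_ATTENTION_BACKEND environment variable (assuming the XFORMERS backend is usable with this model):

```python
import os

# Select the attention backend before vLLM builds its engine,
# so the flash_attn path is not used at all.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams  # noqa: E402

# ... then run the same test_llm() workload as above with -n 128 and compare.
```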