w013nad opened this issue 5 months ago
Do you have prefix caching enabled? If so, this might be the same issue I reported in #5537.
I believe prefix caching is a default setting, so yes, although I did not explicitly enable it.
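For anyone trying to rule prefix caching in or out, it can be toggled explicitly instead of relying on whatever the default happens to be. A minimal sketch using the offline `LLM` API (the model path is a placeholder, not a value from this thread):

```python
from vllm import LLM

# Explicitly disable automatic prefix caching so it can be ruled out as the
# cause of the illegal memory access (the model path is a placeholder).
llm = LLM(
    model="path/to/your/model",
    enable_prefix_caching=False,
)
```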
I have the same problem. The model I used is Qwen2-72B-Instruct-GPTQ-Int4. I tried vllm==0.5.0 and vllm==0.5.0.post1. The input is 25 texts, each with a length of about 2,000. The error occurs during inference, but there is no error when I predict the texts 10 at a time. The error can be reproduced every time and always gets stuck at 14/25.
Configuration is as follows:

```python
max_model_len, tp_size = 8192, 1
model = LLM(
    model=model_path,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    gpu_memory_utilization=0.8,
    dtype=model_dtype,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=8192, stop_token_ids=stop_token_ids)
```
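Since the report above notes that no error occurs when the texts are predicted 10 at a time, one workaround is to chunk the prompts instead of submitting all 25 in a single call. A minimal sketch, assuming the `model`, `sampling_params`, and `all_inputs` objects from the snippets in this comment (the chunk size of 10 is simply taken from the report, not a recommended value):

```python
# Workaround sketch (not a fix): submit prompts in small batches rather than
# all at once. `model`, `sampling_params`, and `all_inputs` are assumed to be
# defined as in the snippets above.
def generate_in_chunks(model, prompts, sampling_params, chunk_size=10):
    outputs = []
    for start in range(0, len(prompts), chunk_size):
        batch = prompts[start:start + chunk_size]
        outputs.extend(model.generate(prompts=batch, sampling_params=sampling_params))
    return outputs

generated = generate_in_chunks(model, all_inputs, sampling_params)
responses = [g.outputs[0].text for g in generated]
```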
The error information is as follows:
```
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[30], line 34, in model_chat(model, tokenizer, model_input, max_new_tokens, **gen_args)
---> 34 generated_ids = model.generate(
35 prompts=all_inputs,
36 sampling_params=sampling_params
37 )
38 response = [generated_id.outputs[0].text for generated_id in generated_ids]
39 # print('response:{}'.format(response))
File /opt/conda/lib/python3.11/site-packages/vllm/utils.py:691, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
684 msg += f" {additional_message}"
686 warnings.warn(
687 DeprecationWarning(msg),
688 stacklevel=3, # The inner function takes up one level
689 )
--> 691 return fn(*args, **kwargs)
File /opt/conda/lib/python3.11/site-packages/vllm/entrypoints/llm.py:304, in LLM.generate(self, prompts, sampling_params, prompt_token_ids, use_tqdm, lora_request)
296 sampling_params = SamplingParams()
298 self._validate_and_add_requests(
299 inputs=inputs,
300 params=sampling_params,
301 lora_request=lora_request,
302 )
--> 304 outputs = self._run_engine(use_tqdm=use_tqdm)
305 return LLMEngine.validate_outputs(outputs, RequestOutput)
File /opt/conda/lib/python3.11/site-packages/vllm/entrypoints/llm.py:556, in LLM._run_engine(self, use_tqdm)
554 total_out_toks = 0
555 while self.llm_engine.has_unfinished_requests():
--> 556 step_outputs = self.llm_engine.step()
557 for output in step_outputs:
558 if output.finished:
File /opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py:776, in LLMEngine.step(self)
767 if not scheduler_outputs.is_empty():
768 execute_model_req = ExecuteModelRequest(
769 seq_group_metadata_list=seq_group_metadata_list,
770 blocks_to_swap_in=scheduler_outputs.blocks_to_swap_in,
(...)
774 running_queue_size=scheduler_outputs.running_queue_size,
775 )
--> 776 output = self.model_executor.execute_model(
777 execute_model_req=execute_model_req)
778 else:
779 output = []
File /opt/conda/lib/python3.11/site-packages/vllm/executor/gpu_executor.py:91, in GPUExecutor.execute_model(self, execute_model_req)
88 def execute_model(
89 self, execute_model_req: ExecuteModelRequest
90 ) -> List[Union[SamplerOutput, PoolerOutput]]:
---> 91 output = self.driver_worker.execute_model(execute_model_req)
92 return output
File /opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File /opt/conda/lib/python3.11/site-packages/vllm/worker/worker.py:280, in Worker.execute_model(self, execute_model_req)
277 if num_seq_groups == 0:
278 return []
--> 280 output = self.model_runner.execute_model(seq_group_metadata_list,
281 self.gpu_cache)
283 # Worker only supports single-step execution. Wrap the output in a list
284 # to conform to interface.
285 return [output]
File /opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File /opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py:749, in ModelRunner.execute_model(self, seq_group_metadata_list, kv_caches)
746 else:
747 model_executable = self.model
--> 749 hidden_states = model_executable(
750 input_ids=input_tokens,
751 positions=input_positions,
752 kv_caches=kv_caches,
753 attn_metadata=attn_metadata,
754 **multi_modal_kwargs,
755 )
757 # Compute the logits.
758 logits = self.model.compute_logits(hidden_states, sampling_metadata)
File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
File /opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py:330, in Qwen2ForCausalLM.forward(self, input_ids, positions, kv_caches, attn_metadata)
323 def forward(
324 self,
325 input_ids: torch.Tensor,
(...)
328 attn_metadata: AttentionMetadata,
329 ) -> torch.Tensor:
--> 330 hidden_states = self.model(input_ids, positions, kv_caches,
331 attn_metadata)
332 return hidden_states
File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
File /opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py:254, in Qwen2Model.forward(self, input_ids, positions, kv_caches, attn_metadata)
252 for i in range(len(self.layers)):
253 layer = self.layers[i]
--> 254 hidden_states, residual = layer(
255 positions,
256 hidden_states,
257 kv_caches[i],
258 attn_metadata,
259 residual,
260 )
261 hidden_states, _ = self.norm(hidden_states, residual)
262 return hidden_states
File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
File /opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py:206, in Qwen2DecoderLayer.forward(self, positions, hidden_states, kv_cache, attn_metadata, residual)
203 else:
204 hidden_states, residual = self.input_layernorm(
205 hidden_states, residual)
--> 206 hidden_states = self.self_attn(
207 positions=positions,
208 hidden_states=hidden_states,
209 kv_cache=kv_cache,
210 attn_metadata=attn_metadata,
211 )
213 # Fully Connected
214 hidden_states, residual = self.post_attention_layernorm(
215 hidden_states, residual)
File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
File /opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py:153, in Qwen2Attention.forward(self, positions, hidden_states, kv_cache, attn_metadata)
151 q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
152 q, k = self.rotary_emb(positions, q, k)
--> 153 attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
154 output, _ = self.o_proj(attn_output)
155 return output
File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
File /opt/conda/lib/python3.11/site-packages/vllm/attention/layer.py:89, in Attention.forward(self, query, key, value, kv_cache, attn_metadata)
81 def forward(
82 self,
83 query: torch.Tensor,
(...)
87 attn_metadata: AttentionMetadata,
88 ) -> torch.Tensor:
---> 89 return self.impl.forward(query, key, value, kv_cache, attn_metadata,
90 self._kv_scale)
File /opt/conda/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py:355, in FlashAttentionImpl.forward(self, query, key, value, kv_cache, attn_metadata, kv_scale)
339 output[:num_prefill_tokens] = flash_attn_varlen_func(
340 q=query,
341 k=key_cache,
(...)
350 block_table=prefill_meta.block_tables,
351 )
353 if decode_meta := attn_metadata.decode_metadata:
354 # Decoding run.
--> 355 output[num_prefill_tokens:] = flash_attn_with_kvcache(
356 decode_query.unsqueeze(1),
357 key_cache,
358 value_cache,
359 block_table=decode_meta.block_tables,
360 cache_seqlens=decode_meta.seq_lens_tensor,
361 softmax_scale=self.scale,
362 causal=True,
363 alibi_slopes=self.alibi_slopes,
364 ).squeeze(1)
366 # Reshape the output tensor.
367 return output.view(num_tokens, hidden_size)
File /opt/conda/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py:1233, in flash_attn_with_kvcache(q, k_cache, v_cache, k, v, rotary_cos, rotary_sin, cache_seqlens, cache_batch_idx, block_table, softmax_scale, causal, window_size, rotary_interleaved, alibi_slopes, num_splits, out)
1231 cache_batch_idx = maybe_contiguous(cache_batch_idx)
1232 block_table = maybe_contiguous(block_table)
-> 1233 out, softmax_lse = flash_attn_cuda.fwd_kvcache(
1234 q,
1235 k_cache,
1236 v_cache,
1237 k,
1238 v,
1239 cache_seqlens,
1240 rotary_cos,
1241 rotary_sin,
1242 cache_batch_idx,
1243 block_table,
1244 alibi_slopes,
1245 out,
1246 softmax_scale,
1247 causal,
1248 window_size[0],
1249 window_size[1],
1250 rotary_interleaved,
1251 num_splits,
1252 )
1253 return out
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Hi @w013nad, have you resolved the issue? If so, what steps did you take to do so? If not, can you please provide steps to reproduce it using the following template (replacing the italicized values with your own)?
Same issue.
Same issue. I removed `--enforce-eager` and `--disable-custom-all-reduce`, and the error has not shown up again. I am not sure which of the two arguments was causing it.
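For reference, the rough Python-API equivalent of dropping those two CLI flags is shown below; both options are plain booleans that default to off, so leaving them at `False` matches omitting the flags. This is only a sketch, and the model path is a placeholder:

```python
from vllm import LLM

# Sketch of the engine arguments corresponding to the two CLI flags above;
# leaving both at False is equivalent to not passing the flags at all.
llm = LLM(
    model="path/to/your/model",
    enforce_eager=False,
    disable_custom_all_reduce=False,
)
```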
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
🐛 Describe the bug
I'm not entirely sure what caused this; the key point appears to be "illegal memory access". I know there were some crashing issues with 0.5.0, so I wonder if this is related. I'm using the official Docker container with the latest release, vllm==0.5.0.post1. I launched the OpenAI endpoint with:
I made calls to it via the openai Python package.
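The calls were made roughly along the lines of the sketch below (an assumption about the general pattern, not the exact script from the report; the base URL, API key, and model name are placeholders):

```python
# Hedged sketch of a client call against the vLLM OpenAI-compatible server.
# The base_url, api_key, and model name are placeholders, not reported values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="<served-model-name>",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```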
The server ran fine for about 20,000 calls of 1,000-15,000 tokens each, then failed at some point in the middle of the night. Regrettably, my Python code kept running, so I'm unable to find the exact message from the moment it failed; it is buried under another 30,000 failed calls.
Here is the error I get when I make a client completion request. A small note: one of our GPUs showed an ECC warning a few weeks ago, but it was not the GPU I am using here, so it could also be something wrong with our server.