Sekri0 opened this issue 2 months ago
@zachzzc @raywanb Sorry to bother you guys, could you please take a look at this problem?
I think I have the same bug (also running 0.5.5) when running with tp=2 and a Neural Magic FP8-quantized (8-bit) Llama 3.1 70B. It happens rarely, but it does happen (2x H100).
Can you provide a minimal script to reproduce your problem? @Sekri0
I can provide one for mine, but it happens intermittently and I couldn't pin down the exact trigger.
Sorry for the late reply. This issue occurs partway through serving, so I'm not sure exactly which request causes the crash. I tried to reproduce the bug with offline inference, but failed.
It would be helpful if you could log/print the inputs to the flash attention call at https://github.com/vllm-project/vllm/blob/6234385f4a826edd5c4e0ca7dbdea480be215c5e/vllm/attention/backends/flash_attn.py#L692 when it fails. Specifically:

The shapes of:

- `q=query`
- `k=key_cache`
- `v=value_cache`

And the values of:

- `cu_seqlens_q=prefill_meta.query_start_loc`
- `max_seqlen_q=prefill_meta.max_query_len`
- `cu_seqlens_k=prefill_meta.seq_start_loc`
- `max_seqlen_k=max_seq_len`
- `block_table=prefill_meta.block_tables`
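For concreteness, here is a minimal sketch of the kind of logging this asks for, inserted just before the `flash_attn_varlen_func` call in `vllm/attention/backends/flash_attn.py`. The variable names are taken from the linked line; the exact surrounding code in your build may differ.

```python
# Hypothetical debug patch for vllm/attention/backends/flash_attn.py
# (around L692 in 0.5.5), placed immediately before the
# torch.ops.vllm.flash_attn_varlen_func(...) call.
# Log on every call: once the illegal access fires, the CUDA context
# is unusable, so the offending inputs must already be on record.
print(
    f"q_shape: {query.shape}, k_shape: {key_cache.shape}, "
    f"v_shape: {value_cache.shape}\n"
    f"cu_seqlens_q: {prefill_meta.query_start_loc}, "
    f"max_seqlen_q: {prefill_meta.max_query_len}\n"
    f"cu_seqlens_k: {prefill_meta.seq_start_loc}, "
    f"max_seqlen_k: {max_seq_len}\n"
    f"block_table: {prefill_meta.block_tables}",
    flush=True,  # worker processes may buffer stdout
)
```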
I found that sending a few specific requests at certain time intervals can reproduce this bug. I have printed and logged the values you mentioned. I'm sorry that I can't provide the specific content of the requests for now, because they contain some sensitive information; I will try to construct some non-confidential requests later.

```text
q_shape: torch.Size([1620, 16, 128])
k_shape: torch.Size([17554, 16, 2, 128])
v_shape: torch.Size([17554, 16, 2, 128])
cu_seqlens_q: tensor([   0,   68,  390, 1620], device='cuda:0', dtype=torch.int32)
max_seqlen_q: 1230
cu_seqlens_k: tensor([   0,  196,  646, 2004], device='cuda:0', dtype=torch.int32)
max_seqlen_k: 1358
block_table: tensor([[  0,   1,   2,   3,   4,   5,   6,   7,  11,  12,  13,  14,  15,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
                     [  0,   1,   2,   3,   4,   5,   6,   7,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
                     [  0,   1,   2,   3,   4,   5,   6,   7,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113]],
                    device='cuda:0', dtype=torch.int32)
```
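As a quick cross-check (not from the thread itself): assuming vLLM's default KV-cache block size of 16, these values look self-consistent, and all three rows share blocks 0–7, i.e. an 8-block cached prefix, which matches prefix caching being enabled. A hypothetical sanity script for that arithmetic:

```python
import torch

# Values copied from the log above.
cu_seqlens_k = torch.tensor([0, 196, 646, 2004])
blocks_per_seq = [13, 29, 85]  # non-padding entries per block_table row
BLOCK_SIZE = 16                # vLLM's default KV-cache block size (assumption)

seq_lens = (cu_seqlens_k[1:] - cu_seqlens_k[:-1]).tolist()  # [196, 450, 1358]
for seq_len, n_blocks in zip(seq_lens, blocks_per_seq):
    capacity = n_blocks * BLOCK_SIZE
    # Each sequence must fit inside its allocated blocks.
    assert capacity >= seq_len, f"{seq_len} tokens overflow {capacity} slots"
    print(f"seq_len={seq_len:5d} blocks={n_blocks:3d} capacity={capacity:5d}")
```

All three assertions pass, so at least for this batch the crash does not look like a simple block-table under-allocation.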
Your current environment

The output of `python collect_env.py` was not provided.

🐛 Describe the bug
`--enable-prefix-caching` causes `CUDA error: an illegal memory access was encountered`. According to the traceback, this bug seems to be caused by FlashAttention. I notice that PR 7018 and PR 7142 seem to have fixed the problem, but vLLM 0.5.5 still has this bug.
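Since the thread repeatedly asks for a minimal repro, here is a hedged sketch of the kind of script that exercises the same code path (prefix caching with shared-prefix, mixed-length prompts under tp=2). The model name and prompt lengths are placeholders, and this is not a confirmed reproducer:

```python
# Hypothetical repro attempt: shared-prefix prompts with prefix caching
# enabled. Model and lengths are placeholders, not a confirmed trigger.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",  # placeholder; the traceback shows a Qwen2 model
    tensor_parallel_size=2,
    enable_prefix_caching=True,
)

# A prefix long enough to be cached, followed by suffixes of varying length
# so that prefill batches mix cache hits with different query lengths.
shared_prefix = "You are a helpful assistant. " * 50
prompts = [shared_prefix + "question " * (2 ** i) for i in range(1, 8)]

outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=64))
for out in outputs:
    print(out.outputs[0].text[:80])
```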
```text
Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop:
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 69, in start_worker_execution_loop
    output = self.execute_model(execute_model_req=None)
  File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 322, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1415, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 361, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 277, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 210, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 157, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
  File "/usr/local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 692, in forward
    num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func(  # noqa
  File "/usr/local/lib/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.11/site-packages/torch/_library/custom_ops.py", line 236, in backend_impl
    result = self._backend_fns[device_type](*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 48, in flash_attn_varlen_func
    return _flash_attn_varlen_func(
  File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
  File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
  File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

An identical traceback was raised in the second VllmWorkerProcess.
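Not from the thread, but a standard way to make this kind of error point at the kernel that actually faulted (rather than a later synchronization point) is to force synchronous CUDA launches. A sketch, assuming the variables are set before torch/vllm are imported; note that `TORCH_USE_CUDA_DSA` only takes effect if PyTorch itself was built with device-side assertions:

```python
# Sketch: force synchronous kernel launches so the Python traceback
# points at the failing kernel instead of the next sync point.
# Must run before torch (or vllm) is imported.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Only effective on PyTorch builds compiled with device-side
# assertions; on stock wheels this is a no-op.
os.environ["TORCH_USE_CUDA_DSA"] = "1"

import vllm  # noqa: E402  (imported after the env vars are set)
```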