Closed: wonderisland closed this issue 1 month ago
Same issue here. Did you solve it?
Setting BLOCK_H to 32 on line 536 might help: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/layers/triton_attention/decode_attention.py
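To illustrate the idea, here is a small, self-contained sketch of capping the head-block size at 32. The helper name and the exact computation are assumptions for illustration only, not the actual code around line 536 of decode_attention.py:

# Hypothetical sketch: clamp the per-KV-head query block size (BLOCK_H) to 32,
# which is the workaround suggested above. In the real file this value is
# presumably passed to the grouped decode kernel (_fwd_grouped_kernel_stage2,
# the kernel that crashes in the traceback below).
def pick_block_h(kv_group_num: int, cap: int = 32) -> int:
    """Round the group size up to a power of two, then cap it at `cap`."""
    block_h = 1
    while block_h < kv_group_num:
        block_h *= 2
    return min(block_h, cap)

if __name__ == "__main__":
    print(pick_block_h(16))   # -> 16
    print(pick_block_h(128))  # -> 32 (capped)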
Thanks for reporting this. It has been fixed by https://github.com/sgl-project/sglang/commit/a68cb201dd5f4ae6155b324d22054bbb0de15fba. We also released a new version with this fix. Can you try v0.3.1.post3?
Checklist
Describe the bug
Crash log (MLA enabled, CUDA graph disabled, as in the reproduction below):

dlccdefifz69nk44-master-0:32478:34896 [2] NCCL INFO comm 0x55b978e00250 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 30 commId 0xc9c72df61c4f59e1 - Init COMPLETE
INFO: 127.0.0.1:50812 - "POST /generate HTTP/1.1" 200 OK
[15:57:49] The server is fired up and ready to roll!
[15:58:39 TP0] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 1, cache hit rate: 5.88%, #running-req: 0, #queue-req: 0
[15:58:51 TP0] Decode batch. #running-req: 1, #token: 43, token usage: 0.00, gen throughput (token/s): 0.53, #queue-req: 0
[15:58:58 TP0] Decode batch. #running-req: 1, #token: 83, token usage: 0.00, gen throughput (token/s): 6.20, #queue-req: 0
[15:59:04 TP0] Decode batch. #running-req: 1, #token: 123, token usage: 0.00, gen throughput (token/s): 6.18, #queue-req: 0
[15:59:11 TP0] Decode batch. #running-req: 1, #token: 163, token usage: 0.00, gen throughput (token/s): 5.83, #queue-req: 0
[rank6]:[E919 15:59:14.635045572 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3fe4177f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f3fe4126d10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4004055f08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3f7617f3e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f3f76184600 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f3f7618b2ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3f7618d6fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7f4018529253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7f40725b8ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7f4072649bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank3]:[E919 15:59:14.635224142 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(stack trace identical to rank 6 above)
[rank0]:[E919 15:59:14.635248625 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

When CUDA graph is enabled together with MLA, the server fails during CUDA graph capture:

[17:39:38 TP3] Load weight begin. avail mem=77.08 GB
[2024-09-19 17:39:43,201] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-19 17:39:43,207] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-19 17:39:43,274] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-19 17:39:43,287] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-19 17:39:43,291] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-19 17:39:43,314] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-19 17:39:43,317] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-19 17:39:43,317] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Cache shape torch.Size([163840, 64])
Loading safetensors checkpoint shards: 0% Completed | 0/55 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 2% Completed | 1/55 [00:06<06:05, 6.77s/it]
...
Loading safetensors checkpoint shards: 96% Completed | 53/55 [03:35<00:06, 3.18s/it]
[17:43:23 TP6] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=20.62 GB
[17:43:23 TP7] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=20.91 GB
Loading safetensors checkpoint shards: 98% Completed | 54/55 [03:38<00:03, 3.20s/it]
[17:43:23 TP4] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=20.62 GB
Loading safetensors checkpoint shards: 100% Completed | 55/55 [03:41<00:00, 3.24s/it]
Loading safetensors checkpoint shards: 100% Completed | 55/55 [03:41<00:00, 4.03s/it]
[17:43:27 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=20.62 GB
[17:43:27 TP2] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=20.62 GB
[17:43:27 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=20.91 GB
[17:43:27 TP3] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=20.62 GB
[17:43:28 TP5] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=20.62 GB
[17:43:28 TP1] Memory pool end. avail mem=18.20 GB
[17:43:28 TP5] Memory pool end. avail mem=18.20 GB
[17:43:28 TP4] Memory pool end. avail mem=18.20 GB
[17:43:28 TP6] Memory pool end. avail mem=18.20 GB
[17:43:28 TP2] Memory pool end. avail mem=18.20 GB
[17:43:28 TP3] Memory pool end. avail mem=18.20 GB
[17:43:28 TP0] Memory pool end. avail mem=18.48 GB
[17:43:28 TP7] Memory pool end. avail mem=18.48 GB
[17:43:28 TP4] Capture cuda graph begin. This can take up to several minutes.
[17:43:28 TP1] Capture cuda graph begin. This can take up to several minutes.
[17:43:28 TP3] Capture cuda graph begin. This can take up to several minutes.
[17:43:28 TP5] Capture cuda graph begin. This can take up to several minutes.
[17:43:28 TP2] Capture cuda graph begin. This can take up to several minutes.
[17:43:28 TP6] Capture cuda graph begin. This can take up to several minutes.
[17:43:28 TP0] Capture cuda graph begin. This can take up to several minutes.
[17:43:28 TP7] Capture cuda graph begin. This can take up to several minutes.
[17:43:35 TP3] Exception in run_tp_server:
Traceback (most recent call last):
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 146, in __init__
    self.capture()
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 175, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 216, in capture_one_batch_size
    run_once()
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 210, in run_once
    return forward(input_ids, input_metadata.positions, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 663, in forward
    hidden_states = self.model(input_ids, positions, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 632, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 579, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 474, in forward
    attn_output = self.attn(q_input, k_input, v_input, input_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 58, in forward
    return input_metadata.attn_backend.forward(q, k, v, self, input_metadata)
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/layers/attention_backend.py", line 69, in forward
    return self.forward_decode(q, k, v, layer, input_metadata)
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/layers/attention_backend.py", line 466, in forward_decode
    self.decode_attention_fwd(
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/layers/triton_attention/decode_attention.py", line 623, in decode_attention_fwd
    _decode_grouped_softmax_reducev_fwd(
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/layers/triton_attention/decode_attention.py", line 545, in _decode_grouped_softmax_reducev_fwd
    _fwd_grouped_kernel_stage2[grid](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 691, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 365, in __call__
    self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 959, in run_tp_server
    model_server = ModelTpServer(
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 100, in __init__
    self.model_runner = ModelRunner(
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 128, in __init__
    self.init_cuda_graphs()
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 468, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/mnt/data/hectorgao/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 148, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Triton Error [CUDA]: an illegal memory access was encountered
Possible solutions:
Reproduction
HTTP server command (CUDA graph is disabled because it needs extra GPU memory; I only want to run inference for long responses and do not care about latency):

export CUDA_LAUNCH_BLOCKING=1
python -m sglang.launch_server --model-path /mnt/data/models/DeepSeek-V2.5 --tp 8 --mem-fraction-static 0.8 --enable-mla --trust-remote-code --port 30000 --disable-cuda-graph
HTTP request:

curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "please repeat output word ok, do not stop",
    "sampling_params": {
      "max_new_tokens": 16000,
      "temperature": 0.8,
      "repetition_penalty": 0.1
    }
  }'
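For convenience, here is a minimal Python equivalent of the curl call above (a sketch only; it assumes the third-party requests package is installed and targets the same /generate endpoint and sampling parameters as the reproduction):

# Minimal Python client mirroring the curl reproduction above.
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "please repeat output word ok, do not stop",
        "sampling_params": {
            "max_new_tokens": 16000,
            "temperature": 0.8,
            "repetition_penalty": 0.1,
        },
    },
)
print(resp.json())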
Environment
sglang version: https://github.com/sgl-project/sglang/releases/tag/v0.3.1.post1
GPU: A800
root@dlccdefifz69nk44-master-0:~# nvidia-smi
Thu Sep 19 16:08:29 2024
NVIDIA-SMI 535.54.03    Driver Version: 535.54.03    CUDA Version: 12.3
GPU 0: NVIDIA A800-SXM4-80GB | Persistence-M: On | Bus-Id: 00000000:00:01.0 | Disp.A: Off | Uncorr. ECC: 0 | Fan: N/A | 28C | P0 | 70W / 400W | 73618MiB / 81920MiB | 0% util | Compute M.: Default | MIG M.: Disabled
GPU 1: NVIDIA A800-SXM4-80GB | Persistence-M: On | Bus-Id: 00000000:00:02.0 | Disp.A: Off | Uncorr. ECC: 0 | Fan: N/A | 31C | P0 | 71W / 400W | 74050MiB / 81920MiB | 0% util | Compute M.: Default | MIG M.: Disabled
GPU 2: NVIDIA A800-SXM4-80GB | Persistence-M: On | Bus-Id: 00000000:00:03.0 | Disp.A: Off | Uncorr. ECC: 0 | Fan: N/A | 30C | P0 | 72W / 400W | 74050MiB / 81920MiB | 0% util | Compute M.: Default | MIG M.: Disabled
GPU 3: NVIDIA A800-SXM4-80GB | Persistence-M: On | Bus-Id: 00000000:00:04.0 | Disp.A: Off | Uncorr. ECC: 0 | Fan: N/A | 29C | P0 | 71W / 400W | 74050MiB / 81920MiB | 0% util | Compute M.: Default | MIG M.: Disabled
GPU 4: NVIDIA A800-SXM4-80GB | Persistence-M: On | Bus-Id: 00000000:00:05.0 | Disp.A: Off | Uncorr. ECC: 0 | Fan: N/A | 28C | P0 | 70W / 400W | 74050MiB / 81920MiB | 0% util | Compute M.: Default | MIG M.: Disabled
GPU 5: NVIDIA A800-SXM4-80GB | Persistence-M: On | Bus-Id: 00000000:00:06.0 | Disp.A: Off | Uncorr. ECC: 0 | Fan: N/A | 30C | P0 | 70W / 400W | 74050MiB / 81920MiB | 0% util | Compute M.: Default | MIG M.: Disabled
GPU 6: NVIDIA A800-SXM4-80GB | Persistence-M: On | Bus-Id: 00000000:00:07.0 | Disp.A: Off | Uncorr. ECC: 0 | Fan: N/A | 30C | P0 | 70W / 400W | 74050MiB / 81920MiB | 0% util | Compute M.: Default | MIG M.: Disabled
GPU 7: NVIDIA A800-SXM4-80GB | Persistence-M: On | Bus-Id: 00000000:00:08.0 | Disp.A: Off | Uncorr. ECC: 0 | Fan: N/A | 27C | P0 | 69W / 400W | 73618MiB / 81920MiB | 0% util | Compute M.: Default | MIG M.: Disabled
Processes: none listed
OS: Linux dlccdefifz69nk44-master-0 4.19.91-014.15-kangaroo.alios7.x86_64 #1 SMP Wed Jul 10 15:22:10 CST 2024 x86_64 x86_64 x86_64 GNU/Linux