sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.

RuntimeError: CUDA error: device-side assert triggered when running #271

Open aliencaocao opened 5 months ago

aliencaocao commented 5 months ago
[2024-03-10 10:31:21,586] [   ERROR] model_rpc.py:178 - Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/venv/lib/python3.9/site-packages/sglang/srt/managers/router/model_rpc.py", line 176, in exposed_step
    self.forward_step()
  File "/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/venv/lib/python3.9/site-packages/sglang/srt/managers/router/model_rpc.py", line 187, in forward_step
    new_batch = self.get_new_fill_batch()
  File "/venv/lib/python3.9/site-packages/sglang/srt/managers/router/model_rpc.py", line 285, in get_new_fill_batch
    prefix_indices, last_node = self.tree_cache.match_prefix(req.input_ids)
  File "/venv/lib/python3.9/site-packages/sglang/srt/managers/router/radix_cache.py", line 52, in match_prefix
    value = torch.concat(value)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Not out of VRAM: I set mem_fraction to 0.6 and it is only using 10/16 GB on my V100, on torch 2.2.1 (custom-built vllm on torch 2.2.1), with llava 1.6 7B in GPTQ 8-bit (yes, vllm recently merged support for GPTQ 8-bit).

This happens at a random point after running 100-300 samples. Because it is completely random, it can't be a data issue.

Can't seem to get rid of the error no matter what I do.
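
PyTorch's usual advice for these asynchronous device-side asserts is to rerun with CUDA_LAUNCH_BLOCKING=1 so the failing kernel is reported at its real call site instead of at a later CUDA call. A minimal sketch of relaunching the server that way, assuming it is started via python -m sglang.launch_server; the model path below is a placeholder, not my actual one:

```python
import os
import subprocess

# CUDA errors are reported asynchronously; CUDA_LAUNCH_BLOCKING=1 makes the
# device-side assert surface at the kernel that actually triggered it.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")

# Placeholder relaunch mirroring the setup described above
# (LLaVA 1.6 7B GPTQ 8-bit, mem fraction 0.6).
subprocess.run(
    [
        "python", "-m", "sglang.launch_server",
        "--model-path", "path/to/llava-v1.6-7b-gptq",  # placeholder path
        "--mem-fraction-static", "0.6",
        "--port", "30000",
    ],
    env=env,
)
```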

aliencaocao commented 5 months ago

This happens when I'm using run_batch with an effective batch size > 1. Looks like a race condition somewhere.
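
To illustrate the kind of batched call I mean, here is a minimal sketch with the sglang frontend and a RuntimeEndpoint backend; the program body and inputs are placeholders, not my actual workload:

```python
import sglang as sgl

@sgl.function
def describe(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

# Point the frontend at the running server.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# run_batch with more than one in-flight request is where the crash shows up.
image_paths = ["img_000.jpg", "img_001.jpg"]  # placeholder inputs
states = describe.run_batch(
    [{"image_path": p, "question": "Describe the image."} for p in image_paths],
    num_threads=8,
)
print([st["answer"] for st in states])
```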

Reichenbachian commented 5 months ago

Bump on this. We're running into it too. Would appreciate guidance.

vonchenplus commented 5 months ago

Same problem and there are several other places where this happens.

m0g1cian commented 3 months ago

I think it is due to a misconfiguration of the maximum context length in SGLang:

https://github.com/sgl-project/sglang/issues/461#issuecomment-2123974167
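
In other words, if the server is launched with a context length larger than what the model actually supports, over-long requests can index past the token pool. A minimal sanity check, with the model id below as a placeholder:

```python
from transformers import AutoConfig

# Check the model's native context limit; the model id is a placeholder.
cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print("native max_position_embeddings:", cfg.max_position_embeddings)

# The limit can then be pinned explicitly when launching the server, e.g.:
#   python -m sglang.launch_server --model-path <model> --context-length 8192
```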

aliencaocao commented 3 months ago

@m0g1cian it's a different issue here. I am getting the error at value = torch.concat(value), not in req_to_token. The model I use also doesn't have the context-length mismatch issue described there.

m0g1cian commented 3 months ago

> @m0g1cian it's a different issue here. I am getting the error at value = torch.concat(value), not in req_to_token. The model I use also doesn't have the context-length mismatch issue described there.

I think I had a similar issue. The symptoms I saw involved multiple CUDA errors, and the errors appeared fairly consistently on extra-long prompts. I had no issue when running Qwen 1.5 (32K context length), but when I switched to Llama 3 (8K context length), I started to hit these errors:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [4,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [4,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 196, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 226, in forward_step
    self.forward_decode_batch(self.running_batch)
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 551, in forward_decode_batch
    ) = self.model_runner.forward(batch, ForwardMode.DECODE)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 452, in forward
    return self.forward_decode(batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 407, in forward_decode
    input_metadata = InputMetadata.create(
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 191, in create
    total_num_tokens = int(torch.sum(seq_lens))
                       ^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 196, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 226, in forward_step
    self.forward_decode_batch(self.running_batch)
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 551, in forward_decode_batch
    ) = self.model_runner.forward(batch, ForwardMode.DECODE)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 452, in forward
    return self.forward_decode(batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 407, in forward_decode
    input_metadata = InputMetadata.create(
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 191, in create
    total_num_tokens = int(torch.sum(seq_lens))
                       ^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

INFO:     127.0.0.1:37816 - "POST /generate HTTP/1.1" 200 OK

 65%|██████▌   | 13/20 [01:07<00:07,  1.03s/it][rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 1] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f371741db25 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f3717545718 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3718743e36 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f3718747f38 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f371874d5ac in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f371874e31c in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 1 Rank 1] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f371741db25 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f3717545718 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3718743e36 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f3718747f38 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f371874d5ac in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f371874e31c in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7f37183d0e33 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #4: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)

Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 196, in exposed_step
    self.forward_step()
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 207, in forward_step
    new_batch = self.get_new_fill_batch()
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 313, in get_new_fill_batch
    prefix_indices, last_node = self.tree_cache.match_prefix(req.input_ids)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/radix_cache.py", line 54, in match_prefix
    value = torch.concat(value)
            ^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f371741db25 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f3717545718 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3718743e36 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f3718747f38 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f371874d5ac in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f371874e31c in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 1 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f371741db25 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f3717545718 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3718743e36 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f3718747f38 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f371874d5ac in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f371874e31c in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7f37183d0e33 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #4: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)
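
A note on why the Python stack traces above point at harmless calls like torch.concat and torch.sum: the IndexKernel.cu assert at the top of the log fires inside an earlier indexing kernel, and since CUDA reports errors asynchronously, every subsequent CUDA call on that device then fails. A tiny repro of the same class of failure, assuming a CUDA device is available:

```python
import torch

# Stand-in for a fixed-size pool (e.g. req_to_token); any index past its end
# trips the same "index out of bounds" device-side assert seen above.
pool = torch.zeros(8192, device="cuda")
bad_index = torch.tensor([10000], device="cuda")

out = pool[bad_index]        # launches the offending gather kernel
torch.cuda.synchronize()     # the assert only surfaces here, or at any later CUDA call
```
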
aliencaocao commented 3 months ago

I'm using it at 4x the context length.
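
Given the 4x context length mentioned above and the long-prompt correlation reported earlier in the thread, one way to isolate whether request length is the trigger is to filter requests client-side before they reach the server. A minimal sketch; the tokenizer id and limits are placeholders:

```python
from transformers import AutoTokenizer

MAX_CONTEXT = 8192          # placeholder: the context length the server is launched with
RESERVE_FOR_OUTPUT = 256    # leave room for generated tokens

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder id

def fits_context(prompt: str) -> bool:
    # Drop requests whose prompt plus expected output would exceed the served limit.
    return len(tokenizer.encode(prompt)) + RESERVE_FOR_OUTPUT <= MAX_CONTEXT

raw_prompts = ["example prompt"]  # placeholder inputs
prompts = [p for p in raw_prompts if fits_context(p)]
```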