Experiencing the same when using LoRA requests...
Hi! I'm seeing the same thing when using LoRA. Do you have a solution?
When I load the Llama model, some GPUs hit this while others are fine.
I'm also seeing the same issues on a clean server installation in GCP. My steps to reproduce were:
CUDA error: an illegal memory access was encountered
Still seeing this on Mixtral
INFO: 172.16.0.88:2118 - "POST /v1/completions HTTP/1.1" 200 OK
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7c5ec7d7a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7c5ec7d2ab25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7c5ec818b718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7c5e7ba4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7c5e7ba4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7c5e7ba545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7c5e7ba5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7c5ec74b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7c5ec8a92ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7c5ec8b23a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7c5ec7d7a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7c5ec7d2ab25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7c5ec818b718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7c5e7ba4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7c5e7ba4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7c5e7ba545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7c5e7ba5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7c5ec74b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7c5ec8a92ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7c5ec8b23a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[2024-05-17 07:35:09,516 E 1 6539] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 1 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
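The watchdog trace above recommends rerunning with `CUDA_LAUNCH_BLOCKING=1` so the illegal access is reported at the faulting call rather than at some later API call. A minimal sketch of applying it from Python follows (it must take effect before torch initializes CUDA; exporting it in the shell or Docker environment works equally well). `TORCH_USE_CUDA_DSA`, by contrast, is a compile-time option and only helps if PyTorch was built with it.

```python
import os

# Must be set before the first CUDA call, hence before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous kernel launches

import torch  # noqa: E402

x = torch.ones(4, device="cuda")  # a faulting kernel would now raise here
print(x.sum().item())
```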
Still seeing this on a totally different H100 system.
Same problem here with H100s and the latest vllm==0.4.2.
@pseudotensor I have discovered an integer overflow in the `fused_moe_kernel`, a Triton kernel called by MoE models. The overflow will sometimes cause CUDA illegal memory access errors. I don't know whether this overflow is the cause of your failure, but since you are using the Mixtral model (a MoE), you might be affected. If you'd like to check, could you add the following assertion here:
`tl.device_assert(off_experts * stride_be >= 0, "off_experts * stride_be overflows!")`
and then rerun your program with the environment variables `CUDA_LAUNCH_BLOCKING=1 TRITON_DEBUG=1` set (they should be set in the Docker container), and with the flag `--enforce-eager` passed to the Docker entrypoint?
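To see the failure mode concretely, here is a minimal plain-PyTorch sketch (not the actual Triton kernel; the values are made up) of how a 32-bit product of a large expert offset and stride can wrap negative and produce an out-of-bounds pointer:

```python
import torch

# Made-up values chosen so the int32 product exceeds 2**31 - 1.
off_experts = torch.tensor(70_000, dtype=torch.int32)
stride_be = torch.tensor(50_000, dtype=torch.int32)

prod32 = off_experts * stride_be                                  # wraps in int32
prod64 = off_experts.to(torch.int64) * stride_be.to(torch.int64)  # widened first

print(prod32.item())  # -794967296: a negative, out-of-bounds offset
print(prod64.item())  # 3500000000: the intended offset
```

Widening one operand to 64 bits before the multiply (as in `prod64`) is the usual fix for this class of kernel bug.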
Same problem here when running llama-7b with input_len >= 4096 and tensor_parallel_size > 1, on 8x A800. Did anyone solve it?
The same error occurs here. Did you solve it, or is there a way to skip this?
Still seeing this, but only when using LoRA. I am currently running Llama3-8b with `tensor_parallel_size=8` and `max_model_len=1250`. The same run without LoRA works flawlessly.
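For context, here is a minimal sketch of the configuration described above (the model name, adapter name, and adapter path are placeholders, not taken from the report):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Per the report, the run only fails when a LoRA request is attached.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",  # placeholder Llama3-8b checkpoint
    tensor_parallel_size=8,
    max_model_len=1250,
    enable_lora=True,
)
out = llm.generate(
    "Hello",
    SamplingParams(max_tokens=16),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/adapter"),  # placeholder adapter
)
print(out[0].outputs[0].text)
```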
This might be related: https://stackoverflow.com/questions/68106457/pytorch-cuda-error-an-illegal-memory-access-was-encountered
The root problem could be an OOM caused by prefix caching. The solution in the post above is to call `torch.cuda.empty_cache()`, so that would make sense.
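A minimal sketch of that workaround (note it only releases PyTorch's cached, unused allocator blocks; it cannot undo a genuine illegal access):

```python
import torch

# Finish pending GPU work first so any asynchronous error surfaces here.
torch.cuda.synchronize()
# Return unused cached memory from PyTorch's caching allocator to the driver.
torch.cuda.empty_cache()
```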
Closing, since the original Mixtral model has no longer been hitting this as of 0.4.3+.
Your current environment
How it was run:
On:
🐛 Describe the bug
After 5 days of being up, the server eventually hit this. Note the endpoint was heavily used for all 5 days; nothing special apart from maybe more guided_json traffic today.
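For reference, a sketch of the kind of guided_json completion request the endpoint was serving (the server address, model name, and schema are placeholders; `guided_json` is vLLM's OpenAI-API extension for schema-constrained decoding):

```python
import requests

# Hypothetical request resembling the traffic described above.
resp = requests.post(
    "http://172.16.0.88:5000/v1/completions",  # placeholder host/port
    json={
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model
        "prompt": "Return a JSON object with one key, 'answer'.",
        "max_tokens": 64,
        # vLLM extension: constrain decoding to this JSON schema.
        "guided_json": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
)
print(resp.json())
```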