hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0

RuntimeError: CUDA error: an illegal memory access was encountered #694

Closed CacacaLalala closed 1 month ago

CacacaLalala commented 2 months ago

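Hello, and thank you very much for open-sourcing such a great project. When I run multi-node training with this code, I frequently hit RuntimeError: CUDA error: an illegal memory access was encountered, and it occurs at seemingly random points. Is this error caused by running out of memory, or by something else? The full error is below; export CUDA_LAUNCH_BLOCKING=1 was already set.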

Epoch 0: 35%|__ | 759/2142 [53:27<19:18:44, 50.27s/it, loss=0.457, step=749, global_step=749]Traceback (most recent call last):
File "/Open-Sora-v1.2-2.88B/scripts/train.py", line 551, in <module>
main()
File "/Open-Sora-v1.2-2.88B/scripts/train.py", line 371, in main
booster.backward(loss=loss, optimizer=optimizer)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/booster/booster.py", line 176, in backward
optimizer.backward(loss)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 536, in backward
loss.backward(retain_graph=retain_graph)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/autograd/__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 272, in grad_handler
LowLevelZeroOptimizer.add_to_bucket(param, group_id, bucket_store, param_store, grad_store)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 519, in add_to_bucket
LowLevelZeroOptimizer.run_reduction(bucket_store, grad_store)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 297, in run_reduction
bucket_store.build_grad_in_bucket()
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/bookkeeping/bucket_store.py", line 106, in build_grad_in_bucket
grad_current_rank = grad_list[rank].clone().detach()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank63]:[E ProcessGroupNCCL.cpp:1182] [Rank 63] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f83f84d87 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2f83f3575f in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f2f840558a8 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f2f851283ac in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f2f8512c4c8 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f2f8512fbfa in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2f85130839 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3b55 (0x7f2fcee5ab55 in /root/miniconda3/envs/opensora/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f2fcffcb609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f2fcfd96133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 63] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
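
Because the failure is reported as random, it may help to check across several crashed runs whether the same rank (and so the same node) keeps failing. The following is a minimal sketch, not part of Open-Sora or ColossalAI: it scans saved log text for watchdog lines in the format shown above and counts failures per rank. The names `failing_ranks` and `node_of` are hypothetical helpers, and the default of 8 GPUs per node with ranks assigned contiguously per node is an assumption that this issue does not confirm.

```python
import re
from collections import Counter

# Matches the watchdog line format seen in the log above, for example:
# "[rank63]:[E ProcessGroupNCCL.cpp:1182] [Rank 63] NCCL watchdog thread
#  terminated with exception: CUDA error: ..."
# The "what():" repeat of the same message is deliberately not matched,
# so each crash is counted once per rank.
WATCHDOG_RE = re.compile(
    r"\[rank(?P<rank>\d+)\]:\[E ProcessGroupNCCL\.cpp:\d+\] "
    r"\[Rank \d+\] NCCL watchdog thread terminated with exception"
)


def failing_ranks(log_text: str) -> Counter:
    """Count NCCL watchdog failures per global rank in a training log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            counts[int(match.group("rank"))] += 1
    return counts


def node_of(rank: int, gpus_per_node: int = 8) -> int:
    """Map a global rank to a node index.

    Assumption: ranks are assigned contiguously per node and every node
    has `gpus_per_node` GPUs; adjust to the actual cluster layout.
    """
    return rank // gpus_per_node
```

If the counts from several runs concentrate on one node, that would point toward a node-specific cause rather than the training code, though this sketch alone cannot confirm either.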

Looking forward to your reply!

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

JonathanLi19 commented 1 month ago

I ran into the same problem. Did you manage to solve it?