lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. šŸ”„ šŸ”„ šŸ”„
Apache License 2.0

RTDETRv2 training error #406

Open philp123 opened 1 month ago

philp123 commented 1 month ago

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [123,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [75,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [91,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [43,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [59,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [127,0,0], thread: [107,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [127,0,0], thread: [123,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [107,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [123,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [11,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [27,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.

[rank0]:[E806 10:11:47.197701816 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe2d5e9af86 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fe2d5e49d10 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fe2d5f75f08 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fe287f683e6 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fe287f6d600 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fe287f742ba in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe287f766fc in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7fe2d56c7bf4 in /root/anaconda3/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fe2d6bd9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fe2d6c6b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

E0806 10:11:47.450000 139709132887872 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 100) of binary: /root/anaconda3/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train.py FAILED

Failures:

Root Cause (first observed failure):
[0]:
  time      : 2024-08-06_10:11:47
  host      : Lab-PC
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 100)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 100

Environment

Ubuntu 22.04, GeForce RTX 3090, Driver 550.67, CUDA 12.4, Python 3.12.4, torch 2.4.0, torchvision 0.19.0

Execution Command

CUDA_VISIBLE_DEVICES=0 torchrun --master_port=9909 tools/train.py -c configs/rtdetrv2/rtdetrv2_r50vd_m_7x_coco.yml --seed=0

I am trying to train on a custom COCO-format dataset, so I only modified rtdetrv2_pytorch/configs/dataset/coco_detection.yml, setting the corresponding "num_classes" and "remap_mscoco_category: False". I'm not sure whether any other config needs to be changed, or whether missing one is what causes the crash above. Could anyone give a hint about why this happens and how to solve it? That would be very helpful, thanks!
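For context, the device-side assert in the log is the generic CUDA signature of a label index that falls outside the range implied by num_classes. The snippet below is only a minimal sketch of that mechanism, not RT-DETR's actual loss code; the tensor names and sizes are illustrative.

```python
import torch

# Hypothetical setup: a per-class table sized by the configured num_classes.
num_classes = 3
class_table = torch.zeros(num_classes)

# A ground-truth label equal to num_classes (e.g. an unmapped category id)
# is out of range for that table.
labels = torch.tensor([0, 2, 3])

try:
    out = class_table[labels]
except IndexError as e:
    # On CPU this raises a readable IndexError; on CUDA the same lookup
    # fires the IndexKernel.cu "index out of bounds" assert shown above,
    # which then surfaces as "device-side assert triggered".
    print(e)
```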
VimukthiRandika1997 commented 1 month ago

num_classes should be the actual number of classes + 1 when you set remap_mscoco_category: False. That's why you are getting the index error!
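One quick way to pick the value before training is to inspect the custom annotation file. This is a sketch that assumes a standard COCO instances JSON; the path is a placeholder, and the "+ 1" follows the comment above.

```python
import json

# Placeholder path: point this at your custom COCO-format annotation file.
ann_file = "path/to/your/instances_train.json"

with open(ann_file) as f:
    coco = json.load(f)

cats = coco["categories"]
num_cats = len(cats)
max_id = max(c["id"] for c in cats)

print(f"categories in annotations : {num_cats}")
print(f"maximum category id       : {max_id}")

# Per the comment above, with remap_mscoco_category: False the config's
# num_classes should be num_cats + 1. If the category ids are not contiguous,
# it also needs to be at least max_id + 1 so every raw id stays in range.
print(f"suggested num_classes     : {max(num_cats, max_id) + 1}")
```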