Error message:
RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=700635, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809785 milliseconds before timing out
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1456404 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1456401) of binary: /opt/conda/bin/python
Describe the bug
When training on the drone_detection dataset with multiple GPUs, GPU utilization gets stuck at 100% partway through the first epoch and the run then fails with a timeout. Single-GPU training on the same dataset works fine, and multi-GPU training on the COCO dataset also works fine.
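To narrow down which rank diverges before the watchdog fires, a hedged debugging sketch (these environment variables are standard NCCL/PyTorch knobs; the commented launch line is only illustrative, not the exact command used here):

```shell
# Enable verbose distributed logging before relaunching the multi-GPU run.
export NCCL_DEBUG=INFO                 # print NCCL communicator setup and errors
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # log per-rank collective mismatches
export NCCL_ASYNC_ERROR_HANDLING=1     # surface the failure instead of hanging
# then relaunch the same multi-GPU training command, e.g.:
# torchrun --nproc_per_node=4 tools/train.py -c <your_config>.yml
```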
To Reproduce
Only the config file was modified:

```yaml
num_classes: 5
remap_mscoco_category: False

train_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /raid/stu/datasets/drone_detection_coco/train
    ann_file: /raid/stu/datasets/drone_detection_coco/annotations/train.json
    transforms:
      type: Compose
      ops: ~
  shuffle: True
  batch_size: 8
  num_workers: 4
  drop_last: True

val_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /raid/stu/datasets/drone_detection_coco/valid
    ann_file: /raid/stu/datasets/drone_detection_coco/annotations/val.json
    transforms:
      type: Compose
      ops: ~
```
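Since single-GPU training works and multi-GPU only hangs on the custom dataset, one plausible cause (an assumption, not confirmed by this report) is images with zero annotations: with a DistributedSampler each rank sees a different shard, and a rank that hits an empty-target code path may skip a collective, leaving the other ranks blocked in ALLREDUCE until the NCCL watchdog times out. A minimal sketch to check the annotation file for such images (the function name and logic are hypothetical, not part of the RTDETR codebase):

```python
import json
from collections import Counter

def images_without_annotations(ann_file):
    """Return the ids of images that have zero annotation entries.

    Hypothetical diagnostic for a COCO-format annotation file: empty-target
    images are a common trigger for rank divergence in DETR-style multi-GPU
    training, so it is worth knowing whether the custom dataset contains any.
    """
    with open(ann_file) as f:
        coco = json.load(f)
    # Count how many annotation records reference each image id.
    counts = Counter(ann["image_id"] for ann in coco["annotations"])
    return [img["id"] for img in coco["images"] if counts[img["id"]] == 0]
```

If this returns a non-empty list for train.json, filtering those images out (or verifying the training code handles empty targets identically on every rank) would be the next thing to try.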