lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥

GPUs hang partway through multi-GPU training #206

Open leo-xuxl opened 4 months ago

leo-xuxl commented 4 months ago

Star RTDETR: please star RT-DETR on its project page first to support this project and help more people discover it.


Describe the bug: When training on the drone_detection dataset with multiple GPUs, GPU utilization gets stuck at 100% partway through the first epoch and the run eventually fails with a timeout error. Single-GPU training on the same dataset works fine, and multi-GPU training on the COCO dataset is also normal.

To Reproduce: Only the config file was modified:

num_classes: 5
remap_mscoco_category: False

train_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /raid/stu/datasets/drone_detection_coco/train
    ann_file: /raid/stu/datasets/drone_detection_coco/annotations/train.json
    transforms:
      type: Compose
      ops: ~
  shuffle: True
  batch_size: 8
  num_workers: 4
  drop_last: True

val_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /raid/stu/datasets/drone_detection_coco/valid
    ann_file: /raid/stu/datasets/drone_detection_coco/annotations/val.json
    transforms:
      type: Compose
      ops: ~
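Since multi-GPU training works on COCO but hangs only on this custom dataset, one reasonable first check is the annotation files themselves: images with zero annotations or category ids outside the range implied by num_classes: 5 are a common source of ranks diverging under DDP. A minimal sketch, assuming the paths from the config above and standard COCO-format JSON (this check is not part of the original report):

```python
import json

# Hypothetical sanity check of the custom COCO-style annotation files.
for split, ann_file in [
    ("train", "/raid/stu/datasets/drone_detection_coco/annotations/train.json"),
    ("val", "/raid/stu/datasets/drone_detection_coco/annotations/val.json"),
]:
    with open(ann_file) as f:
        coco = json.load(f)

    image_ids = {img["id"] for img in coco["images"]}
    annotated_ids = {ann["image_id"] for ann in coco["annotations"]}
    category_ids = sorted({ann["category_id"] for ann in coco["annotations"]})

    # Report images that have no annotations and the category ids actually used.
    print(f"{split}: {len(image_ids)} images, "
          f"{len(image_ids - annotated_ids)} without annotations, "
          f"category ids used: {category_ids}")
```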

Error message:

RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=700635, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809785 milliseconds before timing out

[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1456404 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1456401) of binary: /opt/conda/bin/python
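Note that the watchdog only fires after the default 30-minute collective timeout, so the ALLREDUCE in the traceback is reported long after the real hang starts. A minimal sketch for getting more useful diagnostics, assuming the process group is initialized manually rather than through the repository's launch script (the environment variables can equally be exported in the shell before torchrun; exact variable names vary across PyTorch versions):

```python
import datetime
import os

import torch.distributed as dist

# Verbose NCCL logging and eager error reporting instead of a silent hang.
# These are NCCL / PyTorch environment variables; newer PyTorch versions may
# use a TORCH_NCCL_* prefix for the async-error-handling flag.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# Shorten the collective timeout (the default 30 minutes matches the
# Timeout(ms)=1800000 in the log above) so the hang surfaces sooner.
dist.init_process_group(backend="nccl",
                        timeout=datetime.timedelta(minutes=10))
```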

wannabetter commented 3 weeks ago

I've also been wrestling with this problem for a long time. Have you managed to solve it?

leo-xuxl commented 2 weeks ago

> I've also been wrestling with this problem for a long time. Have you managed to solve it?

You can take a look at this issue: https://github.com/lyuwenyu/RT-DETR/issues/242