Error message:
RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=700635, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809785 milliseconds before timing out
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1456404 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1456401) of binary: /opt/conda/bin/python
Describe the bug
When training on the drone_detection dataset with multiple GPUs, GPU utilization gets stuck at 100% partway through the first epoch and the run then fails with a timeout. Single-GPU training on the same dataset works fine, and multi-GPU training on the COCO dataset also works fine.
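To narrow down which rank diverges before the watchdog fires, a hedged debugging sketch (these environment variables are standard NCCL/PyTorch knobs; the commented launch line is only illustrative, not the exact command used here):

```shell
# Enable verbose distributed logging before relaunching the multi-GPU run.
export NCCL_DEBUG=INFO                 # print NCCL communicator setup and errors
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # log per-rank collective mismatches
export NCCL_ASYNC_ERROR_HANDLING=1     # surface the failure instead of hanging
# then relaunch the same multi-GPU training command, e.g.:
# torchrun --nproc_per_node=4 tools/train.py -c <your_config>.yml
```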
To Reproduce
Only the config file was modified:

```yaml
num_classes: 5
remap_mscoco_category: False

train_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /raid/stu/datasets/drone_detection_coco/train
    ann_file: /raid/stu/datasets/drone_detection_coco/annotations/train.json
    transforms:
      type: Compose
      ops: ~
  shuffle: True
  batch_size: 8
  num_workers: 4
  drop_last: True

val_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /raid/stu/datasets/drone_detection_coco/valid
    ann_file: /raid/stu/datasets/drone_detection_coco/annotations/val.json
    transforms:
      type: Compose
      ops: ~
```
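Since single-GPU training works and multi-GPU only hangs on the custom dataset, one plausible cause (an assumption, not confirmed by this report) is images with zero annotations: with a DistributedSampler each rank sees a different shard, and a rank that hits an empty-target code path may skip a collective, leaving the other ranks blocked in ALLREDUCE until the NCCL watchdog times out. A minimal sketch to check the annotation file for such images (the function name and logic are hypothetical, not part of the RTDETR codebase):

```python
import json
from collections import Counter

def images_without_annotations(ann_file):
    """Return the ids of images that have zero annotation entries.

    Hypothetical diagnostic for a COCO-format annotation file: empty-target
    images are a common trigger for rank divergence in DETR-style multi-GPU
    training, so it is worth knowing whether the custom dataset contains any.
    """
    with open(ann_file) as f:
        coco = json.load(f)
    # Count how many annotation records reference each image id.
    counts = Counter(ann["image_id"] for ann in coco["annotations"])
    return [img["id"] for img in coco["images"] if counts[img["id"]] == 0]
```

If this returns a non-empty list for train.json, filtering those images out (or verifying the training code handles empty targets identically on every rank) would be the next thing to try.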