open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

VarifocalNet dist_train keep waiting #4535

Closed FX-STAR closed 3 years ago

FX-STAR commented 3 years ago

My env: CUDA 10.2, torch==1.6.0, mmdetection==2.8.0, mmcv==1.2.4. After some iterations the GPU-Util stays at 100%, but the process keeps waiting and makes no further progress. Any advice? Thanks.

v-qjqs commented 3 years ago

Hi @whoNamedCody, would using a smaller batch_size help? I guess it might be caused by running out of CUDA memory?
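For reference, lowering the per-GPU batch size in an mmdetection 2.x config is done through the `data` dict; a minimal sketch (field values here are placeholders, adjust to your own config):

```python
# Hypothetical excerpt of a user config: reduce the per-GPU batch size.
# In mmdetection 2.x the effective batch size is samples_per_gpu * num_GPUs.
data = dict(
    samples_per_gpu=1,   # lowered from e.g. 2 to reduce CUDA memory pressure
    workers_per_gpu=2,
)
```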

FX-STAR commented 3 years ago

I'm sure the Memory-Usage is not full, and training on a single GPU works normally.

ZwwWayne commented 3 years ago

The hang is sometimes caused by different computation graphs on different GPUs, which usually happens when you are training on images with empty GT. You may check that first.
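For example, a quick way to check a COCO-style annotation file for images without any GT boxes (a minimal sketch; the annotation path is a placeholder for your own dataset):

```python
import json
from collections import defaultdict

# Hypothetical path to a COCO-style training annotation file.
ann_file = 'data/coco/annotations/instances_train2017.json'

with open(ann_file) as f:
    coco = json.load(f)

# Count annotations per image id.
ann_count = defaultdict(int)
for ann in coco['annotations']:
    ann_count[ann['image_id']] += 1

empty = [img['id'] for img in coco['images'] if ann_count[img['id']] == 0]
print(f"{len(empty)} of {len(coco['images'])} images have no GT annotations")
```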

JerExJs commented 3 years ago

The hang is sometimes caused by different computation graphs on different GPUs, which usually happens when you are training on images with empty GT. You may check that first.

Hi @ZwwWayne, I meet the same problem.

When I set filter_empty_gt=False (because there are many empty GT images in my custom dataset), dist_train for FCOS also hangs. Is there any way or plan to solve this?

Thanks.
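One workaround that is sometimes suggested for DDP hangs caused by branches that produce no gradients on some ranks is to let DistributedDataParallel tolerate unused parameters. In mmdetection 2.x this can be switched on from the config; a sketch, not a confirmed fix for this issue, and it may slow training down:

```python
# Hypothetical config snippet: keep empty-GT images but allow DDP to skip
# parameters that receive no gradient on a given iteration.
data = dict(
    train=dict(filter_empty_gt=False),  # keep images without GT boxes
)
find_unused_parameters = True  # forwarded to MMDistributedDataParallel
```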

chenzyhust commented 3 years ago

The hang is sometimes caused by different computation graphs on different GPUs, which usually happens when you are training on images with empty GT. You may check that first.

I met the same problem when training ResNeSt. Could you help us solve it?

chenzyhust commented 3 years ago

@ZwwWayne

ZwwWayne commented 3 years ago

The implementation might not be robust enough; we will check it out.
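For readers hitting the same hang: a common pattern for keeping the computation graph identical across GPUs when a sample has no GT boxes is to return a zero loss that still depends on the network outputs, so every rank produces gradients for the same parameters. A minimal, hypothetical sketch (not the actual VarifocalNet loss code):

```python
import torch

def bbox_loss_with_empty_gt(pred_boxes: torch.Tensor,
                            gt_boxes: torch.Tensor) -> torch.Tensor:
    """Hypothetical bbox loss that stays in the graph when there is no GT."""
    if gt_boxes.numel() == 0:
        # Multiply by zero instead of returning a constant: the loss value is 0,
        # but gradients still flow through pred_boxes on every rank.
        return pred_boxes.sum() * 0
    # Placeholder loss for the non-empty case (e.g. L1 on matched boxes).
    return torch.nn.functional.l1_loss(pred_boxes[:gt_boxes.size(0)], gt_boxes)
```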