hyz-xmaster / VarifocalNet

VarifocalNet: An IoU-aware Dense Object Detector
Apache License 2.0

dist_train keeps waiting with multiple GPUs and samples_per_gpu = 1 #4

Open zvadaszi opened 3 years ago

zvadaszi commented 3 years ago

First of all, thank you for your work and for your repo.

Environment:

PyTorch 1.5.1, CUDA 10.2, cuDNN 7.6.5, MMDetection 2.3.0, 4x V100 16GB

My config file is based on vfnet_r50_fpn_mstrain_2x, modified for a custom dataset with large images (2560x1440) and mainly small objects (10-60 px).

  1. Training with multiple GPUs and samples_per_gpu = 1, workers_per_gpu = 1: training hangs at the very beginning, with GPU utilization at 100% on all GPUs.

  2. Training with multiple GPUs and samples_per_gpu = 2, workers_per_gpu = 2 (and a smaller image size): training runs normally.
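
For reference, the two cases differ in the data-loading settings of the config. The following is only a minimal sketch, assuming the usual MMDetection 2.x convention of a `data` dict with `samples_per_gpu` and `workers_per_gpu` keys. The names `data_hanging`, `data_working`, and `effective_batch_size` are hypothetical and used purely for illustration; in a real config the dict is named `data` and also holds the dataset definitions, which are omitted here.

```python
# Hypothetical excerpt in the style of an MMDetection 2.x config.
# Only the two keys named in the report are shown.

# Case 1: reported to hang at startup on multiple GPUs.
data_hanging = dict(samples_per_gpu=1, workers_per_gpu=1)

# Case 2: reported to train normally (with a smaller image size).
data_working = dict(samples_per_gpu=2, workers_per_gpu=2)


def effective_batch_size(data_cfg, num_gpus):
    """Total images per optimizer step across all GPUs (illustrative helper)."""
    return data_cfg["samples_per_gpu"] * num_gpus


# With the 4 GPUs from the environment above:
print(effective_batch_size(data_hanging, 4))
print(effective_batch_size(data_working, 4))
```

With one image per GPU, each process sees a single image per step, so the total batch across the 4 GPUs is half that of the working case.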

This looks somewhat similar to issue 2193.

hyz-xmaster commented 3 years ago

Hi @zvadaszi, thank you for the information. I think this bug is most likely related to the issues reported about ATSS. I have updated the repo with the corresponding fixes, so please try again.