dist_train keeps waiting with multiple GPUs and samples_per_gpu = 1

First of all, thank you for your work and for your repo.

Environment:

PyTorch 1.5.1, CUDA 10.2, cuDNN 7.6.5, MMDetection 2.3.0, 4x V100 16GB

My config file is based on vfnet_r50_fpn_mstrain_2x, modified for a custom dataset with large images (2560x1440) and mainly small objects (10-60 px).

When training with multiple GPUs and samples_per_gpu = 1, workers_per_gpu = 1, training hangs at the beginning with GPU utilization at 100% on all GPUs.
When training with multiple GPUs and samples_per_gpu = 2, workers_per_gpu = 2 (and a smaller image size), training runs fine.
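
For reference, the two loader settings compared above would look roughly like this in the `data` section of an MMDetection 2.x config. This is only a sketch: the dataset entries are omitted, and `data_hanging`, `data_working`, and `total_batch_size` are names made up for illustration, not part of MMDetection.

```python
# Sketch of the loader settings from the `data` section of an MMDetection 2.x
# config (plain Python dict syntax). The train/val/test dataset entries are
# omitted; only the two values discussed in this report are shown.
NUM_GPUS = 4  # 4x V100, as listed under Environment

# Setting that hangs at the start of training
data_hanging = dict(samples_per_gpu=1, workers_per_gpu=1)

# Setting that trains normally (used together with a smaller image size)
data_working = dict(samples_per_gpu=2, workers_per_gpu=2)


def total_batch_size(data_cfg, num_gpus=NUM_GPUS):
    """Hypothetical helper: images per iteration summed over all GPUs."""
    return data_cfg["samples_per_gpu"] * num_gpus


print(total_batch_size(data_hanging))  # 4
print(total_batch_size(data_working))  # 8
```

So the hanging run uses a total batch size of 4 (one image per GPU), while the working run uses 8.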

This seems somewhat similar to issue 2193.

hyz-xmaster / VarifocalNet