Closed: csvance closed this issue 2 years ago.
@csvance We recently fixed a dataloader deadlock in IterBasedRunner; please refer to https://github.com/open-mmlab/mmcv/pull/1442. Since your code already includes that fix, you can try increasing the sleep time first.
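For context, the sleep in question lives in mmcv's IterLoader (iter_based_runner.py). The sketch below is paraphrased from memory of that fix rather than copied verbatim, so verify against your installed mmcv, but it shows where the value would be raised:

```python
# Paraphrased sketch of mmcv's IterLoader after the fix in PR 1442;
# check mmcv/runner/iter_based_runner.py in your installation for the exact code.
import time


class IterLoader:

    def __init__(self, dataloader):
        self._dataloader = dataloader
        self.iter_loader = iter(self._dataloader)
        self._epoch = 0

    def __next__(self):
        try:
            data = next(self.iter_loader)
        except StopIteration:
            # A new epoch starts: re-seed the distributed sampler and
            # rebuild the dataloader iterator.
            self._epoch += 1
            if hasattr(self._dataloader.sampler, 'set_epoch'):
                self._dataloader.sampler.set_epoch(self._epoch)
            # Sleep added to avoid a worker-shutdown deadlock; "increase the
            # sleep time" means raising this value (e.g. to 60).
            time.sleep(2)
            self.iter_loader = iter(self._dataloader)
            data = next(self.iter_loader)
        return data
```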
Hi @hhaAndroid, I tried increasing the sleep to 60 seconds but I still get a hang at the same place. I also tried the epoch-based runner and it hangs there as well after a certain number of epochs.
Unfortunately I have not been successful in my efforts to get a stack trace for any thread/process other than the mmcv IterBasedRunner. The good news is that for my problem, training on a single GPU is sufficient, so I can continue to move forward using mmdetection.
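For anyone else trying to debug this: one way to get stacks out of the hung processes without attaching a debugger is Python's standard-library faulthandler module. A minimal sketch, assuming a Linux host (not something confirmed to have been tried in this thread):

```python
# Minimal sketch: dump the Python stack of every thread when the process
# receives SIGUSR1. Add near the top of the training entry point.
import faulthandler
import signal

# While training is hung, run `kill -USR1 <pid>` for each training/worker
# process; the tracebacks are written to stderr.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

Alternatively, py-spy can print the stack of an already-running process with `py-spy dump --pid <PID>`, with no code changes needed.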
My best guess is that there is some sort of bug in the PyTorch dataloader/DDP which is causing this problem, rather than an issue with the logic of mmcv/mmdetection.
I just realized I had forgotten to configure the number of classes in the ROI head! I tried changing the number of steps to run, ran training non-distributed in a debugger, and got an exception in the COCO dataset class! If this is the root cause of my problem, there may be something going wrong with exception handling. I will continue to dig into this and update here.
EDIT:
Still get a deadlock with distributed training, but going to keep digging.
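For reference, the override that was missing usually looks something like this in an mmdetection config (the base config path and class count below are illustrative placeholders, not the exact values used here):

```python
# Illustrative sketch only; the base config and num_classes are placeholders.
_base_ = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'

model = dict(
    roi_head=dict(
        bbox_head=dict(
            # Must match the number of categories in the custom COCO dataset.
            num_classes=5)))
```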
@csvance
Hello, I also ran into this deadlock when training on my own dataset. Have you managed to solve it?
Hi @Yuting-Gao, the only way I could avoid it was single-GPU non-distributed training (even single-GPU DDP deadlocks). Luckily my problem is fine-tuning on a 2000-image dataset, so a single GPU is not a problem. Still no idea what the root cause is, but I suspect it has something to do with DDP specifically.
Describe the bug
A custom COCO format dataset always hangs at the fourth DistEvalHook using IterBasedRunner when using distributed training. The exact place things get stuck is when the evaluation loop tries to get the next batch from the dataloader (the for loop). This happens when using >= 1 GPUs; I tested with both 1 and 2 GPUs (RTX 2080 Ti) with 0 <= workers_per_gpu <= 12. I also played around with ulimits / shared memory size / OpenCV thread count; nothing makes a difference.
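For concreteness, the OpenCV thread-count knob mentioned above is the first line of the sketch below, together with a commonly suggested companion setting for shared-memory issues (the sharing-strategy line is an extra suggestion, not something reported as tried here):

```python
# Common in-process mitigations for dataloader hangs.
import cv2
import torch.multiprocessing as mp

# Keep OpenCV from spawning its own thread pool inside dataloader workers.
cv2.setNumThreads(0)
# Use the file_system sharing strategy to avoid exhausting shared memory /
# file descriptors when passing tensors between worker processes.
mp.set_sharing_strategy('file_system')
```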
The bug does not happen without distributed training. I think it may be related to this issue, but am not 100% sure: https://github.com/pytorch/pytorch/issues/1355
I am new to using mmcv/mmdetection, so maybe I am missing something obvious, but I read the FAQ and looked through the issues without finding anything definitive.
I was wondering if anyone in the community has a specific Docker container (such as NVIDIA's PyTorch NGC image) that they use for experiments and can confirm works reliably with mmdet. It could be that my problem has to do with my system configuration, but this is the first time I have seen this type of problem after running all kinds of multi-GPU / multi-node experiments without any dataloader deadlocks.
Reproduction
My config slightly modifies the coco_detection.py config:
Custom COCO format dataset with 1170 images and a 0.7/0.2/0.1 train/eval/test split.
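The config itself was not pasted; purely as an illustration, a coco_detection.py-based config is usually pointed at a custom COCO split like this (all paths, class names, and batch settings below are hypothetical):

```python
# Hypothetical example of overriding the base coco_detection.py dataset config;
# paths, class names, and batch settings are placeholders.
dataset_type = 'CocoDataset'
data_root = 'data/custom/'
classes = ('class_a', 'class_b')

data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/train.json',
        img_prefix=data_root + 'images/'),
    val=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/val.json',
        img_prefix=data_root + 'images/'),
    test=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/test.json',
        img_prefix=data_root + 'images/'))
```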
Environment
Installed PyTorch from Anaconda. I also tried PyTorch 1.8.1 but got the same deadlock as in 1.10.0. Tried both CUDA 10.2 and 11.3.
Error traceback
There is no traceback; training simply deadlocks.
Bug fix
Currently I just train non-distributed; however, this is not ideal.