linnanwang opened this issue 2 years ago
@linnanwang Thanks for your feedback. Could you please help verify it, @ayulockin? My machines are all disconnected from the network, so there is no way for me to verify.
@linnanwang Are you referring to configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py, which was added together with the MMDetWandbHook configuration?
@hhaAndroid Thanks for the quick response. I'm referring to ./configs/yolox/yolox_s_8x8_300e_coco.py.
Hey @linnanwang thanks for raising this. So as I understand, using MMDetWandbHook is not working properly in a multi-GPU setting.
Did you get any error from W&B or MMDetection that you can share? I will test the same on my machine and let you know.
Hi @ayulockin, I'm facing the same issue here. No error was reported, but I found the GPU utilization went to 100% with 0 "GPU Time Spent Accessing Memory" (which indicates a deadlock, if I understand correctly?). Then the run hung.
It happened at the last iteration of the 1st epoch. I was wondering if this issue was related to #6486 until I came here.
By the way, does the one from mmcv (https://github.com/open-mmlab/mmcv/blob/ea173c9f07f0abf6873d2b7d786fb6411843cf00/mmcv/runner/hooks/logger/wandb.py) work with multi-GPU?
Thanks for more info @Fizzez.
Ideally, MMCV's WandbLoggerHook should work in a multi-GPU setting, given that the methods are decorated with @master_only. The same applies to this hook. I am investigating.
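For context, mmcv's master_only decorator simply skips the wrapped call on every rank except rank 0. Roughly, it works like the sketch below (paraphrased, not verbatim mmcv code):

import functools

from mmcv.runner.dist_utils import get_dist_info


def master_only(func):
    """Run the decorated function only on the master process (rank 0)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        rank, _ = get_dist_info()
        if rank == 0:
            return func(*args, **kwargs)

    return wrapper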
@ayulockin Thank you for the quick reply.
Ideally MMCV's WandbLoggerHook should work in a multi-GPU setting given the methods are decorated with @master_only.
I see. I thought MMDetWandbHook could not run in a multi-GPU environment. My mistake.
If possible, could you please also share any ideas that may point to the cause of this issue? I am working on it and need a quick fix.
@ayulockin Thank you very much for helping to check.
Hey @linnanwang, @Fizzez, I tried training the model on 2 P100 GPUs by doing this: bash tools/dist_train.sh configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py 2
I couldn't reproduce the deadlock issue. If you check out the system metrics in this W&B run page, you will see that memory is allocated for both the GPUs and that both are used for training.
Perhaps it is a problem unique to ./configs/yolox/yolox_s_8x8_300e_coco.py?
In my case I used ./configs/yolo/yolov3_d53_mstrain-608_273e_coco.py
Also, I have checked that running with MMCV's WandbLoggerHook was fine. So it may not be a problem unique to a particular config.
Perhaps this is a problem with this particular model; could you take a look at yolox_s_8x8_300e_coco?
I experience the same phenomenon (a deadlock lasting over 30 minutes) with dyhead/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x_coco.py, but only in a distributed training setting with more than 1 GPU. Training with 1 GPU is okay.
I faced the same problem with mmseg using MMSegWandbHook (https://github.com/open-mmlab/mmsegmentation/issues/2137). Oddly, this always occurs right after the first output of the TextLoggerHook.
In my case, I found out that putting MMSegWandbHook before the other loggers (i.e., TextLoggerHook) solved the deadlock. Maybe this might solve your issue as well if you put MMDetWandbHook on top @linnanwang @MilkClouds
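For reference, that ordering would look roughly like the sketch below; the init_kwargs value is only a placeholder, adjust it to your own setup:

log_config = dict(
    interval=50,
    hooks=[
        # wandb hook placed before TextLoggerHook, as suggested above, so the
        # text logger ends up as the last logger hook on every rank
        dict(type='MMDetWandbHook',
             init_kwargs=dict(project='mmdetection')),  # placeholder kwargs
        dict(type='TextLoggerHook'),
    ])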
I finally managed to solve this by setting reset_flag=True in TextLoggerHook (although it's kind of an ugly fix).
I.e., use a config like the following:
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', reset_flag=True),
        # configure MMDetWandbHook as usual here (init_kwargs, etc.)
        dict(type='MMDetWandbHook'),
    ])
As far as I investigated, the deadlock was caused by the dist.all_reduce() here:
https://github.com/open-mmlab/mmdetection/blob/v2.25.0/mmdet/models/detectors/base.py#L204
The reason seems to be:
- reset_flag of the logger hook with the lowest priority should be set to True in order to clear runner.log_buffer after logging is done.
- MMDetWandbHook comes after TextLoggerHook in your config file, so MMDetWandbHook has the lowest priority and its reset_flag is True, in all processes for all the GPUs. reset_flag=True for MMDetWandbHook is fine on GPU 0, but the real last hook on the GPUs other than 0, TextLoggerHook, didn't have reset_flag=True properly set, because MMDetWandbHook is basically master_only.
- As a result, TextLoggerHook on those GPUs never clears runner.log_buffer and keeps executing dist.reduce(mem_mb, 0, op=dist.ReduceOp.MAX) to collect the memory size, ignoring the logging interval. That dist.reduce seems to have blocked the dist.all_reduce() I mentioned in the beginning... but I'm not quite sure.
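To make the mismatch concrete, here is a small, self-contained toy script (not MMDetection code) in which rank 0 enters an all_reduce while the other rank enters a reduce. Mismatched collectives like this typically block forever, which is the shape of the hang described above; the script is expected to hang and has to be killed with Ctrl+C:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # minimal CPU-only setup with the gloo backend
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29501'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    t = torch.tensor([rank])
    if rank == 0:
        # stand-in for the dist.all_reduce() in _parse_losses on the master
        dist.all_reduce(t)
    else:
        # stand-in for the dist.reduce() TextLoggerHook uses to collect the
        # max memory on the other ranks, fired outside the logging interval
        dist.reduce(t, dst=0, op=dist.ReduceOp.MAX)
    print(f'rank {rank} finished')  # typically never reached


if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)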
P.S. I used the following environment and config:
./configs/yolo/yolov3_d53_mstrain-608_273e_coco.py
Hi, I had a different type of error, but it seems to occur for the same reason.
I ran bash tools/dist_train.sh configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py 2 and got an error saying AssertionError: loss log variables are different across GPUs!
The log text is here: mask_rcnn_r50_fpn_1x_wandb_coco.py.txt
This error occurs right after iteration 50. (If you look at the error log, you can see that 2022-12-10 16:27:07,287 - mmdet - INFO - Epoch [1][50/29317] lr: 1.978e-03, eta: 1 day, 12:27:04, time: 0.373, data_time: 0.103, memory: 3889, loss_rpn_cls: 0.5394, loss_rpn_bbox: 0.1028, loss_cls: 1.1100, acc: 86.3379, loss_bbox: 0.0865, loss_mask: 0.7602, loss: 2.5990 is printed before the error message.)
I got the console output below when I inserted print(self, log_dict, runner.log_buffer.output) into the TextLoggerHook.log function.
2022-12-10 17:04:02,551 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7f49792cb2e0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 50)]) OrderedDict([('data_time', 0.10089991092681885), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.37457459926605224)])
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7fc47c4f93a0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 50)]) OrderedDict([('data_time', 0.11627291202545166), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.7914962005615235)])
2022-12-10 17:04:16,786 - mmdet - INFO - Epoch [1][50/29317] lr: 1.978e-03, eta: 1 day, 12:35:58, time: 0.375, data_time: 0.101, memory: 3901, loss_rpn_cls: 0.5408, loss_rpn_bbox: 0.1029, loss_cls: 1.1096, acc: 86.3281, loss_bbox: 0.0862, loss_mask: 0.7601, loss: 2.5996
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7fc47c4f93a0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 51)]) OrderedDict([('data_time', 0.11627291202545166), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.7914962005615235)])
We can see that TextLoggerHook for GPU 1 is called at iteration 51, which is not intended. This happens because GPU 1's (or 2, 3, 4, ...) runner.log_buffer is not cleared, so the condition at https://github.com/open-mmlab/mmcv/blob/v1.5.0/mmcv/runner/hooks/logger/base.py#L137 is satisfied and self.log(runner) is called.
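For reference, the relevant part of LoggerHook.after_train_iter looks roughly like the sketch below (paraphrased from mmcv v1.5.0, not verbatim): once log_buffer.ready is set and never cleared, self.log() fires on every subsequent iteration regardless of the interval.

def after_train_iter(self, runner):
    # average the buffered values at the configured interval
    if self.by_epoch and self.every_n_inner_iters(runner, self.interval):
        runner.log_buffer.average(self.interval)
    elif not self.by_epoch and self.every_n_iters(runner, self.interval):
        runner.log_buffer.average(self.interval)
    elif self.end_of_epoch(runner) and not self.ignore_last:
        runner.log_buffer.average(self.interval)
    # the check referenced above (base.py#L137, approximately): `ready` stays
    # True until some hook with reset_flag set clears the buffer
    if runner.log_buffer.ready:
        self.log(runner)
        if self.reset_flag:
            runner.log_buffer.clear_output()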
My guess is that the reason AssertionError: loss log variables are different across GPUs! occurred instead of a deadlock is the PyTorch version. I used PyTorch 1.8.2 this time, and when I experienced the deadlock I used PyTorch 1.12.0. Some difference in the dist.all_reduce implementation between PyTorch versions may produce a different result, but I'm not quite sure.
I agree with @Fizzez's opinion. To summarize:

1. With the ordering TextLoggerHook -> MMDetWandbHook in the config, MMDetWandbHook's reset_flag is set to true.
2. On GPU 0, MMDetWandbHook's runner.log_buffer.clear_output() in LoggerHook is called. On GPUs 1, 2, 3, ..., it is NOT called, since MMDetWandbHook is @master_only.
3. Because runner.log_buffer is not cleared on those GPUs, self.log of TextLoggerHook is continuously called after the first log (even when it is not time to log).
4. Inside TextLoggerHook's self.log there is a dist.reduce call.
5. Since this extra self.log of TextLoggerHook is not called on GPU 0 (only on the other GPUs), the dist.reduce causes a deadlock: GPUs 1, 2, 3, ... block in dist.reduce, while GPU 0 blocks in dist.all_reduce.
6. In my case, AssertionError: loss log variables are different across GPUs! occurred instead of a deadlock; the detailed reason why a different error occurs is not quite clear.

The solution is simple: make MMDetWandbHook clear runner.log_buffer in every process, including the processes for GPUs 1, 2, 3, ... A sketch of that idea follows below.
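The following is only a minimal sketch of that idea, not the actual PR; the import path, the class name, and the assumption that MMDetWandbHook.after_train_iter is @master_only (as discussed above) are mine:

# Sketch only: clear runner.log_buffer on every rank, not just rank 0.
from mmcv.runner import HOOKS
# assumed import path for MMDetWandbHook; adjust to your mmdet version
from mmdet.core.hook.wandblogger_hook import MMDetWandbHook


@HOOKS.register_module()
class AllRankResetWandbHook(MMDetWandbHook):  # hypothetical name

    def after_train_iter(self, runner):
        # The parent body is @master_only, so on ranks > 0 it is a no-op and
        # never reaches LoggerHook's clear_output() call.
        super().after_train_iter(runner)
        # Clear the buffer on every rank so TextLoggerHook on ranks > 0 does
        # not keep firing (and calling dist.reduce) outside the log interval.
        if self.reset_flag:
            runner.log_buffer.clear_output()

In log_config you would then reference type='AllRankResetWandbHook' in place of MMDetWandbHook.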
Since I've figured out why this error occurs, I'm going to make a PR for it, but since @Fizzez figured out the key reason, I'm going to mention you as a co-author. Is that OK with you? I'm also going to make a PR for mmseg (if possible).
Hi @MilkClouds, thank you for the analysis. Glad to see that we have the same opinion on this. Your solution actually makes more sense, since it lets MMDetWandbHook's runner.log_buffer.clear_output() get called.
Also thank you for mentioning me in your PR. I really appreciate it.
@MilkClouds Thank you for your solution, it saved my day.
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
Hello mmdet developers,
We found that the training loop can deadlock in some places when we use multi-GPU training and enable wandb tracking. Single-GPU training works perfectly fine. I only tested with YOLOX. Please see the command below.
Reproduction
Environment
Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
GPU 0,1: Quadro GV100
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.11.0
OpenCV: 4.5.5
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.25.0+ca11860
Other relevant settings ($PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.): we used the provided docker.
Error traceback
If applicable, paste the error traceback here.
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!