open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

deadlock using Wandb #8145

Open linnanwang opened 2 years ago

linnanwang commented 2 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug

Hello mmdet developers,

We found that the training loop can deadlock in some places if we use multi-GPU training and enable wandb tracking. Single-GPU training works perfectly fine. I only tested with YOLOX; please see the command below.

Reproduction

  1. What command or script did you run?
./tools/dist_train.sh ./configs/yolox/yolox_s_8x8_300e_coco.py 2
  2. Did you make any modifications on the code or config? Did you understand what you have modified? No
  3. What dataset did you use? MSCOCO

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.

    sys.platform: linux
    Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
    CUDA available: True
    GPU 0,1: Quadro GV100
    CUDA_HOME: /usr/local/cuda
    NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
    GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
    PyTorch: 1.10.0
    PyTorch compiling details: PyTorch built with:
    • GCC 7.3
    • C++ Version: 201402
    • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
    • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
    • OpenMP 201511 (a.k.a. OpenMP 4.5)
    • LAPACK is enabled (usually provided by MKL)
    • NNPACK is enabled
    • CPU capability usage: AVX2
    • CUDA Runtime 11.3
    • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
    • CuDNN 8.2
    • Magma 2.5.2
    • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

    TorchVision: 0.11.0
    OpenCV: 4.5.5
    MMCV: 1.4.0
    MMCV Compiler: GCC 7.3
    MMCV CUDA Compiler: 11.3
    MMDetection: 2.25.0+ca11860

  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

    We used the provided Docker image.

Error traceback

If applicable, paste the error traceback here.

A placeholder for the traceback.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

hhaAndroid commented 2 years ago

@linnanwang Thanks for your feedback. Could you please help verify it, @ayulockin? My machines are all disconnected from the network, so I have no way to verify it myself.

hhaAndroid commented 2 years ago

@linnanwang Are you referring to the MMDetWandbHook configuration added in configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py?

linnanwang commented 2 years ago

@hhaAndroid Thanks for the quick response. I'm referring to ./configs/yolox/yolox_s_8x8_300e_coco.py.

ayulockin commented 2 years ago

Hey @linnanwang, thanks for raising this. So, as I understand it, MMDetWandbHook is not working properly in a multi-GPU setting.

Did you get any error from W&B or MMDetection that you can share? I will test the same on my machine and let you know.

Fizzez commented 2 years ago

Did you get any error from W&B or MMDetection that you can share? I will test the same on my machine and let you know.

Hi @ayulockin, I'm facing the same issue here. There was no error reported, but I found that the GPU utilization went to 100% with 0 "GPU Time Spent Accessing Memory" (which indicates a deadlock, if I understand correctly?). Then the run hung.

It happened at the last iteration of the 1st epoch. I was wondering if this issue was related to #6486 until I came here.

By the way, does the WandbLoggerHook from mmcv (https://github.com/open-mmlab/mmcv/blob/ea173c9f07f0abf6873d2b7d786fb6411843cf00/mmcv/runner/hooks/logger/wandb.py) work with multi-GPU?

ayulockin commented 2 years ago

Thanks for more info @Fizzez.

Ideally, MMCV's WandbLoggerHook should work in a multi-GPU setting, given that its methods are decorated with @master_only. The same applies to this hook. I am investigating.
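For context, @master_only simply short-circuits the call on non-zero ranks, roughly like this (a simplified re-implementation for illustration, not the actual MMCV source):

import functools

from mmcv.runner import get_dist_info


def master_only_sketch(func):
    """Simplified stand-in for mmcv.runner.master_only: run only on rank 0."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        rank, _ = get_dist_info()
        if rank == 0:
            return func(*args, **kwargs)
        # On ranks 1..N-1 the call silently does nothing, which is why any
        # bookkeeping inside a decorated method is skipped on those ranks.

    return wrapper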

Fizzez commented 2 years ago

@ayulockin Thank you for the quick reply.

Ideally, MMCV's WandbLoggerHook should work in a multi-GPU setting, given that its methods are decorated with @master_only.

I see. I thought MMDetWandbHook could not run in a multi-GPU environment. My mistake.

If possible, could you please also share any ideas that may point to the cause of this issue? I am actually working on it and need a quick fix.

hhaAndroid commented 2 years ago

@ayulockin Thank you very much for helping to check.

ayulockin commented 2 years ago

Hey @linnanwang, @Fizzez, I tried training the model on 2 P100 GPUs by doing this: bash tools/dist_train.sh configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py 2

I couldn't reproduce the deadlock issue. If you check out the system metrics on this W&B run page, you will see that memory is allocated on both GPUs and that both are used for training.

linnanwang commented 2 years ago

Perhaps it is a problem unique to ./configs/yolox/yolox_s_8x8_300e_coco.py?

Fizzez commented 2 years ago

In my case I used ./configs/yolo/yolov3_d53_mstrain-608_273e_coco.py. I have also checked that running with MMCV's WandbLoggerHook was fine, so it may not be a problem unique to one config.

linnanwang commented 2 years ago

Perhaps this is a problem with this particular model. Could you take a look at yolox_s_8x8_300e_coco?

MilkClouds commented 2 years ago

I experience the same phenomenon (a deadlock lasting over 30 minutes) with dyhead/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x_coco.py, and only in the distributed (more than 1 GPU) setting. Single-GPU training is okay.

haruishi43 commented 2 years ago

I faced the same problem with mmseg using MMSegWandbHook (https://github.com/open-mmlab/mmsegmentation/issues/2137). Oddly, this always occurs right after the first output of the TextLoggerHook.

haruishi43 commented 2 years ago

In my case, I found that putting MMSegWandbHook before the other loggers (i.e., TextLoggerHook) solved the deadlock. This might solve your issue as well if you put MMDetWandbHook first, @linnanwang @MilkClouds.
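For mmdet that would look roughly like the config below (a sketch; the init_kwargs value is a placeholder). If I remember the reset_flag handling correctly, the last logger hook is the one that resets runner.log_buffer, and TextLoggerHook is not @master_only, so with this ordering the reset happens on every rank:

log_config = dict(
    interval=50,
    hooks=[
        # W&B hook first, TextLoggerHook last; reordering reportedly
        # avoids the deadlock.
        dict(type='MMDetWandbHook',
             init_kwargs=dict(project='mmdetection')),  # placeholder project name
        dict(type='TextLoggerHook'),
    ])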

Fizzez commented 1 year ago

Finally managed to solve this by setting reset_flag=True in TextLoggerHook (although it's kind of an ugly fix).

i.e., use a config like the following:

log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', reset_flag=True),
        dict(type='MMDetWandbHook')  # set the MMDetWandbHook options (init_kwargs, etc.) here
    ])

As far as I investigated, the deadlock was caused by dist.all_reduce() here: https://github.com/open-mmlab/mmdetection/blob/v2.25.0/mmdet/models/detectors/base.py#L204

The reason seems to be:

p.s. I used the following environment

MilkClouds commented 1 year ago

(New) Error report and analysis

Hi, I had a different type of error, but it seems to occur for the same reason.

I ran bash tools/dist_train.sh configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py 2 and got the error AssertionError: loss log variables are different across GPUs!

Here is the log: mask_rcnn_r50_fpn_1x_wandb_coco.py.txt

This error occurs right after iteration 50. (In the error log, you can see that 2022-12-10 16:27:07,287 - mmdet - INFO - Epoch [1][50/29317] lr: 1.978e-03, eta: 1 day, 12:27:04, time: 0.373, data_time: 0.103, memory: 3889, loss_rpn_cls: 0.5394, loss_rpn_bbox: 0.1028, loss_cls: 1.1100, acc: 86.3379, loss_bbox: 0.0865, loss_mask: 0.7602, loss: 2.5990 is printed right before the error message.)

I got the console output below when I inserted print(self, log_dict, runner.log_buffer.output) in the TextLoggerHook.log function.

2022-12-10 17:04:02,551 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7f49792cb2e0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 50)]) OrderedDict([('data_time', 0.10089991092681885), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.37457459926605224)])
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7fc47c4f93a0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 50)]) OrderedDict([('data_time', 0.11627291202545166), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.7914962005615235)])
2022-12-10 17:04:16,786 - mmdet - INFO - Epoch [1][50/29317]    lr: 1.978e-03, eta: 1 day, 12:35:58, time: 0.375, data_time: 0.101, memory: 3901, loss_rpn_cls: 0.5408, loss_rpn_bbox: 0.1029, loss_cls: 1.1096, acc: 86.3281, loss_bbox: 0.0862, loss_mask: 0.7601, loss: 2.5996
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7fc47c4f93a0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 51)]) OrderedDict([('data_time', 0.11627291202545166), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.7914962005615235)])

We can see that the TextLoggerHook on GPU 1 is called at iteration 51, which is not intended. This happens because GPU 1's (or 2, 3, 4, ...'s) runner.log_buffer is not cleared, so the condition at https://github.com/open-mmlab/mmcv/blob/v1.5.0/mmcv/runner/hooks/logger/base.py#L137 is satisfied and self.log(runner) is called.
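For reference, the relevant part of LoggerHook.after_train_iter looks roughly like this (paraphrased and simplified from MMCV 1.x; see the linked base.py for the exact code):

def after_train_iter(self, runner):
    if self.by_epoch and self.every_n_inner_iters(runner, self.interval):
        runner.log_buffer.average(self.interval)  # average() sets log_buffer.ready = True
    if runner.log_buffer.ready:                   # stays True until clear_output() is called
        self.log(runner)                          # TextLoggerHook.log() issues a dist.reduce
        if self.reset_flag:
            runner.log_buffer.clear_output()      # only the hook with reset_flag=True clears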

My guess is that the reason AssertionError: loss log variables are different across GPUs! occurred instead of a deadlock is the PyTorch version. I used PyTorch 1.8.2 this time, whereas when I experienced the deadlock I used PyTorch 1.12.0. Some difference in the implementation of dist.all_reduce between PyTorch versions may produce different results, but I'm not quite sure.

Why the error occurs (summary)

I agree with @Fizzez's opinion.

  1. When we set the hook order to TextLoggerHook -> MMDetWandbHook,
  2. MMDetWandbHook's reset_flag is set to True.
  3. On GPU 0, MMDetWandbHook's runner.log_buffer.clear_output() (in LoggerHook) is called.
  4. Meanwhile, on GPU 1 (2, 3, 4, ...), MMDetWandbHook's runner.log_buffer.clear_output() (in LoggerHook) is NOT called, since MMDetWandbHook is @master_only.
  5. Since GPU 1's (2, 3, 4, ...'s) runner.log_buffer is not cleared, TextLoggerHook's self.log keeps being called after the first log (even when it is not time to log).
  6. TextLoggerHook's self.log contains a dist.reduce call.
  7. Since TextLoggerHook's self.log is then called only on GPUs 1, 2, 3, ... and not on GPU 0, dist.reduce deadlocks: GPUs 1, 2, 3, ... block in dist.reduce while GPU 0 blocks in dist.all_reduce (see the standalone sketch after this list).
    • Depending on the PyTorch version, this produces either a global deadlock (on 1.12.0) or a local deadlock (on 1.8.2, which leads to AssertionError: loss log variables are different across GPUs!). Why the observed error differs is not quite clear.
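Here is the standalone sketch mentioned in point 7: a minimal torch.distributed script (not mmdet code) showing how a collective entered by only some ranks blocks forever. Run it with torchrun --nproc_per_node=2 sketch.py:

import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group('gloo')  # 'nccl' shows the same behaviour on GPUs
    rank = dist.get_rank()

    t = torch.ones(1)
    dist.all_reduce(t)  # every rank participates: returns normally

    if rank == 0:
        # Only rank 0 enters this collective, so it never gets the matching
        # call from rank 1 and blocks forever. This mirrors the mismatch
        # between _parse_losses on GPU 0 and the extra dist.reduce issued by
        # TextLoggerHook on the other GPUs.
        dist.all_reduce(t)
    else:
        time.sleep(600)  # keep rank 1 alive so the hang on rank 0 is visible


if __name__ == '__main__':
    main()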

Solution

The solution is simple: make MMDetWandbHook clear runner.log_buffer on every process, including the processes for GPUs 1, 2, 3, ...
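A rough sketch of the kind of change I have in mind (not the actual PR; it assumes the MMCV 1.x / mmdet 2.x APIs, and the class name RankSafeWandbHook and the exact import path are just illustrative):

from mmcv.runner import HOOKS, get_dist_info
from mmdet.core.hook import MMDetWandbHook  # import path may differ per mmdet version


@HOOKS.register_module()
class RankSafeWandbHook(MMDetWandbHook):  # hypothetical name
    """Keep W&B calls master-only, but clear the log buffer on every rank."""

    def after_train_iter(self, runner):
        rank, _ = get_dist_info()
        if rank == 0:
            # Rank 0: normal behaviour (W&B logging plus buffer handling).
            super().after_train_iter(runner)
        elif self.reset_flag:
            # Other ranks: skip W&B entirely, but still clear the buffer
            # output so TextLoggerHook does not fire extra dist.reduce
            # calls on later iterations.
            runner.log_buffer.clear_output()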

Since I've figured out why this error occurs, I'm going to make a PR for it, and since @Fizzez figured out the key reason, I'd like to mention you as a co-author. Is that okay with you? I'm also going to make a PR for mmseg (if possible).

Fizzez commented 1 year ago

Hi @MilkClouds, thank you for the analysis. Glad to see that we have the same opinion on this. Your solution actually makes more sense, by making sure MMDetWandbHook's runner.log_buffer.clear_output() gets called. Also, thank you for mentioning me in your PR; I really appreciate it.

hankyul2 commented 10 months ago

@MilkClouds Thank you for your solution, it saved my day.