ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

DDP `--sync-bn` bug with torch 1.9.0 #3998

Closed · simba0703 closed this issue 3 years ago

simba0703 commented 3 years ago

When I use `python -m torch.distributed.launch --nproc_per_node 3 train.py --batch-size 12 --data data/coco128.yaml --weights yolov5m6.pt --device 1,2,3 --adam --sync-bn`, the training process gets blocked at epoch 0. If I do not use `--sync-bn`, training runs fine.
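For context, --sync-bn makes train.py convert the model's BatchNorm layers to SyncBatchNorm before wrapping the model in DistributedDataParallel, roughly like the minimal sketch below (toy model and simplified setup under torch.distributed.launch; not the exact YOLOv5 code):

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal sketch of what --sync-bn enables (toy model, simplified setup; not the exact train.py code).
dist.init_process_group(backend='nccl')   # launched with one process per GPU
rank = dist.get_rank()                    # single node: process rank == GPU index
torch.cuda.set_device(rank)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.SiLU()).cuda(rank)

# --sync-bn: swap every BatchNorm layer for SyncBatchNorm so batch statistics are
# synchronized across all DDP processes instead of computed per GPU.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

model = DDP(model, device_ids=[rank])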

wudashuo commented 3 years ago

I ran into the same problem a week ago: training gets stuck if I use --sync-bn, and removing it makes training run fine. I tried to find out why, but failed.

imyhxy commented 3 years ago

Encountered the same problem here. 🌗

glenn-jocher commented 3 years ago

@simba0703 @wudashuo @imyhxy thanks for the notice guys. Yes, --sync-bn is broken with torch 1.9.0; I can't figure out what the problem is though :(

If you guys find a solution please let us know! In the meantime I'll add an assert to let users know this is a known issue.

You can still train DDP normally however, which I would recommend anyway, as all of the official models were trained without --sync-bn.
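The assert mentioned above could look something like this sketch (flag name assumed; not necessarily the exact check added to train.py):

import torch

# Sketch of a fail-fast guard for the known issue (assumed flag name; not the exact train.py assert).
sync_bn = True  # stands in for the --sync-bn command-line flag
if sync_bn:
    assert not torch.__version__.startswith('1.9.0'), (
        '--sync-bn is known to hang DDP training on torch 1.9.0, '
        'see https://github.com/ultralytics/yolov5/issues/3998'
    )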

jfpuget commented 3 years ago

This may be the cause: https://github.com/pytorch/pytorch/issues/37930

glenn-jocher commented 3 years ago

@simba0703 @wudashuo @imyhxy @jfpuget good news 😃! Your original issue may now be fixed ✅ in PR #4615. We discovered that the DDP --sync-bn issue was caused by TensorBoard add_graph() logging (used for visualizing the model interactively, example below). I don't know the exact cause and thus did not implement a fix; instead I implemented a workaround to avoid TensorBoard model visualization when --sync-bn is used.

[Screenshot: TensorBoard model graph visualization]

This means DDP training now works without issue with or without --sync-bn, but --sync-bn runs will not show a model visualization component in TensorBoard.
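The workaround amounts to guarding the TensorBoard graph-logging call, roughly like this sketch (toy model and assumed flag name; not the exact PR #4615 change):

import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

# Workaround sketch (toy model, assumed flag name; not the exact PR code): only log the
# model graph to TensorBoard when --sync-bn is off, since add_graph() combined with
# SyncBatchNorm hangs DDP training on torch 1.9.0.
sync_bn = True  # stands in for the --sync-bn command-line flag
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.SiLU())
writer = SummaryWriter()

if not sync_bn:
    dummy = torch.zeros(1, 3, 640, 640)  # dummy image used only to trace the graph
    writer.add_graph(model, dummy)
writer.close()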

To receive this update:

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!