Closed simba0703 closed 3 years ago
Got the same problem a week ago: training gets stuck if I use --sync-bn. Remove it and training runs fine.
I tried to find out why, but failed.
Encountered the same problem here. 🌗
@simba0703 @wudashuo @imyhxy thanks for the notice, guys. Yes, --sync-bn is broken with torch 1.9.0, though I can't figure out what the problem is :(
If you guys find a solution, please let us know! In the meantime I'll add an assert to let users know this is a known issue.
You can still train DDP normally, however, which I would recommend anyway, as all of the official models were trained without --sync-bn.
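For anyone wondering what such an assert could look like, here is a minimal sketch. The function name check_sync_bn is hypothetical and this is not the actual check added to the repo. It assumes only torch 1.9.0 is affected, as reported in this thread, and takes the version as a plain string so it runs without torch installed.

```python
def check_sync_bn(sync_bn: bool, torch_version: str) -> None:
    """Fail fast if --sync-bn is requested on a torch version reported as affected.

    Hypothetical helper for illustration; not the check used in YOLOv5 itself.
    """
    # torch.__version__ can carry a local build suffix such as "1.9.0+cu111",
    # so strip everything after "+" before comparing.
    base_version = torch_version.split("+")[0]
    # Assumption: only 1.9.0 is treated as affected, per the reports above.
    affected = base_version == "1.9.0"
    assert not (sync_bn and affected), (
        "--sync-bn is a known issue with torch 1.9.0; "
        "train DDP without --sync-bn instead"
    )
```

In a training script this would presumably be called once at startup, e.g. check_sync_bn(opt.sync_bn, torch.__version__), where opt is whatever object holds the parsed arguments.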
This may be the cause: https://github.com/pytorch/pytorch/issues/37930
@simba0703 @wudashuo @imyhxy @jfpuget good news 😃! Your original issue may now be fixed ✅ in PR #4615. We discovered that the DDP --sync-bn issue was caused by TensorBoard add_graph() logging (used for visualizing the model interactively). I don't know the exact cause and thus did not implement a fix; instead I implemented a workaround that skips TensorBoard model visualization when --sync-bn is used.
This means DDP training now works with or without --sync-bn, but --sync-bn runs will not show a model visualization component in TensorBoard.
To receive this update:
git pull from within your yolov5/ directory, or git clone https://github.com/ultralytics/yolov5 again
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True) to force-reload from PyTorch Hub
sudo docker pull ultralytics/yolov5:latest to update your image
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
When I use 'python -m torch.distributed.launch --nproc_per_node 3 train.py --batch-size 12 --data data/coco128.yaml --weights yolov5m6.pt --device 1,2,3 --adam --sync-bn', the training process blocks at epoch 0. If I do not use '--sync-bn', the training process goes well.
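To compare the two cases side by side, the command above can be assembled with and without the flag. This is just a small convenience sketch: build_launch_command is a hypothetical helper, and every argument value is copied from the command in this report.

```python
import shlex


def build_launch_command(devices, sync_bn=False):
    """Build the DDP launch command from the report, optionally with --sync-bn."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        # One process per listed GPU.
        "--nproc_per_node", str(len(devices)),
        "train.py",
        "--batch-size", "12",
        "--data", "data/coco128.yaml",
        "--weights", "yolov5m6.pt",
        "--device", ",".join(str(d) for d in devices),
        "--adam",
    ]
    if sync_bn:
        # The flag that triggers the hang at epoch 0 in the report.
        cmd.append("--sync-bn")
    return cmd


# Print both variants as copy-pasteable shell commands.
print(shlex.join(build_launch_command([1, 2, 3], sync_bn=True)))
print(shlex.join(build_launch_command([1, 2, 3], sync_bn=False)))
```

The returned list can also be passed directly to subprocess.run to launch each variant from a script.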