@Darshcg I recommend all DDP trainings be run in our Docker image, with a custom PyTorch install in the image if necessary from https://pytorch.org/get-started/locally/
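As a rough sketch of that Docker-based setup (the image tag, dataset mount path, and flags below are assumptions to adapt to your environment, not an official recipe):

```bash
# Sketch only: pull the YOLOv5 Docker image and start a container with all GPUs visible.
# The host dataset path is a placeholder -- point it at your own data.
docker pull ultralytics/yolov5:latest
docker run --gpus all --ipc=host -it \
  -v /path/to/datasets:/usr/src/datasets \
  ultralytics/yolov5:latest

# Inside the container, a 4-GPU DDP launch would then look roughly like:
python -m torch.distributed.run --nproc_per_node 4 train.py \
  --batch 32 --data coco128.yaml --weights yolov5m.pt --device 0,1,2,3
```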
Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLOv5 and Vision AI!
Hi @Darshcg, I have exactly the same issue. Are there any updates? Thank you!
Try terminating the workers on all the nodes, then launching the training again.
I also had the same problem, did you solve it?
@zhangze1 it seems like there may be an issue with the DDP (Distributed Data Parallel) setup in the current configuration. One solution is to terminate the workers on all the nodes and launch the training again, as you have suggested. Another potential solution would be to use a custom PyTorch install in the Docker image. You can find more information on this here: https://pytorch.org/get-started/locally/.
If these solutions do not work or if you need more help, please provide more details about your setup and the specific error messages you are encountering, and we can try to help you solve the problem.
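As a concrete sketch of the "terminate the workers on all the nodes" suggestion (the pkill patterns are assumptions and should be matched to the command you actually used to launch training), run something like this on every node before relaunching:

```bash
# Sketch only: check for leftover training processes, kill them, and confirm the GPUs are free.
nvidia-smi                                  # look for stale processes still holding GPU memory
pkill -9 -f "torch.distributed.run" || true # kill leftover launcher processes (pattern is an assumption)
pkill -9 -f "train.py" || true              # kill leftover worker processes (pattern is an assumption)
nvidia-smi                                  # GPUs should now show no training processes
```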
Hi @glenn-jocher ,
I am trying to train YOLOv5m on a custom dataset with multi-GPU training. But when I run the command sudo python3 -m torch.distributed.run --nproc_per_node 4 train.py --batch 32 --data coco128.yaml --weights yolov5m.pt --device 0,1,2,3, it continuously shows Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00) and the training never starts.
Here is the detailed log of running sudo python3 -m torch.distributed.run --nproc_per_node 4 train.py --batch 32 --data coco128.yaml --weights yolov5m.pt --device 0,1,2,3
Can anyone help me resolve the issue? I am trying to train on 4 V100 GPUs (16 GB memory each) with PyTorch 1.9 and CUDA 10.2.
Thanks, Darshan C G
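For what it's worth, the quoted message shows worker_count=8 against world_size=4, which fits the earlier suggestion that workers from a previous launch are still registered with the rendezvous store. A rough sketch of relaunching on an explicit, unused master port after clearing those workers (29501 is an arbitrary example, and this is an assumption rather than a confirmed fix):

```bash
# Sketch only: after killing stale workers on all nodes, relaunch on a fresh master port
# so the new run cannot attach to a store left over from a previous launch.
sudo python3 -m torch.distributed.run --nproc_per_node 4 --master_port 29501 \
  train.py --batch 32 --data coco128.yaml --weights yolov5m.pt --device 0,1,2,3
```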