Closed menkeyi closed 1 year ago
@menkeyi always train DDP in Docker for reduced environment risk.
Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
- Notebooks with free GPU:
- Google Cloud Deep Learning VM. See GCP Quickstart Guide
- Amazon Deep Learning AMI. See AWS Quickstart Guide
- Docker Image. See Docker Quickstart Guide
Build Docker Image: yolov5/utils/docker/Dockerfile
DDP training is still a problem in Docker:
root@d9fff3881c10:/usr/src/app# python -m torch.distributed.run --nproc_per_node 2 --master_port 1 train.py --epochs 3 --data data/mycoco2017.yaml
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=yolov5s.pt, cfg=, data=data/mycoco2017.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.2-226-gfde7758 Python-3.8.13 torch-1.13.0a0+08820cb CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
@menkeyi DDP training in Docker works correctly. I'm currently training two models on 4 GPUs each without problems. Make sure your master port is not already in use.
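One way to check the master port before launching is a small standard-library script. This is only a sketch, not part of YOLOv5 or PyTorch: the helper names `port_is_free` and `find_free_port` are made up for illustration, and it assumes the rendezvous happens on the local machine (127.0.0.1).

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Bind failed: the port is in use or we lack permission for it.
            return False
    return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0.

    Hypothetical helper for illustration only. Another process could
    claim the port between this call and the training launch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    print(f"free port: {port}, still free: {port_is_free(port)}")
```

The printed port could then be passed as the `--master_port` value in the launch command shown earlier in this thread. A free port does not by itself rule out other causes of a hang.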
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
Search before asking
YOLOv5 Component
Training, Multi-GPU
Bug
YOLOv5 🚀 v6.2-0-gd3ea0df Python-3.9.0 torch-1.13.0+cu116 CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB) CUDA:1 (NVIDIA A100 80GB PCIe, 81100MiB)
OS: Ubuntu 22.04.1 LTS
python: 3.9
(yolov5_cuda11_6) root@admin:~/git_project/yolov5# python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 32 --data ./data/mycoco2017.yaml --weights ./yolov5s.pt --epochs 1 --device 0,1
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=./yolov5s.pt, cfg=, data=./data/mycoco2017.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=1, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
Command 'git fetch origin' timed out after 5 seconds
YOLOv5 🚀 v6.2-0-gd3ea0df Python-3.9.0 torch-1.13.0+cu116 CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB) CUDA:1 (NVIDIA A100 80GB PCIe, 81100MiB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs in Weights & Biases
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
Then the output stops. Training on a single GPU works without problems.
Environment
YOLOv5 🚀 v6.2-0-gd3ea0df Python-3.9.0 torch-1.13.0+cu116 CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB) CUDA:1 (NVIDIA A100 80GB PCIe, 81100MiB)
OS: Ubuntu 22.04.1 LTS
python: 3.9
Minimal Reproducible Example
No response
Additional
No response
Are you willing to submit a PR?