Multiple GPUs cannot be trained

menkeyi commented 1 year ago

Search before asking

[X] I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training, Multi-GPU

Bug

YOLOv5 🚀 v6.2-0-gd3ea0df Python-3.9.0 torch-1.13.0+cu116
CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB)
CUDA:1 (NVIDIA A100 80GB PCIe, 81100MiB)
OS: Ubuntu 22.04.1 LTS
python: 3.9

(yolov5_cuda11_6) root@admin:~/git_project/yolov5# python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 32 --data ./data/mycoco2017.yaml --weights ./yolov5s.pt --epochs 1 --device 0,1
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=./yolov5s.pt, cfg=, data=./data/mycoco2017.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=1, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
Command 'git fetch origin' timed out after 5 seconds
YOLOv5 🚀 v6.2-0-gd3ea0df Python-3.9.0 torch-1.13.0+cu116 CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB) CUDA:1 (NVIDIA A100 80GB PCIe, 81100MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs in Weights & Biases
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

The output then stops. Training on a single GPU works without problems.

Environment

YOLOv5 🚀 v6.2-0-gd3ea0df Python-3.9.0 torch-1.13.0+cu116
CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB)
CUDA:1 (NVIDIA A100 80GB PCIe, 81100MiB)
OS: Ubuntu 22.04.1 LTS
python: 3.9

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

[ ] Yes I'd like to help by submitting a PR!

glenn-jocher commented 1 year ago

@menkeyi Always train DDP in Docker to reduce environment risk.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

menkeyi commented 1 year ago

I built the Docker image from yolov5/utils/docker/Dockerfile, but DDP training still has the same problem:

root@d9fff3881c10:/usr/src/app# python -m torch.distributed.run --nproc_per_node 2 --master_port 1 train.py --epochs 3 --data data/mycoco2017.yaml
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=yolov5s.pt, cfg=, data=data/mycoco2017.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.2-226-gfde7758 Python-3.8.13 torch-1.13.0a0+08820cb CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

glenn-jocher commented 1 year ago

@menkeyi DDP training in Docker works correctly. I am currently training two models on 4 GPUs each without problems. Make sure your master port is not already in use.
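One way to check the master port before launching is a small standard-library script like the sketch below. The helper names `port_is_free` and `find_free_port` are hypothetical and not part of YOLOv5 or PyTorch. The script only tests whether a TCP socket can be bound on the local machine, so it cannot rule out other causes of a hang. Note also that ports below 1024 are privileged on Linux and normally need root to bind.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or "permission denied".
            return False
    return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    # Another process could still claim the port between this check and the
    # launch, so treat the result as a hint rather than a guarantee.
    print(f"--master_port {port}")
```

The printed value can then be passed as `--master_port` to `python -m torch.distributed.run`, as in the commands above.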

github-actions[bot] commented 1 year ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Wiki – https://github.com/ultralytics/yolov5/wiki
Tutorials – https://docs.ultralytics.com/yolov5
Docs – https://docs.ultralytics.com

Access additional Ultralytics ⚡ resources:

Ultralytics HUB – https://ultralytics.com/hub
Vision API – https://ultralytics.com/yolov5
About Us – https://ultralytics.com/about
Join Our Team – https://ultralytics.com/work
Contact Us – https://ultralytics.com/contact

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
