ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Multiple GPUs cannot be trained #10045

Status: Closed (closed by menkeyi 1 year ago)

menkeyi commented 1 year ago

YOLOv5 Component

Training, Multi-GPU

Bug

YOLOv5 🚀 v6.2-0-gd3ea0df Python-3.9.0 torch-1.13.0+cu116 CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB)
CUDA:1 (NVIDIA A100 80GB PCIe, 81100MiB)
OS: Ubuntu 22.04.1 LTS
python: 3.9

(yolov5_cuda11_6) root@admin:~/git_project/yolov5# python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 32 --data ./data/mycoco2017.yaml --weights ./yolov5s.pt --epochs 1 --device 0,1
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout)
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=./yolov5s.pt, cfg=, data=./data/mycoco2017.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=1, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
Command 'git fetch origin' timed out after 5 seconds
YOLOv5 🚀 v6.2-0-gd3ea0df Python-3.9.0 torch-1.13.0+cu116 CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB)
CUDA:1 (NVIDIA A100 80GB PCIe, 81100MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs in Weights & Biases
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

Then the output stops. There is no problem when training on a single GPU.
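When a DDP launch stalls silently like this, one way to collect more information is to relaunch the same command with verbose distributed logging enabled. The sketch below only assembles the command line and environment and does not start any process. `build_debug_launch` is a hypothetical helper (not part of YOLOv5), and the port 29501 is an arbitrary assumption. `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are environment variables read by NCCL and PyTorch respectively; whether their output pinpoints this particular hang is not known from the report.

```python
import os
import sys


def build_debug_launch(nproc, train_args, master_port=29501):
    """Return (argv, env) for a DDP launch with verbose NCCL/PyTorch logging.

    Hypothetical helper for illustration; it does not start any process.
    """
    argv = [
        sys.executable, "-m", "torch.distributed.run",
        "--nproc_per_node", str(nproc),
        "--master_port", str(master_port),  # assumed port, pick any free one
        "train.py", *train_args,
    ]
    env = dict(os.environ)                  # copy, so the caller's env is untouched
    env["NCCL_DEBUG"] = "INFO"              # NCCL prints init and transport details
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra PyTorch distributed checks
    return argv, env


argv, env = build_debug_launch(
    2,
    ["--batch-size", "32", "--data", "./data/mycoco2017.yaml",
     "--weights", "./yolov5s.pt", "--epochs", "1", "--device", "0,1"],
)
print(" ".join(argv))
# To actually run it: subprocess.run(argv, env=env, check=True)
```

Running the printed command with the returned environment should make each rank log where it is during process-group setup, which can help narrow down which rank stops responding.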

Environment

YOLOv5 🚀 v6.2-0-gd3ea0df Python-3.9.0 torch-1.13.0+cu116 CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB)
CUDA:1 (NVIDIA A100 80GB PCIe, 81100MiB)
OS: Ubuntu 22.04.1 LTS
python: 3.9

Minimal Reproducible Example

No response

Additional

No response

glenn-jocher commented 1 year ago

@menkeyi always train DDP in Docker for reduced environment risk.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

menkeyi commented 1 year ago

@menkeyi always train DDP in Docker for reduced environment risk.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Build Docker Image: yolov5/utils/docker/Dockerfile

DDP training still has the same problem in Docker:

root@d9fff3881c10:/usr/src/app# python -m torch.distributed.run --nproc_per_node 2 --master_port 1 train.py --epochs 3 --data data/mycoco2017.yaml
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=yolov5s.pt, cfg=, data=data/mycoco2017.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=3, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.2-226-gfde7758 Python-3.8.13 torch-1.13.0a0+08820cb CUDA:0 (NVIDIA A100 80GB PCIe, 81100MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

glenn-jocher commented 1 year ago

@menkeyi DDP training in Docker works correctly. I'm currently training two 4x GPU models without problems. Make sure your master port is not already in use.
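As a rough illustration of the port check suggested here, the following sketch uses only the Python standard library to test whether a TCP port can be bound and to ask the OS for an unused one. `port_is_free` and `find_free_port` are hypothetical helper names, not YOLOv5 or PyTorch functions, and the check assumes the rendezvous happens on the local host. Note also that the command above passes `--master_port 1`; on Linux, ports below 1024 are normally reserved for privileged processes, so a higher port may be a safer choice.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use" or "permission denied".
            return False
        return True


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


port = find_free_port()
print(f"--master_port {port}")
```

The result is only a snapshot: another process could take the port between the check and the launch, so this reduces the chance of a collision rather than ruling it out.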

glenn-jocher commented 1 year ago

[Image: Screenshot 2022-11-06 at 17 31 12]

github-actions[bot] commented 1 year ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!