ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Training with Docker and recommended DDP command on multi-GPUs will hang #7336

Closed: ChongyuNVIDIA closed this issue 2 years ago

ChongyuNVIDIA commented 2 years ago

YOLOv5 Component

Multi-GPU

Bug

I followed the recommended steps, using the Docker image and the recommended DDP multi-GPU training command. However, training hangs at the first training epoch.

Environment

YOLOv5 version: latest, commit 0ca85ed65f124871fa7686dcf0efbd8dc9699856
GPU type: Tesla V100-SXM2-16GB-N, 16160MiB
GPU count: 8
Docker image: nvidia/pytorch:21.10-py3
PyTorch version: torch 1.11.0+cu113
Torchvision version: torchvision 0.12.0+cu113
Driver version: 470.82.01
CUDA version: 11.3

Minimal Reproducible Example

The command lines to prepare the env:

apt update && apt install -y zip htop screen libgl1-mesa-glx

python -m pip install --upgrade pip

pip uninstall -y torch torchvision torchtext

pip install --no-cache -r requirements.txt albumentations wandb gsutil notebook \
    torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

export OMP_NUM_THREADS=8

The command is as follows:

python -u -m torch.distributed.launch --nproc_per_node 8 train.py --data coco_Chong_NGC.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 1024 --imgsz 640 --device 0,1,2,3,4,5,6,7

The log for this training and hang:

root@2789503:/ngc_0/Ultralytics/YOLOv5# python -u -m torch.distributed.launch --nproc_per_node 8 train.py --data coco.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 1024 --imgsz 640 --device 0,1,2,3,4,5,6,7
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
Downloading https://ultralytics.com/assets/Arial.ttf to /root/.config/Ultralytics/Arial.ttf...
train: weights=, cfg=yolov5n.yaml, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=1024, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3,4,5,6,7, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=True, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 287bae0 torch 1.11.0+cu113 CUDA:0 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:1 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:2 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:3 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:4 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:5 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:6 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:7 (Tesla V100-SXM2-16GB-N, 16160MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments
  0                -1  1      1760  models.common.Conv                      [3, 16, 6, 2, 2]
  1                -1  1      4672  models.common.Conv                      [16, 32, 3, 2]
  2                -1  1      4800  models.common.C3                        [32, 32, 1]
  3                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  4                -1  2     29184  models.common.C3                        [64, 64, 2]
  5                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  6                -1  3    156928  models.common.C3                        [128, 128, 3]
  7                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  8                -1  1    296448  models.common.C3                        [256, 256, 1]
  9                -1  1    164608  models.common.SPPF                      [256, 256, 5]
 10                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 14                -1  1      8320  models.common.Conv                      [128, 64, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     22912  models.common.C3                        [128, 64, 1, False]
 18                -1  1     36992  models.common.Conv                      [64, 64, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1     74496  models.common.C3                        [128, 128, 1, False]
 21                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 24      [17, 20, 23]  1    115005  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64, 128, 256]]
YOLOv5n summary: 270 layers, 1872157 parameters, 1872157 gradients, 4.5 GFLOPs

Scaled weight_decay = 0.008
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
Using SyncBatchNorm()
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))

train: Scanning '/ngc_0/Ultralytics/datasets/coco/train2017.cache' images and label
val: Scanning '/ngc_0/Ultralytics/datasets/coco/val2017.cache' images and labels...
Plotting labels to runs/train/exp/labels.jpg...

AutoAnchor: 4.45 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 64 dataloader workers
Logging results to runs/train/exp
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0%|          | 0/116 [00:00<?, ?it/s]

Additional

No response

Are you willing to submit a PR?

github-actions[bot] commented 2 years ago

👋 Hello @ChongyuNVIDIA, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 2 years ago

@ChongyuNVIDIA I've run your exact command with a smaller batch size since our machines are all in use, and I have no problems. Note the warning that torch.distributed.launch is deprecated in favor of torch.distributed.run, but this still worked for me.

python -u -m torch.distributed.launch --nproc_per_node 8 train.py --data coco.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 64 --imgsz 640 --device 0,1,2,3,4,5,6,7
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=, cfg=yolov5n.yaml, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=64, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3,4,5,6,7, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=True, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.1-111-gb7faeda torch 1.11.0+cu113 CUDA:0 (A100-SXM-80GB, 81251MiB)
                                               CUDA:1 (A100-SXM-80GB, 81251MiB)
                                               CUDA:2 (A100-SXM-80GB, 81251MiB)
                                               CUDA:3 (A100-SXM-80GB, 81251MiB)
                                               CUDA:4 (A100-SXM-80GB, 81251MiB)
                                               CUDA:5 (A100-SXM-80GB, 81251MiB)
                                               CUDA:6 (A100-SXM-80GB, 81251MiB)
                                               CUDA:7 (A100-SXM-80GB, 81251MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1      1760  models.common.Conv                      [3, 16, 6, 2, 2]              
  1                -1  1      4672  models.common.Conv                      [16, 32, 3, 2]                
  2                -1  1      4800  models.common.C3                        [32, 32, 1]                   
  3                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  4                -1  2     29184  models.common.C3                        [64, 64, 2]                   
  5                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  6                -1  3    156928  models.common.C3                        [128, 128, 3]                 
  7                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  8                -1  1    296448  models.common.C3                        [256, 256, 1]                 
  9                -1  1    164608  models.common.SPPF                      [256, 256, 5]                 
 10                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 14                -1  1      8320  models.common.Conv                      [128, 64, 1, 1]               
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     22912  models.common.C3                        [128, 64, 1, False]           
 18                -1  1     36992  models.common.Conv                      [64, 64, 3, 2]                
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1     74496  models.common.C3                        [128, 128, 1, False]          
 21                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 24      [17, 20, 23]  1    115005  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64, 128, 256]]
YOLOv5n summary: 270 layers, 1872157 parameters, 1872157 gradients, 4.5 GFLOPs

Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
Using SyncBatchNorm()
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
val: Scanning '/usr/src/datasets/coco/val2017.cache' images and labels... 4952 found, 48 missing, 0 empty, 0 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]                                  
Plotting labels to runs/train/exp/labels.jpg... 
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      

AutoAnchor: 4.45 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 64 dataloader workers
Logging results to runs/train/exp
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
     0/299    0.971G    0.1124    0.0528    0.1085        58       640:   0%|          | 1/1849 [00:03<1:57:24,  3.81s/it]                                                                         Reducer buckets have been rebuilt in this iteration.
     0/299        1G    0.1047   0.08592    0.1023       133       640:   3%|▎         | 58/1849 [00:49<24:51,  1.20it/s]  
glenn-jocher commented 2 years ago

@ChongyuNVIDIA with torch.distributed.run all warnings disappear for me and training runs equally well.
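
For reference, a minimal sketch of the same launch via the non-deprecated module is below; all arguments are carried over unchanged from the command above (the deprecation warning itself recommends torchrun, which is the equivalent standalone entry point on this torch version):

python -u -m torch.distributed.run --nproc_per_node 8 train.py --data coco.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 64 --imgsz 640 --device 0,1,2,3,4,5,6,7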

glenn-jocher commented 2 years ago

@ChongyuNVIDIA I pushed #7337 to resolve the duplicated dataset-scanning printout with DDP. It's unrelated to your issue, but it's something I noticed when looking at our output.

ChongyuNVIDIA commented 2 years ago

In DDP, the batch size on the command line is the global batch size, right? So in your case, the batch size for each GPU is only 64/8 = 8, right?

ChongyuNVIDIA commented 2 years ago

In my training process, the steps at the following point take a long time. I'm also curious whether this is expected behavior:

Using SyncBatchNorm()
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]    
glenn-jocher commented 2 years ago

@ChongyuNVIDIA yes, --batch-size is the total batch size across all GPUs, so in my command each GPU uses a batch size of 64/8 = 8.
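
As a quick worked check (illustrative arithmetic only, not output from this thread), the per-GPU batch size for the two commands discussed here is:

echo $((1024 / 8))  # 128 images per GPU for the original --batch-size 1024 run
echo $((64 / 8))    # 8 images per GPU for the --batch-size 64 run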

Some steps may take a long time, especially for large datasets. In my example above it takes about 30 seconds from launching the command to the first training batches, but larger datasets, AutoAnchor, etc. may take several minutes.

ChongyuNVIDIA commented 2 years ago

I tried batch size 64 as @glenn-jocher did, but training still hangs at the beginning.

train: Scanning '/ngc_0/Ultralytics/datasets/coco/train2017.cache' images and label
val: Scanning '/ngc_0/Ultralytics/datasets/coco/val2017.cache' images and labels...
Plotting labels to runs/train/exp3/labels.jpg...
AutoAnchor: 4.45 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 64 dataloader workers
Logging results to runs/train/exp3
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0%|          | 0/1849 [00:00<?, ?it/s]
glenn-jocher commented 2 years ago

Sure, the batch size is irrelevant; my example used 64 only because our GPUs are already in use training other models.


github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

tarunsharma1 commented 1 year ago

I had the same issue, and in fact using a smaller batch size did fix it. If it is hanging, your batch size is still too big.

glenn-jocher commented 1 year ago

@tarunsharma1 I had the same issue and indeed, using a smaller batch size resolved it for me. If your training process is hanging, it is likely that your batch size is still too large.
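
A minimal sketch of that retry, assuming the same setup as the original report and an illustrative smaller global batch size of 256 (256/8 = 32 images per GPU on 8 GPUs):

python -u -m torch.distributed.run --nproc_per_node 8 train.py --data coco.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 256 --imgsz 640 --device 0,1,2,3,4,5,6,7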