mattpopovich closed this issue 3 years ago.
👋 Hello @mattpopovich, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:
$ pip install -r requirements.txt
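As a quick sanity check before training (not part of the official instructions; just an illustrative snippet), you can confirm the Python/PyTorch versions and GPU visibility from a Python shell:

```python
# Illustrative environment check (not part of the YOLOv5 repo).
import sys
import torch

print(f"Python  : {sys.version.split()[0]}")   # expect 3.8 or later
print(f"PyTorch : {torch.__version__}")        # expect >= 1.7
print(f"CUDA OK : {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```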
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
For what it's worth, I took one of the 1080 GPUs out of the PC, so it now has 2x 1080s. Multi-GPU training still freezes. I'm not sure if that's a sign of something wrong with my PC or my configuration. I'll continue to look into it.
root@PC:/home/username/git/yolov5# python -m torch.distributed.launch --nproc_per_node 2 train.py --img 640 --batch 12 --epochs 5 --data data/coco128.yaml --weights yolov5s.pt
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
github: skipping check (Docker image)
YOLOv5 🚀 v4.0-138-ged2c742 torch 1.8.0a0+52ea372 CUDA:0 (GeForce GTX 1080, 8118.25MB)
CUDA:1 (GeForce GTX 1080, 8119.5625MB)
Added key: store_based_barrier_key:1 to store for rank: 0
Namespace(adam=False, batch_size=6, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='', entity=None, epochs=5, evolve=False, exist_ok=False, global_rank=0, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], linear_lr=False, local_rank=0, log_artifacts=False, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/exp51', single_cls=False, sync_bn=False, total_batch_size=12, weights='yolov5s.pt', workers=8, world_size=2)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0
***[hangs here]***
^C
Killing subprocess 5471
Killing subprocess 5472
Main process received SIGINT, exiting
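Not part of the original thread, but as a debugging aid: when a DDP launch hangs before training like this, a stripped-down smoke test can help separate NCCL/driver problems from YOLOv5 itself. A minimal sketch, assuming the same torch.distributed.launch entry point and 2 visible GPUs (the script name ddp_smoke_test.py is hypothetical):

```python
# ddp_smoke_test.py -- minimal DDP sanity check (illustrative, not part of YOLOv5).
# Run with: python -m torch.distributed.launch --nproc_per_node 2 ddp_smoke_test.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# A single all-reduce: if every rank prints a result, NCCL itself is working
# and the hang is more likely in the training script or dataloaders.
x = torch.ones(1, device=args.local_rank) * dist.get_rank()
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} all_reduce result: {x.item()}")

dist.destroy_process_group()
```

Setting NCCL_DEBUG=INFO in the environment before launching can also surface which step the collective setup is stuck on.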
@mattpopovich thanks for the bug report. In general it is not recommended to train with odd GPU counts, or different types of GPUs on a single system. We don't run any CI tests with this sort of setup, nor do any cloud providers provide these sorts of systems.
I see in your SMI output that device 1 shows a slightly different memory profile than devices 0 and 2, so this may be the cause of the errors you are seeing (perhaps different OEMs?), or you may have environment/driver issues on your local machine.
The batch divisibility check should reflect your utilized device count rather than your total device count, so I will add a TODO to look into this specifically. I will also link our other supported environments below, which you may want to try training on.
Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again. We also highly recommend using one of our verified environments below.
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
TODO: batch divisibility check should use utilized devices rather than total devices.
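For illustration only, a minimal sketch of the distinction the TODO refers to, assuming a helper named check_batch_size that is not the actual train.py code:

```python
# Illustrative sketch of the batch-size divisibility check (not the actual YOLOv5 code).
import torch

def check_batch_size(batch_size: int, world_size: int) -> None:
    total_devices = torch.cuda.device_count()  # e.g. 3 on the reporter's machine
    utilized_devices = world_size              # e.g. 2 with --nproc_per_node 2

    # Old behaviour (the reported bug): validated against every GPU in the system,
    # e.g. assert batch_size % total_devices == 0
    # Fixed behaviour: validate only against the devices actually used for DDP.
    assert batch_size % utilized_devices == 0, (
        f"--batch-size {batch_size} must be a multiple of the "
        f"{utilized_devices} utilized GPUs (system has {total_devices} total)"
    )

check_batch_size(batch_size=12, world_size=2)  # passes: 12 % 2 == 0
```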
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@mattpopovich good news 😃! Your original issue may now be partially fixed ✅ in PR #3276. This PR checks the utilized device count rather than total device count when checking batch_size divisibility among your CUDA devices in DDP. To receive this update you can:
- Git: git pull from within your yolov5/ directory
- Clone: git clone https://github.com/ultralytics/yolov5 again
- PyTorch Hub: model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
🐛 Bug
I have a system with 3x 1080 GPUs. When I attempt to run multi-GPU training, it seems to load the model onto the GPUs (I can tell from nvidia-smi), but it hangs before actually training. I've been able to confirm that the command I'm using works for multi-GPU training on a 4x V100 system, so I believe the command is correct. Both systems are running the yolov5 Docker container.
To Reproduce
1) Enter the Docker container
2) Begin multi-GPU training on a system with 3x GPUs
Output
Another interesting thing I noticed: if I try to train on 2 GPUs but use a batch size that is only divisible by 2 (and not by 3), it errors out telling me that the batch is not divisible by 3. That shouldn't matter, because I'm trying to train on 2 GPUs, not 3.
Expected behavior
I expected yolov5 to show a description of the model's layers being loaded onto the GPUs and then to begin training.
Environment
Additional context
What I see from nvidia-smi when attempting to train on 2 of the 3 GPUs with a batch size of 12, while yolov5 is stuck and frozen/hanging. It seems like GPU 0 doesn't finish initializing? It has half the memory usage of GPU 1.
This is what I see on PC2 once the model has been loaded (and before it begins training):