mattpopovich closed this issue 3 years ago.
👋 Hello @mattpopovich, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:
$ pip install -r requirements.txt
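As a quick sanity check before training (not part of the official instructions; just an illustrative snippet), you can confirm the Python/PyTorch versions and GPU visibility from a Python shell:

```python
# Illustrative environment check (not part of the YOLOv5 repo).
import sys
import torch

print(f"Python  : {sys.version.split()[0]}")   # expect 3.8 or later
print(f"PyTorch : {torch.__version__}")        # expect >= 1.7
print(f"CUDA OK : {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```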
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
For what it's worth, I took one of the 1080 GPUs out of the PC, so it now has 2x 1080s. Multi-GPU training still freezes. I'm not sure if that's a sign of something wrong with my PC or my configuration. I'll continue to look into it.
root@PC:/home/username/git/yolov5# python -m torch.distributed.launch --nproc_per_node 2 train.py --img 640 --batch 12 --epochs 5 --data data/coco128.yaml --weights yolov5s.pt
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
github: skipping check (Docker image)
YOLOv5 🚀 v4.0-138-ged2c742 torch 1.8.0a0+52ea372 CUDA:0 (GeForce GTX 1080, 8118.25MB)
CUDA:1 (GeForce GTX 1080, 8119.5625MB)
Added key: store_based_barrier_key:1 to store for rank: 0
Namespace(adam=False, batch_size=6, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='', entity=None, epochs=5, evolve=False, exist_ok=False, global_rank=0, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], linear_lr=False, local_rank=0, log_artifacts=False, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/exp51', single_cls=False, sync_bn=False, total_batch_size=12, weights='yolov5s.pt', workers=8, world_size=2)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0
***[hangs here]***
^C
Killing subprocess 5471
Killing subprocess 5472
Main process received SIGINT, exiting
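Not part of the original thread, but as a debugging aid: when a DDP launch hangs before training like this, a stripped-down smoke test can help separate NCCL/driver problems from YOLOv5 itself. A minimal sketch, assuming the same torch.distributed.launch entry point and 2 visible GPUs (the script name ddp_smoke_test.py is hypothetical):

```python
# ddp_smoke_test.py -- minimal DDP sanity check (illustrative, not part of YOLOv5).
# Run with: python -m torch.distributed.launch --nproc_per_node 2 ddp_smoke_test.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# A single all-reduce: if every rank prints a result, NCCL itself is working
# and the hang is more likely in the training script or dataloaders.
x = torch.ones(1, device=args.local_rank) * dist.get_rank()
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} all_reduce result: {x.item()}")

dist.destroy_process_group()
```

Setting NCCL_DEBUG=INFO in the environment before launching can also surface which step the collective setup is stuck on.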
@mattpopovich thanks for the bug report. In general it is not recommended to train with odd GPU counts, or different types of GPUs on a single system. We don't run any CI tests with this sort of setup, nor do any cloud providers provide these sorts of systems.
I see in your SMI output that device 1 shows a slightly different memory profile than devices 0 and 2, so this may be the cause of the errors you are seeing (perhaps different OEMs?), or you may have environment/driver issues on your local machine.
The batch divisibility check should reflect your utilized device count rather than your total device count, so I will add a TODO to look into this specifically. I will also link our other supported environments below, which you may want to try training on.
Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again. We also highly recommend using one of our verified environments below.
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
TODO: batch divisibility check should use utilized devices rather than total devices.
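For illustration only, a minimal sketch of the distinction the TODO refers to, assuming a helper named check_batch_size that is not the actual train.py code:

```python
# Illustrative sketch of the batch-size divisibility check (not the actual YOLOv5 code).
import torch

def check_batch_size(batch_size: int, world_size: int) -> None:
    total_devices = torch.cuda.device_count()  # e.g. 3 on the reporter's machine
    utilized_devices = world_size              # e.g. 2 with --nproc_per_node 2

    # Old behaviour (the reported bug): validated against every GPU in the system,
    # e.g. assert batch_size % total_devices == 0
    # Fixed behaviour: validate only against the devices actually used for DDP.
    assert batch_size % utilized_devices == 0, (
        f"--batch-size {batch_size} must be a multiple of the "
        f"{utilized_devices} utilized GPUs (system has {total_devices} total)"
    )

check_batch_size(batch_size=12, world_size=2)  # passes: 12 % 2 == 0
```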
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@mattpopovich good news 😃! Your original issue may now be partially fixed ✅ in PR #3276. This PR checks the utilized device count rather than total device count when checking batch_size divisibility among your CUDA devices in DDP. To receive this update you can:
- Git: git pull from within your yolov5/ directory
- Clone: git clone https://github.com/ultralytics/yolov5 again
- PyTorch Hub: model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
🐛 Bug
I have a system with 3x 1080 GPUs. When I attempt to run multi-GPU training, it seems to load the model onto the GPUs (I can tell from nvidia-smi), but it hangs before actually training. I've been able to confirm that the command I'm using works for multi-GPU training on a 4x V100 system, so I believe the command is correct. Both systems are running the yolov5 Docker container.
To Reproduce
1) Enter the Docker container
2) Begin multi-GPU training on a system with 3x GPUs
Output
Another interesting thing I noticed: if I try to train on 2 GPUs but use a batch size that is only divisible by 2 (and not by 3), it errors out telling me that the batch is not divisible by 3. That shouldn't matter, because I'm trying to train on 2 GPUs, not 3.
Expected behavior
I expected yolov5 to show a description of the model's layers being loaded onto the GPUs and then to begin training.
Environment
Additional context
What I see from nvidia-smi when attempting to train on 2 of the 3 GPUs with a batch size of 12, while yolov5 is stuck and frozen/hanging. It seems like GPU 0 doesn't finish initializing? It has half the memory usage of GPU 1.
This is what I see on PC2 once the model has been loaded (and before it begins training):