Closed ChongyuNVIDIA closed 2 years ago
👋 Hello @ChongyuNVIDIA, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue; otherwise we cannot help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.
Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.
@ChongyuNVIDIA I've run your exact command with a smaller batch size since our machines are all in use, and I have no problems. Note the warning that torch.distributed.launch is deprecated in favor of torch.distributed.run, but this still worked for me.
python -u -m torch.distributed.launch --nproc_per_node 8 train.py --data coco.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 64 --imgsz 640 --device 0,1,2,3,4,5,6,7
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=, cfg=yolov5n.yaml, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=64, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3,4,5,6,7, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=True, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.1-111-gb7faeda torch 1.11.0+cu113 CUDA:0 (A100-SXM-80GB, 81251MiB)
CUDA:1 (A100-SXM-80GB, 81251MiB)
CUDA:2 (A100-SXM-80GB, 81251MiB)
CUDA:3 (A100-SXM-80GB, 81251MiB)
CUDA:4 (A100-SXM-80GB, 81251MiB)
CUDA:5 (A100-SXM-80GB, 81251MiB)
CUDA:6 (A100-SXM-80GB, 81251MiB)
CUDA:7 (A100-SXM-80GB, 81251MiB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
    from          n  params  module                                arguments
 0  -1            1    1760  models.common.Conv                    [3, 16, 6, 2, 2]
 1  -1            1    4672  models.common.Conv                    [16, 32, 3, 2]
 2  -1            1    4800  models.common.C3                      [32, 32, 1]
 3  -1            1   18560  models.common.Conv                    [32, 64, 3, 2]
 4  -1            2   29184  models.common.C3                      [64, 64, 2]
 5  -1            1   73984  models.common.Conv                    [64, 128, 3, 2]
 6  -1            3  156928  models.common.C3                      [128, 128, 3]
 7  -1            1  295424  models.common.Conv                    [128, 256, 3, 2]
 8  -1            1  296448  models.common.C3                      [256, 256, 1]
 9  -1            1  164608  models.common.SPPF                    [256, 256, 5]
10  -1            1   33024  models.common.Conv                    [256, 128, 1, 1]
11  -1            1       0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
12  [-1, 6]       1       0  models.common.Concat                  [1]
13  -1            1   90880  models.common.C3                      [256, 128, 1, False]
14  -1            1    8320  models.common.Conv                    [128, 64, 1, 1]
15  -1            1       0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
16  [-1, 4]       1       0  models.common.Concat                  [1]
17  -1            1   22912  models.common.C3                      [128, 64, 1, False]
18  -1            1   36992  models.common.Conv                    [64, 64, 3, 2]
19  [-1, 14]      1       0  models.common.Concat                  [1]
20  -1            1   74496  models.common.C3                      [128, 128, 1, False]
21  -1            1  147712  models.common.Conv                    [128, 128, 3, 2]
22  [-1, 10]      1       0  models.common.Concat                  [1]
23  -1            1  296448  models.common.C3                      [256, 256, 1, False]
24  [17, 20, 23]  1  115005  models.yolo.Detect                    [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64, 128, 256]]
YOLOv5n summary: 270 layers, 1872157 parameters, 1872157 gradients, 4.5 GFLOPs
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
Using SyncBatchNorm()
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]
val: Scanning '/usr/src/datasets/coco/val2017.cache' images and labels... 4952 found, 48 missing, 0 empty, 0 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]
Plotting labels to runs/train/exp/labels.jpg...
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]
AutoAnchor: 4.45 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 64 dataloader workers
Logging results to runs/train/exp
Starting training for 300 epochs...
Epoch gpu_mem box obj cls labels img_size
0/299 0.971G 0.1124 0.0528 0.1085 58 640: 0%| | 1/1849 [00:03<1:57:24, 3.81s/it]
Reducer buckets have been rebuilt in this iteration.
0/299 1G 0.1047 0.08592 0.1023 133 640: 3%|▎ | 58/1849 [00:49<24:51, 1.20it/s]
@ChongyuNVIDIA with torch.distributed.run all warnings disappear for me and training runs equally well.
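The deprecation warning in the log above recommends reading the local rank from the environment instead of from a `--local_rank` argument. A minimal sketch of that pattern follows. The helper name `read_ddp_env` and the single-process fallback values are assumptions for illustration, not YOLOv5's actual code; `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` are the environment variables that torchrun sets for each worker process.

```python
import os


def read_ddp_env(environ=os.environ):
    """Return (local_rank, rank, world_size) from torchrun-style variables.

    Hypothetical helper: falls back to single-process values when the
    variables are absent, e.g. when started with plain `python train.py`.
    """
    local_rank = int(environ.get("LOCAL_RANK", -1))  # process index on this node
    rank = int(environ.get("RANK", -1))              # global process index
    world_size = int(environ.get("WORLD_SIZE", 1))   # total number of processes
    return local_rank, rank, world_size


if __name__ == "__main__":
    local_rank, rank, world_size = read_ddp_env()
    print(f"local_rank={local_rank} rank={rank} world_size={world_size}")
```

When launched through torch.distributed.run with `--nproc_per_node 8` on a single node, each of the eight workers should see a different `LOCAL_RANK` between 0 and 7 and a `WORLD_SIZE` of 8.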
@ChongyuNVIDIA I pushed #7337 to resolve the repeated dataset-scanning printout under DDP. It is unrelated to your issue, but I noticed it when looking at our output.
In DDP, the batch size in the command line is the global batch size, right? So in your case, the batch size for each GPU is only 64/8=8 right?
In my training run, the steps at the following point in the log take a long time. Is that expected behavior?
Using SyncBatchNorm()
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]
@ChongyuNVIDIA yes, --batch is the total batch size across all GPUs, so in my command each GPU uses a batch size of 64/8 = 8.
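The arithmetic above can be sketched as follows. This is an illustration under stated assumptions, not YOLOv5's own code: the function names are hypothetical, the global batch size is assumed to divide evenly across GPUs, and each GPU is assumed to receive an equal share of the dataset (rounded up), as a distributed sampler typically provides. The figure 118287 is the train image count reported in the log earlier in this thread.

```python
import math


def per_gpu_batch_size(total_batch_size: int, world_size: int) -> int:
    """Split a global batch size evenly across `world_size` GPUs."""
    if total_batch_size % world_size != 0:
        raise ValueError(
            f"batch size {total_batch_size} is not a multiple of {world_size} GPUs"
        )
    return total_batch_size // world_size


def iterations_per_epoch(num_images: int, total_batch_size: int, world_size: int) -> int:
    """Batches each GPU processes per epoch, under the assumptions above."""
    images_per_gpu = math.ceil(num_images / world_size)  # equal share, rounded up
    return math.ceil(images_per_gpu / per_gpu_batch_size(total_batch_size, world_size))


if __name__ == "__main__":
    print(per_gpu_batch_size(64, 8))            # 8
    print(iterations_per_epoch(118287, 64, 8))  # 1849
```

The second result agrees with the progress bar total (1/1849) shown in the training log.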
Some steps may take a long time, especially for large datasets. In my example above it probably takes about 30 seconds from launching the command to training the first batches, but larger datasets, AutoAnchor etc. may take several minutes.
I tried batch size 64 as @glenn-jocher did, but training still hangs at the beginning.
train: Scanning '/ngc_0/Ultralytics/datasets/coco/train2017.cache' images and label
val: Scanning '/ngc_0/Ultralytics/datasets/coco/val2017.cache' images and labels...
Plotting labels to runs/train/exp3/labels.jpg...
AutoAnchor: 4.45 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 64 dataloader workers
Logging results to runs/train/exp3
Starting training for 300 epochs...
Epoch gpu_mem box obj cls labels img_size
0%| | 0/1849 [00:00<?, ?it/s]
Sure, batch size is irrelevant; my example used 64 only because our GPUs are already in use training other models.
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
I had the same issue, and in fact using a smaller batch size did fix it. If it is hanging, your batch size is still too big.
@tarunsharma1 I had the same issue and indeed, using a smaller batch size did resolve it for me. If your training process is hanging, it is likely that your batch size is still too large.
Search before asking
YOLOv5 Component
Multi-GPU
Bug
I followed the recommended steps, using the Docker image and DDP multi-GPU training. However, training hangs during the first epoch.
Environment
YOLO version: latest with commit id: 0ca85ed65f124871fa7686dcf0efbd8dc9699856
GPU Type: Tesla V100-SXM2-16GB-N, 16160MiB
GPU Number: 8
Docker: nvidia/pytorch:21.10-py3
PyTorch Version: torch 1.11.0+cu113
Torchvision Version: torchvision 0.12.0+cu113
Driver Version: 470.82.01
CUDA Version: 11.3
Minimal Reproducible Example
The command lines to prepare the env:
The command is as follows:
The log for this training and hang:
Additional
No response
Are you willing to submit a PR?