Closed Idefix0496 closed 2 years ago
👋 Hello @Idefix0496, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue; otherwise we cannot help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.
Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
@Idefix0496 --batch-size works correctly.
Up to 8 dataloader workers are allowed per RANK. If your batch size is less than 8 per RANK then the worker count is reduced to match, otherwise there will be excess workers.
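The rule above can be sketched in a few lines of Python. This is only an illustration of the stated behaviour, not the actual YOLOv5 code: the function name `workers_per_rank`, the `cpu_count` cap, and the assumption that `--batch-size` is the total batch split evenly across RANKs are all hypothetical choices made for this example.

```python
def workers_per_rank(total_batch_size: int, world_size: int,
                     cpu_count: int, max_workers: int = 8) -> int:
    """Hypothetical sketch: pick the dataloader worker count for one RANK."""
    # Assumption: --batch-size is the total batch, split evenly across RANKs.
    per_rank_batch = total_batch_size // world_size
    # Assumption: each RANK may use at most its share of the CPU cores.
    cpus_per_rank = cpu_count // max(world_size, 1)
    # At most 8 workers per RANK, reduced when the per-RANK batch is smaller.
    return min(max_workers, per_rank_batch, cpus_per_rank)


if __name__ == "__main__":
    for total in (4, 8, 16, 64):
        n = workers_per_rank(total, world_size=2, cpu_count=32)
        print(f"--batch-size {total}: {n} workers per RANK")
```

Under these assumptions, `--batch-size 4` on 2 GPUs gives 2 workers per RANK. If the logged figure is summed across RANKs (also an assumption), that would match the `Using 4 dataloader workers` line in the log, while the `batch_size=4` entry in the `train:` line shows the batch size itself was applied.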
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Access additional YOLOv5 🚀 resources:
Access additional Ultralytics ⚡ resources:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
Search before asking
YOLOv5 Component
Training
Bug
root@94ceb79c1cd3:/usr/src/app# python3 -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 4 --epochs 3 --img 640 --data coco128.yaml --weights yolov5s.pt
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=3, batch_size=4, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 c768919 Python-3.8.13 torch-1.12.0+cu113 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11264MiB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
 0  -1            1     3520  models.common.Conv                    [3, 32, 6, 2, 2]
 1  -1            1    18560  models.common.Conv                    [32, 64, 3, 2]
 2  -1            1    18816  models.common.C3                      [64, 64, 1]
 3  -1            1    73984  models.common.Conv                    [64, 128, 3, 2]
 4  -1            2   115712  models.common.C3                      [128, 128, 2]
 5  -1            1   295424  models.common.Conv                    [128, 256, 3, 2]
 6  -1            3   625152  models.common.C3                      [256, 256, 3]
 7  -1            1  1180672  models.common.Conv                    [256, 512, 3, 2]
 8  -1            1  1182720  models.common.C3                      [512, 512, 1]
 9  -1            1   656896  models.common.SPPF                    [512, 512, 5]
10  -1            1   131584  models.common.Conv                    [512, 256, 1, 1]
11  -1            1        0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
12  [-1, 6]       1        0  models.common.Concat                  [1]
13  -1            1   361984  models.common.C3                      [512, 256, 1, False]
14  -1            1    33024  models.common.Conv                    [256, 128, 1, 1]
15  -1            1        0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
16  [-1, 4]       1        0  models.common.Concat                  [1]
17  -1            1    90880  models.common.C3                      [256, 128, 1, False]
18  -1            1   147712  models.common.Conv                    [128, 128, 3, 2]
19  [-1, 14]      1        0  models.common.Concat                  [1]
20  -1            1   296448  models.common.C3                      [256, 256, 1, False]
21  -1            1   590336  models.common.Conv                    [256, 256, 3, 2]
22  [-1, 10]      1        0  models.common.Concat                  [1]
23  -1            1  1182720  models.common.C3                      [512, 512, 1, False]
24  [17, 20, 23]  1   229245  models.yolo.Detect                    [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.6 GFLOPs
Transferred 349/349 items from yolov5s.pt
AMP: checks passed ✅
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/usr/src/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
val: Scanning '/usr/src/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
Plotting labels to runs/train/exp16/labels.jpg...
AutoAnchor: 4.26 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 4 dataloader workers
Logging results to runs/train/exp16
Starting training for 3 epochs...
3 epochs completed in 0.016 hours.
Optimizer stripped from runs/train/exp16/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp16/weights/best.pt, 14.8MB
Validating runs/train/exp16/weights/best.pt...
Fusing layers...
Model summary: 213 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 32/32 [00:05<00:00, 5.65it/s]
all              128  929  0.749  0.617  0.717   0.474
person           128  254  0.892  0.686  0.804   0.509
bicycle          128    6  0.565  0.232  0.725   0.34
car              128   46  0.857  0.326  0.537   0.246
motorcycle       128    5  0.59   0.8    0.803   0.641
airplane         128    6  0.996  1      0.995   0.791
bus              128    7  0.564  0.714  0.825   0.713
train            128    3  1      0.549  0.698   0.474
truck            128   12  0.659  0.333  0.411   0.176
boat             128    6  1      0.319  0.449   0.143
traffic light    128   14  0.737  0.203  0.362   0.214
stop sign        128    2  0.864  1      0.995   0.822
bench            128    9  0.702  0.444  0.581   0.237
bird             128   16  0.903  1      0.995   0.643
cat              128    4  0.826  1      0.995   0.747
dog              128    9  0.77   0.745  0.907   0.644
horse            128    2  0.803  1      0.995   0.747
elephant         128   17  0.971  0.882  0.926   0.698
bear             128    1  0.492  1      0.995   0.995
zebra            128    4  0.883  1      0.995   0.906
giraffe          128    9  0.806  0.778  0.904   0.728
backpack         128    6  0.992  0.5    0.708   0.342
umbrella         128   18  0.88   0.816  0.916   0.478
handbag          128   19  0.724  0.158  0.257   0.134
tie              128    7  0.818  0.647  0.702   0.491
suitcase         128    4  0.867  1      0.995   0.51
frisbee          128    5  0.705  0.8    0.798   0.719
skis             128    1  0.748  1      0.995   0.497
snowboard        128    7  0.811  0.571  0.823   0.541
sports ball      128    6  0.649  0.667  0.667   0.314
kite             128   10  0.563  0.7    0.631   0.279
baseball bat     128    4  0.648  0.5    0.538   0.223
baseball glove   128    7  0.761  0.429  0.478   0.311
skateboard       128    5  0.706  0.6    0.659   0.444
tennis racket    128    7  0.79   0.543  0.587   0.319
bottle           128   18  0.672  0.389  0.536   0.294
wine glass       128   16  0.67   0.761  0.801   0.468
cup              128   36  0.859  0.509  0.777   0.506
fork             128    6  1      0.323  0.412   0.296
knife            128   16  0.78   0.688  0.754   0.424
spoon            128   22  0.649  0.409  0.562   0.299
bowl             128   28  0.855  0.631  0.705   0.505
banana           128    1  0.893  1      0.995   0.111
sandwich         128    2  0      0      0.19    0.172
orange           128    4  1      0.44   0.995   0.578
broccoli         128   11  0.352  0.455  0.402   0.306
carrot           128   24  0.685  0.542  0.703   0.492
hot dog          128    2  0.404  1      0.995   0.895
pizza            128    5  1      0.79   0.878   0.66
donut            128   14  0.663  1      0.957   0.814
cake             128    4  0.878  1      0.995   0.785
chair            128   35  0.536  0.6    0.558   0.276
couch            128    6  1      0.63   0.822   0.53
potted plant     128   14  0.752  0.649  0.775   0.521
bed              128    3  1      0      0.665   0.387
dining table     128   13  0.84   0.405  0.599   0.37
toilet           128    2  0.832  1      0.995   0.846
tv               128    2  0.743  1      0.995   0.796
laptop           128    3  1      0      0.747   0.328
mouse            128    2  1      0      0.0439  0.0219
remote           128    8  0.833  0.625  0.607   0.466
cell phone       128    8  0.576  0.25   0.363   0.188
microwave        128    3  0.714  1      0.995   0.699
oven             128    5  0.232  0.4    0.461   0.276
sink             128    6  0.186  0.167  0.328   0.24
refrigerator     128    5  0.674  0.8    0.808   0.513
book             128   29  0.559  0.241  0.323   0.145
clock            128    9  0.772  0.889  0.923   0.655
vase             128    2  0.324  1      0.828   0.745
scissors         128    1  1      0      0.124   0.0124
teddy bear       128   21  0.778  0.5    0.696   0.466
toothbrush       128    5  0.902  0.8    0.938   0.534
Results saved to runs/train/exp16
Destroying process group...
root@94ceb79c1cd3:/usr/src/app#
Environment
Minimal Reproducible Example
python3 -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size [n] --epochs 3 --img 640 --data coco128.yaml --weights yolov5s.pt
Additional
Hi, as stated in the title, I've got a problem with the --batch-size option. Instead of changing the batch size, it changes the number of dataloader workers allocated. Changing --batch-size [n] is followed by a matching change in the log output: Using [n] dataloader workers. During training, the CPU load also changes accordingly. I've tried reinstalling everything, but without success: same issue. So in short, I have no way to change the batch size and therefore the amount of VRAM used by the GPUs. I hope that's enough info for you to help me :-)
Are you willing to submit a PR?