ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Training command --batch-size is changing the workers-size instead of batch-size #8500

Closed: Idefix0496 closed this issue 2 years ago

Idefix0496 commented 2 years ago

Search before asking

YOLOv5 Component

Training

Bug

root@94ceb79c1cd3:/usr/src/app# python3 -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 4 --epochs 3 --img 640 --data coco128.yaml --weights yolov5s.pt
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=3, batch_size=4, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 c768919 Python-3.8.13 torch-1.12.0+cu113 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11264MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

             from  n    params  module                                  arguments

  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.6 GFLOPs

Transferred 349/349 items from yolov5s.pt
AMP: checks passed ✅
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/usr/src/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
val: Scanning '/usr/src/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
Plotting labels to runs/train/exp16/labels.jpg...

AutoAnchor: 4.26 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 4 dataloader workers
Logging results to runs/train/exp16
Starting training for 3 epochs...

 Epoch   gpu_mem       box       obj       cls    labels  img_size
   0/2     4.74G   0.04671   0.09375   0.04592        27       640:   6%|▋         | 2/32 [00:05<01:15,  2.51s/it]
Reducer buckets have been rebuilt in this iteration.
   0/2     4.75G   0.05006   0.07357   0.03664        16       640: 100%|██████████| 32/32 [00:14<00:00,  2.23it/s]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 32/32 [00:13<00:00,  2.39it/s]
             all        128        929      0.749      0.617      0.717      0.475

 Epoch   gpu_mem       box       obj       cls    labels  img_size
   1/2     4.75G   0.05107   0.08029   0.03365        22       640: 100%|██████████| 32/32 [00:08<00:00,  3.92it/s]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 32/32 [00:04<00:00,  6.46it/s]
             all        128        929      0.729      0.581      0.674       0.42

 Epoch   gpu_mem       box       obj       cls    labels  img_size
   2/2     4.75G   0.04708    0.0727   0.03571        20       640: 100%|██████████| 32/32 [00:08<00:00,  3.97it/s]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 32/32 [00:04<00:00,  6.63it/s]
             all        128        929      0.738      0.637      0.712      0.455

3 epochs completed in 0.016 hours.
Optimizer stripped from runs/train/exp16/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp16/weights/best.pt, 14.8MB

Validating runs/train/exp16/weights/best.pt...
Fusing layers...
Model summary: 213 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
Class  Images  Labels  P  R  mAP@.5  mAP@.5:.95: 100%|██████████| 32/32 [00:05<00:00, 5.65it/s]
all  128  929  0.749  0.617  0.717  0.474
person  128  254  0.892  0.686  0.804  0.509
bicycle  128  6  0.565  0.232  0.725  0.34
car  128  46  0.857  0.326  0.537  0.246
motorcycle  128  5  0.59  0.8  0.803  0.641
airplane  128  6  0.996  1  0.995  0.791
bus  128  7  0.564  0.714  0.825  0.713
train  128  3  1  0.549  0.698  0.474
truck  128  12  0.659  0.333  0.411  0.176
boat  128  6  1  0.319  0.449  0.143
traffic light  128  14  0.737  0.203  0.362  0.214
stop sign  128  2  0.864  1  0.995  0.822
bench  128  9  0.702  0.444  0.581  0.237
bird  128  16  0.903  1  0.995  0.643
cat  128  4  0.826  1  0.995  0.747
dog  128  9  0.77  0.745  0.907  0.644
horse  128  2  0.803  1  0.995  0.747
elephant  128  17  0.971  0.882  0.926  0.698
bear  128  1  0.492  1  0.995  0.995
zebra  128  4  0.883  1  0.995  0.906
giraffe  128  9  0.806  0.778  0.904  0.728
backpack  128  6  0.992  0.5  0.708  0.342
umbrella  128  18  0.88  0.816  0.916  0.478
handbag  128  19  0.724  0.158  0.257  0.134
tie  128  7  0.818  0.647  0.702  0.491
suitcase  128  4  0.867  1  0.995  0.51
frisbee  128  5  0.705  0.8  0.798  0.719
skis  128  1  0.748  1  0.995  0.497
snowboard  128  7  0.811  0.571  0.823  0.541
sports ball  128  6  0.649  0.667  0.667  0.314
kite  128  10  0.563  0.7  0.631  0.279
baseball bat  128  4  0.648  0.5  0.538  0.223
baseball glove  128  7  0.761  0.429  0.478  0.311
skateboard  128  5  0.706  0.6  0.659  0.444
tennis racket  128  7  0.79  0.543  0.587  0.319
bottle  128  18  0.672  0.389  0.536  0.294
wine glass  128  16  0.67  0.761  0.801  0.468
cup  128  36  0.859  0.509  0.777  0.506
fork  128  6  1  0.323  0.412  0.296
knife  128  16  0.78  0.688  0.754  0.424
spoon  128  22  0.649  0.409  0.562  0.299
bowl  128  28  0.855  0.631  0.705  0.505
banana  128  1  0.893  1  0.995  0.111
sandwich  128  2  0  0  0.19  0.172
orange  128  4  1  0.44  0.995  0.578
broccoli  128  11  0.352  0.455  0.402  0.306
carrot  128  24  0.685  0.542  0.703  0.492
hot dog  128  2  0.404  1  0.995  0.895
pizza  128  5  1  0.79  0.878  0.66
donut  128  14  0.663  1  0.957  0.814
cake  128  4  0.878  1  0.995  0.785
chair  128  35  0.536  0.6  0.558  0.276
couch  128  6  1  0.63  0.822  0.53
potted plant  128  14  0.752  0.649  0.775  0.521
bed  128  3  1  0  0.665  0.387
dining table  128  13  0.84  0.405  0.599  0.37
toilet  128  2  0.832  1  0.995  0.846
tv  128  2  0.743  1  0.995  0.796
laptop  128  3  1  0  0.747  0.328
mouse  128  2  1  0  0.0439  0.0219
remote  128  8  0.833  0.625  0.607  0.466
cell phone  128  8  0.576  0.25  0.363  0.188
microwave  128  3  0.714  1  0.995  0.699
oven  128  5  0.232  0.4  0.461  0.276
sink  128  6  0.186  0.167  0.328  0.24
refrigerator  128  5  0.674  0.8  0.808  0.513
book  128  29  0.559  0.241  0.323  0.145
clock  128  9  0.772  0.889  0.923  0.655
vase  128  2  0.324  1  0.828  0.745
scissors  128  1  1  0  0.124  0.0124
teddy bear  128  21  0.778  0.5  0.696  0.466
toothbrush  128  5  0.902  0.8  0.938  0.534
Results saved to runs/train/exp16
Destroying process group...
root@94ceb79c1cd3:/usr/src/app#

Environment

Minimal Reproducible Example

python3 -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size [n] --epochs 3 --img 640 --data coco128.yaml --weights yolov5s.pt
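
For context on the command above, here is a minimal sketch of how the global --batch-size relates to each GPU in DDP mode, assuming YOLOv5's documented behaviour of splitting the total batch evenly across the launched processes (the variable names below are illustrative, not YOLOv5's own):

import os

# torch.distributed.launch starts --nproc_per_node processes (one per GPU)
# and exports WORLD_SIZE to each of them.
WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 2))

total_batch_size = 4  # the --batch-size value passed on the command line

# Assumption: the total batch is divided evenly, so each GPU loads and
# backpropagates this many images per step, which is what drives VRAM usage.
per_rank_batch_size = total_batch_size // WORLD_SIZE
print(f"{WORLD_SIZE} processes -> {per_rank_batch_size} images per GPU per step")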

Additional

Hi, as stated in the title, I've got a problem with the --batch-size argument. Instead of changing the batch size, it changes the number of dataloader workers allocated: whenever I change --batch-size [n], the log output changes to "Using [n] dataloader workers", and the CPU load during training changes accordingly. I've tried reinstalling everything, but the issue remains. In short, I have no way to change the batch size and therefore the amount of VRAM used by the GPUs. I hope that's enough info so you can help me :-)

Are you willing to submit a PR?

github-actions[bot] commented 2 years ago

👋 Hello @Idefix0496, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 2 years ago

@Idefix0496 --batch-size works correctly.

Up to 8 dataloader workers are allowed per RANK. If your batch size is less than 8 per RANK, the worker count is reduced to match it; otherwise there would be excess workers with nothing to load.
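
As a rough sketch of the rule described above (the exact expression lives in YOLOv5's dataloader code, so treat the function and argument names below as an approximation rather than the actual implementation):

import os

def pick_num_workers(per_rank_batch_size, world_size=2, requested_workers=8):
    # Never spawn more workers than: CPU cores available to this rank,
    # images in the per-rank batch, or the --workers setting (default 8).
    cpus_per_rank = max((os.cpu_count() or 1) // world_size, 1)
    return min(cpus_per_rank, per_rank_batch_size, requested_workers)

# --batch-size 4 on 2 GPUs -> 2 images per rank -> at most 2 workers per rank.
print(pick_num_workers(per_rank_batch_size=2))

If the "Using 4 dataloader workers" line in the log reports the total across both ranks (an assumption here), 2 workers per rank on 2 GPUs would explain why the printed number happens to track the --batch-size value rather than being controlled by it.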

github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Access additional Ultralytics ⚡ resources:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!