ultralytics/yolov5: YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com

Hyperparameter Evolution #607

Open glenn-jocher opened 4 years ago

glenn-jocher commented 4 years ago

📚 This guide explains hyperparameter evolution for YOLOv5 🚀. Hyperparameter evolution is a method of hyperparameter optimization using a Genetic Algorithm (GA). UPDATED 28 March 2023.

Hyperparameters in ML control various aspects of training, and finding optimal values for them can be a challenge. Traditional methods like grid search can quickly become intractable due to 1) the high-dimensional search space, 2) unknown correlations among the dimensions, and 3) the expensive nature of evaluating fitness at each point, making GAs a suitable candidate for hyperparameter searches.

Before You Start

Clone repo and install requirements.txt in a Python>=3.7.0 environment, including PyTorch>=1.7. Models and datasets download automatically from the latest YOLOv5 release.

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

1. Initialize Hyperparameters

YOLOv5 has about 30 hyperparameters used for various training settings. These are defined in *.yaml files in the data/hyps directory. Better initial guesses will produce better final results, so it is important to initialize these values properly before evolving. If in doubt, simply use the default values, which are optimized for YOLOv5 COCO training from scratch.

https://github.com/ultralytics/yolov5/blob/2da2466168116a9fa81f4acab744dc9fe8f90cac/data/hyps/hyp.scratch-low.yaml#L2-L34
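
To start from a specific file rather than the defaults, point train.py at it with the --hyp flag, e.g. (path shown is the file linked above):

python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --hyp data/hyps/hyp.scratch-low.yaml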

2. Define Fitness

Fitness is the value we seek to maximize. In YOLOv5 we define a default fitness function as a weighted combination of metrics: mAP@0.5 contributes 10% of the weight and mAP@0.5:0.95 contributes the remaining 90%, with Precision P and Recall R absent. You may adjust these as you see fit or use the default fitness definition (recommended).

https://github.com/ultralytics/yolov5/blob/4103ce9ad0393cc27f6c80457894ad7be0cb1f0d/utils/metrics.py#L12-L16
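
For reference, the linked definition reduces to a sketch like this, where x is an array whose first four columns are [P, R, mAP@0.5, mAP@0.5:0.95]:

import numpy as np

def fitness(x):
    # weighted combination of [P, R, mAP@0.5, mAP@0.5:0.95]
    w = [0.0, 0.0, 0.1, 0.9]
    return (x[:, :4] * w).sum(1)

# example: one row of validation metrics
print(fitness(np.array([[0.546, 0.556, 0.582, 0.337]])))  # -> [0.3615]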

3. Evolve

Evolution is performed around a base scenario which we seek to improve upon. The base scenario in this example is fine-tuning COCO128 for 10 epochs using pretrained YOLOv5s. The base scenario training command is:

python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache

To evolve hyperparameters specific to this scenario, starting from our initial values defined in Section 1 and maximizing the fitness defined in Section 2, append --evolve:

# Single-GPU
python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache --evolve

# Multi-GPU
for i in 0 1 2 3 4 5 6 7; do
  sleep $(expr 30 \* $i) &&  # stagger starts 30 s apart (optional)
  echo 'Starting GPU '$i'...' &&
  nohup python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache --device $i --evolve > evolve_gpu_$i.log &
done

# Multi-GPU bash-while (not recommended)
for i in 0 1 2 3 4 5 6 7; do
  sleep $(expr 30 \* $i) &&  # stagger starts 30 s apart (optional)
  echo 'Starting GPU '$i'...' &&
  "$(while true; do nohup python train.py... --device $i --evolve 1 > evolve_gpu_$i.log; done)" &
done

The default evolution settings will run the base scenario 300 times, i.e. for 300 generations. You can modify the number of generations via the --evolve argument, i.e. python train.py --evolve 1000.

https://github.com/ultralytics/yolov5/blob/6a3ee7cf03efb17fbffde0e68b1a854e80fe3213/train.py#L608

The main genetic operators are crossover and mutation. In this work mutation is used, with an 80% probability and a 0.04 variance, to create new offspring based on a combination of the best parents from all previous generations. Results are logged to runs/evolve/exp/evolve.csv, and the highest-fitness offspring is saved every generation as runs/evolve/hyp_evolved.yaml:

# YOLOv5 Hyperparameter Evolution Results
# Best generation: 287
# Last generation: 300
#    metrics/precision,       metrics/recall,      metrics/mAP_0.5, metrics/mAP_0.5:0.95,         val/box_loss,         val/obj_loss,         val/cls_loss
#              0.54634,              0.55625,              0.58201,              0.33665,             0.056451,             0.042892,             0.013441

lr0: 0.01  # initial learning rate (SGD=1E-2, Adam=1E-3)
lrf: 0.2  # final OneCycleLR learning rate (lr0 * lrf)
momentum: 0.937  # SGD momentum/Adam beta1
weight_decay: 0.0005  # optimizer weight decay 5e-4
warmup_epochs: 3.0  # warmup epochs (fractions ok)
warmup_momentum: 0.8  # warmup initial momentum
warmup_bias_lr: 0.1  # warmup initial bias lr
box: 0.05  # box loss gain
cls: 0.5  # cls loss gain
cls_pw: 1.0  # cls BCELoss positive_weight
obj: 1.0  # obj loss gain (scale with pixels)
obj_pw: 1.0  # obj BCELoss positive_weight
iou_t: 0.20  # IoU training threshold
anchor_t: 4.0  # anchor-multiple threshold
# anchors: 3  # anchors per output layer (0 to ignore)
fl_gamma: 0.0  # focal loss gamma (efficientDet default gamma=1.5)
hsv_h: 0.015  # image HSV-Hue augmentation (fraction)
hsv_s: 0.7  # image HSV-Saturation augmentation (fraction)
hsv_v: 0.4  # image HSV-Value augmentation (fraction)
degrees: 0.0  # image rotation (+/- deg)
translate: 0.1  # image translation (+/- fraction)
scale: 0.5  # image scale (+/- gain)
shear: 0.0  # image shear (+/- deg)
perspective: 0.0  # image perspective (+/- fraction), range 0-0.001
flipud: 0.0  # image flip up-down (probability)
fliplr: 0.5  # image flip left-right (probability)
mosaic: 1.0  # image mosaic (probability)
mixup: 0.0  # image mixup (probability)
copy_paste: 0.0  # segment copy-paste (probability)

We recommend a minimum of 300 generations of evolution for best results. Note that evolution is generally expensive and time consuming, as the base scenario is trained hundreds of times, possibly requiring hundreds or thousands of GPU hours.
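
As a rough sketch of the mutation step described above (not verbatim train.py; the per-parameter gains g come from the meta dictionary, and sigma 0.2 corresponds to the 0.04 variance):

import numpy as np

mp, s = 0.8, 0.2             # mutation probability, sigma (variance = s**2 = 0.04)
ng = 29                      # number of hyperparameters
g = np.ones(ng)              # per-parameter gains from the meta dict (0 disables mutation)
parent = np.random.rand(ng)  # hypothetical parent hyperparameter vector

v = np.ones(ng)
while all(v == 1):           # re-draw until at least one parameter actually changes
    v = (g * (np.random.random(ng) < mp) * np.random.randn(ng)
         * np.random.random() * s + 1).clip(0.3, 3.0)
child = parent * v           # offspring: each value scaled by a 0.3-3.0x gain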

4. Visualize

After evolution finishes, evolve.csv is plotted as evolve.png by utils.plots.plot_evolve(), with one subplot per hyperparameter showing fitness (y-axis) vs hyperparameter value (x-axis). Yellow indicates higher concentrations. Vertical distributions indicate that a parameter has been disabled and does not mutate. Mutability is user-selectable in the meta dictionary in train.py, and is useful for fixing parameters and preventing them from evolving.
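
The plot can also be regenerated manually (a minimal sketch; the csv path assumes the default experiment directory):

from utils.plots import plot_evolve

plot_evolve('runs/evolve/exp/evolve.csv')  # writes evolve.png next to the csv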

[evolve.png: one subplot per hyperparameter, fitness vs hyperparameter value]

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 3 years ago

@satyajitghana evolution does not produce weights, it evolves hyperparameters.

Samjith888 commented 3 years ago

I trained with --evolve on multi-GPU:

# Multi-GPU
for i in 0 1 2 3; do
  nohup python train.py --img 640 --batch 32 --workers 8 --multi-scale --epochs 10 --data dataset.yaml --single-cls --weights yolov5s.pt --evolve --cache --device $i 2>&1 | tee evolve_gpu_$i.log &
done

but I didn't get any weights; the save folder train/evolve/weights was empty.

I'm getting a key error while using the same command.

satyajitghana commented 3 years ago

@satyajitghana evolution does not produce weights, it evolves hyperparameters.

aaah okay, makes sense. cool.

satyajitghana commented 3 years ago

@Samjith888 was using a custom dataset, try with the included coco128.yaml

But it turns out evolution wasn't meant to save weights, so I guess it's okay; I got my hyperparameters though.

Samjith888 commented 3 years ago

@Samjith888 was using a custom dataset, try with the included coco128.yaml

But it turns out evolution wasn't meant to save weights, so I guess it's okay; I got my hyperparameters though.

I'm using a custom dataset, not coco128.yaml. I already posted the error earlier here, but it disappeared somehow.

frederikvanduuren commented 3 years ago

Hi All,

I commented out this line: # 'anchors': (0, 2.0, 10.0),  # anchors per output grid (0 to ignore)

but still getting this error:

Traceback (most recent call last):
  File "train.py", line 561, in
    hyp[k] = max(hyp[k], v[1])  # lower limit
KeyError: 'anchors'

Any ideas? A fix? Thanks, Frederik

glenn-jocher commented 3 years ago

@frederikvanduuren for hyperparameter evolution this line should be uncommented, and set to 0 to ignore, or to a standard anchor count (i.e. 3) to evolve anchor count.
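
For example, in your hyp *.yaml file:

anchors: 3  # anchors per output layer (0 to ignore)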

TommyZihao commented 3 years ago

How should I choose the number of epochs? In this tutorial it is 10, but I have my own custom dataset and the baseline model needs 75 epochs to converge. Should I set --evolve --epochs 75?

frederikvanduuren commented 3 years ago

@TommyZihao Hi Tommy, I set the epochs to 100, but you should check how many you need for the mAP and recall to converge to a maximum level.

TommyZihao commented 3 years ago

@TommyZihao Hi Tommy, I set the epochs to 100, but you should check how many you need for the mAP and recall to converge to a maximum level.

Would 100 epochs take a super long time?

frederikvanduuren commented 3 years ago

Yes... So?


wwdok commented 3 years ago

Hi @glenn-jocher, I used the command python train.py --img 640 --batch 16 --epochs 300 --data mydataset.yaml --weights yolov5l.pt --evolve --cache to try hyperparameter evolution, but after more than 100 epochs my yolov5/runs/train/evolve folder has only the following files; there is no yolov5/runs/evolve/hyp_evolved.yaml, and the weights folder is empty too.

Meanwhile, in the yolov5 root folder there is no yolov5/evolve.txt and no yolov5/evolve.png. Do you have any idea what the possible cause might be? Thanks!

frederikvanduuren commented 3 years ago

You need patience...


glenn-jocher commented 3 years ago

@wwdok you have not finished training a single generation. < 1 generation will not produce any evolution output.

wwdok commented 3 years ago

@frederikvanduuren @glenn-jocher I got it! In my case 300 epochs is 1 generation. By default it needs 1 generation to output yolov5/runs/evolve/hyp_evolved.yaml, and 300 generations (that means 90,000 epochs) to output yolov5/evolve.png.

glenn-jocher commented 3 years ago

@wwdok yes exactly! For this reason it may make sense to use a different base scenario that produces a fitness faster (i.e. perhaps only train to 100 epochs). But be careful, because you want your base scenario results to correlate strongly with your actual underlying scenario (training 300 epochs), so as you reduce your epoch count to zero the correlation with 300-epoch results will also reduce to zero.

In layman's terms, hyperparameters that help you achieve the best results over short trainings (i.e. in 10 epochs), will not be the same ones that help you achieve the best results at 300 epochs. At 10 epochs things like weight decay don't matter for example, so evolving on short trainings will cause your weight decay to evolve down to zero, which will cause earlier overfitting and worse results at 300 epochs. It's a balancing act each person has to decide on.

thhart commented 3 years ago

@wwdok yes exactly! For this reason it may make sense to use a different base scenario that produces a fitness faster (i.e. perhaps only train to 100 epochs). But be careful, because you want your base scenario results to correlate strongly with your actual underlying scenario (training 300 epochs), so as you reduce your epoch count to zero the correlation with 300-epoch results will also reduce to zero.

@glenn-jocher Can we clarify this a bit please? I read it like this: 10 epochs for evolving will result in 3000 total epochs, generating 300 generations of hyperparameters. Or is it really 90,000 as @wwdok stated (which would be ridiculous time-wise)? Further, I understand any img-size parameters will not be taken into account, but could we evolve with a smaller-resolution set, perhaps to speed it up a bit? If so, how can this be achieved other than by modifying the dataset?

glenn-jocher commented 3 years ago

@thhart it's very simple. Your base scenario is run for n generations. Your base scenario is what you are optimizing; its epoch count is up to you.

thhart commented 3 years ago

@thhart it's very simple. Your base scenario is run for n generations. Your base scenario is what you are optimizing; its epoch count is up to you.

Simple is in the eye of the beholder; however, IMHO it is not obvious how to configure the number of generations being calculated. So when is this number hit when I use --epochs 10 as a parameter?

glenn-jocher commented 3 years ago

@thhart I think the number you're looking for is just epochs * generations, e.g. --epochs 10 with the default 300 generations gives 3000 total epochs.

LinusJ79 commented 3 years ago

Running --evolve, after the first 10 epochs (the first run out of 300 or whatever it is) I get the following error:

anchor_t=4, anchors=3, box=0.05, cls=0.5, cls_pw=1, degrees=0, fl_gamma=0, fliplr=0, flipud=0, giou=0.05, hsv_h=0.014, hsv_s=0.68, hsv_v=0.36, iou_t=0.2, lr0=0.01, lrf=0.2, mixup=0, momentum=0.937, mosaic=1, obj=1, obj_pw=1, perspective=0.001, scale=0.5, shear=0, translate=0, warmup_bias_lr=0.1, warmup_epochs=3, warmup_momentum=0.8, weight_decay=0.0005

Evolved fitness: 0.4112 0.2054 0.2402 0.08847 0.04911 0.02177 0.02316

Traceback (most recent call last):
  File "train.py", line 578, in
    hyp[k] = float(x[i + 7] * v[i])  # mutate
IndexError: index 28 is out of bounds for axis 0 with size 28

Any idea what is wrong?

glenn-jocher commented 3 years ago

@LinusJ79 if you believe you have a reproducible bug please raise a full bug report issue using the bug report template with code to reproduce, thank you!

RobinBram commented 3 years ago

Hi! A few questions regarding evolve @glenn-jocher:

  1. Why haven't fl_gamma, flipud or iou_t changed in your picture?
  2. Why are only 22 of the 28 hyperparameters evolved?
  3. Is there a way to lock certain hyperparameters that I know I want at a certain value?
  4. Is mosaic disabled as usual with --rect?

glenn-jocher commented 3 years ago

@RobinBram

  1. You can disable evolution for any parameters using the meta dictionary in train.py, or by setting their initial values to zero in your hyp.yaml file.
  2. The displayed results are from an earlier version with fewer hyperparameters; we should update this.
  3. Yes, see #1 above, use meta dict.
  4. Yes, --rect causes mosaic to be disabled.

abhiagwl4262 commented 3 years ago

@glenn-jocher If we initialise any hyperparameter with 0 and the minimum value is also 0.0, then it doesn't evolve, because of:

hyp[k] = max(hyp[k], v[1])  # lower limit
hyp[k] = min(hyp[k], v[2])  # upper limit
hyp[k] = round(hyp[k], 5)  # significant digits

glenn-jocher commented 3 years ago

@abhiagwl4262 yes this is correct. The mutations are gain-based so a zero initial condition will prevent it from mutating.

abhiagwl4262 commented 3 years ago

@glenn-jocher As initial values of the hyperparameters we are using hyp.scratch, which has some parameters initialised to 0, so those parameters are not taking part in mutation. Could you please add a hyp.scratch that sets better initialisations for the hyperparameters?

glenn-jocher commented 3 years ago

@abhiagwl4262 yes I see. You may want to use hyp.finetune.yaml to see if it's a better starting point for evolution.

If you want you can also increase the zero values in hyp.scratch slightly, i.e. to 0.01 or 0.1 to initialize them for evolution.
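
A toy illustration of why gain-based mutation can never move a zero (hypothetical numbers):

v = 1.37          # example mutation gain from the 0.3-3.0 range
print(0.0 * v)    # 0.0 -> a zero-initialized parameter stays at zero
print(0.01 * v)   # 0.0137 -> a small nonzero seed lets it evolve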

NMVRodrigues commented 3 years ago

What is the population size for the GA? It seems it only trains one model; is this correct? If so, what is it performing crossover with?

glenn-jocher commented 3 years ago

@NMVRodrigues yes, that is correct: population size is 1 due to the high expense of each member, so we omit crossover and apply mutation to a randomly selected top-5 member from all previous populations. The implementation is here. If you have ideas for improvement please let us know!

https://github.com/ultralytics/yolov5/blob/ed2c74218d6d46605cc5fa68ce9bd6ece213abe4/train.py#L572-L597
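
Condensed, the parent-selection step in the linked code looks roughly like this (a sketch reusing fitness() from Section 2; the column layout comment is illustrative):

import random
import numpy as np
from utils.metrics import fitness  # the weighted-mAP fitness from Section 2

x = np.loadtxt('evolve.txt', ndmin=2)      # past generations: metrics + hyperparameters per row
n = min(5, len(x))                         # consider up to the top-5 previous members
x = x[np.argsort(-fitness(x))][:n]         # sort by fitness, keep the best n
w = fitness(x) - fitness(x).min() + 1e-6   # positive selection weights
parent = x[random.choices(range(n), weights=w)[0]]  # fitness-weighted random pick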

NMVRodrigues commented 3 years ago

@glenn-jocher Well, in this scenario that would not ensure the exploitation of optimal solutions. There are other mutation operators that could be added, or could simply replace the standard one, to improve this scenario without any significant complexity/expense increase. I would be glad to try implementing them to help improve this feature!

glenn-jocher commented 3 years ago

@NMVRodrigues yes, this is a challenge as we have a unique problem set that is not quite suited to the off-the-shelf GA methods, mainly due to the very high evaluation cost of a single population member, so the current implementation is the best compromise we found. Feel free to submit PRs with any updates to the evolution code in train.py if you see spots for improvement!

zzttqu commented 3 years ago

I have a problem when I evolve the hyperparameters: there is no model.pt in the evolve folder, and wandb raised an error. I think it is because the weights weren't saved but were transferred to wandb; in my code it is at line 445. I changed the nosave option to avoid this error, but then it can't evolve. If I disable wandb it runs correctly. Can you fix this? Thanks in advance.

glenn-jocher commented 3 years ago

@zzttqu thanks for the info! Evolution will not save any checkpoints (for speed). If you believe you have a reproducible bug, please raise a new issue using the 🐛 Bug Report template, providing screenshots and a minimum reproducible example to help us better understand and diagnose your problem. Thank you! @AyushExel seems like --evolve with wandb may have a problem.

AyushExel commented 3 years ago

@glenn-jocher okay I'll try to reproduce this and fix it after our meeting. Can we also include a CI test for evolve to automate this process in the future?

glenn-jocher commented 3 years ago

@AyushExel evolve CI is an interesting idea. I'd need to add an optional argument to --evolve in some way because currently evolution is hard-coded to 300 generations.

AyushExel commented 3 years ago

@glenn-jocher @zzttqu I found the problem. The default behaviour now is to log the final stripped model, but the model is not found in the case of an evolve operation. I'll push a quick fix.

AyushExel commented 3 years ago

This should fix the problem https://github.com/ultralytics/yolov5/pull/2634

glenn-jocher commented 3 years ago

@zzttqu evolve with wandb should be fixed now in #2634, please git pull or clone again to receive this update and let us know if you run into any other problems!

psyjw commented 3 years ago

@glenn-jocher Hi, sorry for bothering you. I have a question about hyperparameter evolution. I first evolved my own model No. 1 (based on YOLOv5s) for 10 generations and trained a new model No. 2 with the new hyperparameters. Then I wanted to continue the evolution. According to what you said in another thread, if evolve.txt exists I just need to run the same command (which evolves on model No. 1). However, when I ran the command, I found that the hyperparameters displayed at the start do not match the hyperparameters in evolve.txt (the ones I used to train model No. 2). I just wonder, is this correct? Sorry if it is trivial; I am new to this area. Thank you for your time and help in advance!

glenn-jocher commented 3 years ago

@psyjw they should not match, you are starting a new generation.

psyjw commented 3 years ago

@glenn-jocher Many thanks! I thought it would display the last generation's hyperparameters. By the way, could you tell me whether there is any difference between continuing evolution on model No. 1 and starting evolution on model No. 2? If I evolve model No. 2, does it still read the values from evolve.txt?

glenn-jocher commented 3 years ago

The evolution scenario is entirely up to you. evolve.txt is the sole source used.

youyuxiansen commented 3 years ago

What is the mutation formula corresponding to the code below? @glenn-jocher

https://github.com/ultralytics/yolov3/blob/26cb451811b7aca5ddd069d03167c1db9b711a6b/train.py#L606

glenn-jocher commented 3 years ago

@youyuxiansen this is a probabilistic mutation equation bounded at upper and lower limits, prototyped using empirical results. There is no documentation other than the actual equation. To understand it better you can simply use it to generate a population of values and visualize the histogram of the population.
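
For example, a quick way to do that (a sketch; the real code draws a single np.random.random() scalar per generation, while this draws one per sample to fill out the histogram):

import matplotlib.pyplot as plt
import numpy as np

mp, s = 0.8, 0.2  # mutation probability and sigma from train.py
n = 100000
v = ((np.random.random(n) < mp) * np.random.randn(n) * np.random.random(n) * s + 1).clip(0.3, 3.0)
plt.hist(v, 300)
plt.xlabel('mutation gain v')
plt.ylabel('count')
plt.show()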

youyuxiansen commented 3 years ago

@youyuxiansen this is a probabilistic mutation equation bounded at upper and lower limits, prototyped using empirical results. There is no documentation other than the actual equation. To understand it better you can simply use it to generate a population of values and visualize the histogram of the population.

@glenn-jocher Thank you for such a timely response! I still have some questions. Can you explain the meaning of "mutation probability" and "sigma"? Why were they chosen as 0.8 and 0.2? And why is v clipped to (0.3, 3.0)? I guess you must have experimented with these, right? I'm interested in learning about the inspiration! I would be grateful if you could talk about the details. Thanks!

glenn-jocher commented 3 years ago

@youyuxiansen all parameters above are based upon empirical results of YOLOv5 evolution experimentation with COCO

youyuxiansen commented 3 years ago

@youyuxiansen all parameters above are based upon empirical results of YOLOv5 evolution experimentation with COCO

Got it๏ผThanks.

OrjwanZaafarani commented 3 years ago

I'm trying to run --evolve on 2 GPUs but the process is stuck at nohup: redirecting stderr to stdout

This is the shell script that I ran:

#!/bin/bash
for i in 1 2; do
  nohup python train.py --img 640 --batch 16 --epochs 100 --data QMUL.yaml --weights yolov5m.pt --cache --evolve --device $i > evolve_gpu_$i.log &
done

Am I doing something wrong?

glenn-jocher commented 3 years ago

@OrjwanZaafarani you can always try the command without the nohup if it's causing problems, or you can redirect to /dev/null:

In an IPython console:

# YOLOv5m6 evolve on COCO
for i in [0, 1, 2, 3, 4, 5, 6, 7]:
  !python train.py --batch 32 --weights '' --cfg yolov5m6.yaml --data coco.yaml --epochs 300 --img 640 --hyp hyp.scratch-p6-evolve.yaml --evolve --device {i} > /dev/null 2>&1 &