Open glenn-jocher opened 4 years ago
If I set the generations to 300 and start training for 100 epochs, how many times will it train?
@OrjwanZaafarani generations indicates how many times your training loop will run. 300 generations means your model will train 300 times.
@glenn-jocher So I'm running it for 30000 epochs haha. Thanks.
@OrjwanZaafarani no not really, you are repeating your base scenario 300 times.
Each scenario can be anything you want, in your case you have a 100 epoch training, but each training is independent, so you are not training a single model for 30000 epochs.
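To make the distinction concrete, here is a small arithmetic sketch. The helper name is made up for illustration and is not part of YOLOv5. It shows that 300 generations of a 100-epoch base scenario cost 30000 epochs of compute in total, while no single model is ever trained for more than 100 epochs.

```python
def evolution_budget(generations: int, epochs_per_generation: int) -> dict:
    """Summarize the compute implied by an evolution run (illustrative helper)."""
    return {
        # each generation is one independent training of the base scenario
        "independent_trainings": generations,
        # no model is trained for longer than this
        "epochs_per_training": epochs_per_generation,
        # total compute spent across all generations
        "total_epochs_of_compute": generations * epochs_per_generation,
    }


budget = evolution_budget(generations=300, epochs_per_generation=100)
print(budget)
```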
What number of epochs should I specify?
@amfogor evolution base scenario is completely up to you.
Is there a known example of applying Weights & Biases sweeps to the Hyperparameter?
@aficionadoai yes: https://wandb.ai/glenn-jocher/COCO128_evolve/sweeps/f9d7fyj2
I am trying out evolve from a base training which has run for 200 epochs (yolov5s, img size 1280, using hyp scratch).
I have not seen an evolve.txt generated for me yet, but this is the result.txt inside the evolve folder. It looks like the training loss is increasing. Is that normal? If it's not, I would rather stop the trial now.
Copy of segment of result.txt from evolve folder:
190/999 3.77G 0.04028 0.1747 0.01328 0.2282 746 1280 0.5444 0.4805 0.4781 0.3003 0.03869 0.263 0.01426
191/999 3.77G 0.04043 0.1773 0.01338 0.2311 831 1280 0.5866 0.4537 0.4782 0.3004 0.03869 0.263 0.01426
192/999 3.77G 0.04025 0.175 0.01342 0.2287 682 1280 0.5439 0.4813 0.4784 0.3004 0.03869 0.263 0.01426
193/999 3.77G 0.04036 0.1753 0.01327 0.2289 751 1280 0.5425 0.4818 0.4783 0.3005 0.03869 0.263 0.01426
194/999 3.77G 0.04027 0.174 0.01322 0.2275 1290 1280 0.585 0.4542 0.4784 0.3005 0.03869 0.263 0.01425
195/999 3.77G 0.04024 0.174 0.01332 0.2276 781 1280 0.5638 0.4661 0.4783 0.3005 0.03869 0.263 0.01425
196/999 3.77G 0.04031 0.1743 0.01337 0.2279 611 1280 0.5864 0.4532 0.4785 0.3006 0.0387 0.263 0.01425
197/999 3.77G 0.04029 0.1786 0.0131 0.232 1069 1280 0.5871 0.4522 0.4786 0.3004 0.0387 0.263 0.01425
198/999 3.77G 0.04019 0.1763 0.01329 0.2298 769 1280 0.5884 0.4525 0.4786 0.3005 0.0387 0.263 0.01425
199/999 3.77G 0.04023 0.1759 0.01328 0.2294 990 1280 0.5868 0.453 0.4785 0.3004 0.0387 0.263 0.01425
200/999 3.77G 0.04026 0.1742 0.01322 0.2277 847 1280 0.5831 0.4539 0.4785 0.3005 0.0387 0.263 0.01425
201/999 3.77G 0.04016 0.1773 0.01332 0.2308 726 1280 0.585 0.4537 0.4785 0.3006 0.0387 0.263 0.01425
.... .... ....
660/999 4.37G 0.05093 0.2059 0.01411 0.271 940 1280 0 0 0 0 0 0 0
661/999 4.37G 0.05132 0.2052 0.0141 0.2706 1018 1280 0 0 0 0 0 0 0
662/999 4.37G 0.05131 0.2082 0.01401 0.2736 947 1280 0 0 0 0 0 0 0
663/999 4.37G 0.05148 0.206 0.01391 0.2713 950 1280 0 0 0 0 0 0 0
664/999 4.37G 0.05125 0.2093 0.01401 0.2745 946 1280 0 0 0 0 0 0 0
665/999 4.37G 0.05129 0.2075 0.01394 0.2728 1075 1280 0 0 0 0 0 0 0
666/999 4.37G 0.05147 0.2081 0.01396 0.2735 857 1280 0 0 0 0 0 0 0
667/999 4.37G 0.05124 0.2089 0.01399 0.2742 1112 1280 0 0 0 0 0 0 0
668/999 4.37G 0.05125 0.2081 0.01401 0.2733 849 1280 0 0 0 0 0 0 0
669/999 4.37G 0.05126 0.2059 0.01387 0.2711 989 1280 0 0 0 0 0 0 0
670/999 4.37G 0.05124 0.2085 0.01392 0.2736 963 1280 0 0 0 0 0 0 0
671/999 4.37G 0.05125 0.2057 0.01394 0.2709 706 1280 0 0 0 0 0 0 0
672/999 4.37G 0.05114 0.2065 0.01378 0.2714 823 1280 0 0 0 0 0 0 0
673/999 4.37G 0.05146 0.2087 0.01396 0.2741 843 1280 0 0 0 0 0 0 0
674/999 4.37G 0.0515 0.2088 0.01403 0.2743 834 1280 0 0 0 0 0 0 0
@shang0085 evolve.txt will be generated once the first generation has completed.
Yeah, my understanding is that 300 generations is the default. I have seen that my evolve has run 300 epochs, starting off from a base of 200 epochs? So in my case 300 generations would not mean 300 epochs? What would happen if it reaches the end of the base training number, which in my case I set to 1000 epochs? Would it continue to evolve and ignore the 1000 limit?
@shang0085 a generation is 1 training. A training is whatever you decide.
@glenn-jocher Hello, how can I go back to an unfinished evolve process, e.g. a full 300-generation evolve that unfortunately ended at the 200th generation for various reasons?
@billie7 to resume evolution you simply re-run the same command, and evolution will start from an evolve.txt if it finds it.
@glenn-jocher I changed some parser defaults in train.py, including "--weights", "--cfg", "--data", "--hyp" and "--epochs", and my command is [python train.py --evolve]. Every time I just re-run this command to resume, but the generations went to 350 and did not seem to stop, even though the default number of generations is 300, which I didn't change. I can find evolve.txt, and it has 350 lines XD
@billie7 default generations is 300 which you can modify as you see fit, i.e. python train.py --evolve 100
@glenn-jocher Yes, I did not modify the default generations, but this evolution process did not stop at its 300th generation. Should I link the hyp.yaml in train.py to runs/train/evolve/hyp_evolved.yaml?
@billie7 yes that's because the evolution command will run 300 generations by default. If it finds an evolve.txt it will start from there.
Thanks a lot, that was a stupid mistake on my part.
Where can I find yolov5/evolve.png? I can't find it after evolve. Also, where can I find the visualization images?
How do I change crossover and mutation?
@besbesmany evolve.txt is plotted as evolve.png after evolution completes. The console printout is very clear I would say:
All evolution code is inside train.py
I have a fairly large dataset (900+ classes) and evolution is a bit out of my price range. I was curious if anyone had luck evolving on a subset of data? I know many model parameters won't transfer as they are dependent on the training set size, but it seems certain parameters, such as image augmentation, may work across models.
If we can identify such hyperparameters, would it be of value to train on multiple subsets and use an ensemble to further generalize outcomes?
@Bellk17 that's an interesting idea! I think most people take shortcuts in the epochs dimension rather than the dataset dimension, i.e. evolving COCO on < 300 epochs rather than 300 epochs, but using a subset of the dataset might work better.
The compute-saving test would be if evolving on 10% of your dataset converges faster and/or correlates better with full dataset results than the same with 10% of epochs. I'm optimistic the answer might be yes, especially for large datasets.
There's a term for statistical subsampling that eludes me right now but I agree with your second point as well. The RANSAC method uses a similar random subsampling approach https://en.wikipedia.org/wiki/Random_sample_consensus
Definitely follow up on this thread if you have more information or results.
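For anyone who wants to try the 10%-of-dataset idea discussed above, here is a minimal sketch of drawing a reproducible random subset of image paths. The function name and file names are hypothetical, and plain random sampling is only one possible choice. With 900+ classes a uniform sample can drop rare classes entirely, so a per-class (stratified) sample may be a better fit.

```python
import random


def sample_subset(image_paths, fraction=0.10, seed=0):
    """Return a reproducible random subset of image paths (illustrative helper)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so every generation sees the same subset
    k = max(1, round(len(image_paths) * fraction))
    return sorted(rng.sample(list(image_paths), k))


# Hypothetical file names, purely for illustration
all_images = [f"images/train/{i:05d}.jpg" for i in range(1000)]
subset = sample_subset(all_images, fraction=0.10, seed=0)
print(len(subset), subset[:3])
```

As far as I know, a YOLOv5 dataset YAML can point its train entry at a text file of image paths, so the subset could be written out one path per line and evolved on without copying any images.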
It should be trivial to test the assumptions. Unfortunately, the gains should be more pronounced on larger datasets, for which I don't have the computing resources / time to run the full benchmark (single RTX 8000).
However, if there is a good open-source "large" dataset that people have already evolved, given the training set, initial and final hyper-parameters, we could run both 10% approaches and compare for POC. If it shows promise, we would want to test multiple variations / dataset sizes to properly model accuracy of each approach should resources become available.
Knowing when to switch to a sub-sampling approach (should it work) would be amazing when optimizing large models on a budget.
@glenn-jocher Do you have any example of WandB sweep YAML for YOLOv5 ? I'm confused about which method to use (--evolve or Sweep)
@Zegorax see https://wandb.ai/glenn-jocher/COCO128_evolve. This was a 300-generation evolution I ran normally, i.e. python train.py --evolve, not using the sweeps function.
How can I find the number of generations left in the hyperparameter tuning process?
@ya-stack you can monitor evolution progress by viewing your evolve.csv file. One row is added to this file per generation.
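Since one row is added per generation, the remaining generations can be estimated by counting rows. The sketch below assumes the file has a single header row followed by one data row per generation. The function names are made up, and the demo runs on a tiny stand-in file rather than a real evolve.csv.

```python
import csv
import tempfile
from pathlib import Path


def generations_done(evolve_csv) -> int:
    """Count data rows in evolve.csv (assumes one header row, one row per generation)."""
    with open(evolve_csv, newline="") as f:
        rows = [r for r in csv.reader(f) if any(cell.strip() for cell in r)]
    return max(0, len(rows) - 1)  # subtract the assumed header row


def generations_left(evolve_csv, target: int) -> int:
    """Generations still to run before reaching the requested target."""
    return max(0, target - generations_done(evolve_csv))


# Demo on a stand-in file; a real one would live at e.g. runs/evolve/exp/evolve.csv
with tempfile.TemporaryDirectory() as d:
    demo = Path(d) / "evolve.csv"
    demo.write_text("metrics/mAP_0.5, lr0\n0.41, 0.01\n0.43, 0.011\n")
    done = generations_done(demo)
    left = generations_left(demo, target=300)
    print(f"{done} done, {left} left")
```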
Hi,
Thanks for the wonderful effort in developing and maintaining the YOLOv5 repository.
I ran the following code with the intention of finding the optimal hyperparameters for a custom dataset (via Google Colab).
!python train.py --epochs 10 --img 416 --data gtsdb.yaml --weights runs/train/exp/weights/best.pt --cache --evolve
Based on what I have understood, the above command will run for 300 generations with 10 epochs per generation (3000 epochs in total).
In cases where the training gets interrupted due to limitations of Google Colab, could I please know the exact command which is required for resuming the hyperparameter evolution process?
I checked this issue as well, where you have instructed to keep evolve.txt in the yolov5 directory. Is it the evolve.csv that you mentioned?
I'm a little confused on what exactly needs to be done to resume the hyperparameter evolution process. Thank you again!
@pranathlcp 👋 Hello! Thanks for asking about resuming evolution.
Resuming YOLOv5 🚀 evolution is a bit different than resuming a normal training run with python train.py --resume. If you started an evolution run which was interrupted, or finished normally, and you would like to continue for additional generations where you left off, then you pass --resume and specify the --name of the evolution you want to resume, i.e.:
Start Evolution
Assume you evolve YOLOv5s on COCO128 for 10 epochs for 3 generations:
python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --evolve 3
If this is your first evolution a new directory runs/evolve/exp will be created to save your results.
# ├── yolov5
# └── runs
#     └── evolve
#         └── exp  ← evolution saved here
Start a Second Evolution
Now assume you want to start a completely separate evolution: YOLOv5s on VOC for 5 epochs for 3 generations. You simply start evolving, and your new evolution will again be logged to a new directory runs/evolve/exp2:
python train.py --epochs 5 --data VOC.yaml --weights yolov5s.pt --evolve 3
You will now have two evolution runs saved:
# ├── yolov5
# └── runs
#     └── evolve
#         ├── exp  ← first evolution (COCO128)
#         └── exp2  ← second evolution (VOC)
Resume an Evolution
If you want to resume the first evolution (COCO128 saved to runs/evolve/exp), then you use the same exact command you started with plus --resume --name exp, passing the additional number of generations you want, i.e. --evolve 30 for 30 more generations:
python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --evolve 30 --resume --name exp
Evolution will run for an additional 30 generations and all new results will be added to the existing runs/evolve/exp/evolve.csv.
Good luck and let us know if you have any other questions!
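If you script your runs (for example to restart after a Colab disconnect), the resume rule above can be sketched as a small helper that rewrites the original command. The function is hypothetical and only rearranges the flags already shown in this reply. It does not call or inspect train.py.

```python
import shlex


def resume_evolution_cmd(original_cmd: str, extra_generations: int, run_name: str) -> str:
    """Turn the original evolve command into a resume command (illustrative helper)."""
    args = shlex.split(original_cmd)
    if "--evolve" in args:
        i = args.index("--evolve")
        # drop the old generation count if one was given after --evolve
        if i + 1 < len(args) and args[i + 1].isdigit():
            del args[i + 1]
        del args[i]
    # same command as before, plus the new generation count and the run to resume
    args += ["--evolve", str(extra_generations), "--resume", "--name", run_name]
    return shlex.join(args)


cmd = resume_evolution_cmd(
    "python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --evolve 3",
    extra_generations=30,
    run_name="exp",
)
print(cmd)
```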
@glenn-jocher To perform hyperparameter evolution, should I train the model first and use the trained weights (best.pt) to perform the evolution?
@myasser63 you can evolve any scenario you want. Only you know what scenario you are interested in, I can't tell you that.
@glenn-jocher Can I know the difference between --evolve and a hyperparameter sweep? Is the sweep done on the runs of the evolution?
Could you add more instructions for W&B sweeps?
@myasser63 the two are very different, especially in regards to the Genetic Evolution algorithm they employ. I wrote the YOLOv5 hyperparameter evolution algorithm, W&B sweeps is a more general tool developed by W&B. @AyushExel @myasser63 is requesting we add additional content or links to this tutorial for W&B sweeps. Can you review the W&B content above and see if it needs updating?
Thank you very much for the detailed reply. I actually waited until the completion of the evolution to respond with my results. The resuming approach which you mentioned, worked perfectly and I could finally have a completed evolution. I have two questions though.
1. There are two files, hyp_evolve.yaml and hyp.yaml, but both of them have the same hyperparameter values except for the following commented-out lines.
# YOLOv5 Hyperparameter Evolution Results
# Best generation: 251
# Last generation: 301
# metrics/precision, metrics/recall, metrics/mAP_0.5, metrics/mAP_0.5:0.95, val/box_loss, val/obj_loss, val/cls_loss
# 0.73573, 0.54952, 0.69578, 0.58686, 0.022727, 0.0037956, 0.050243
What exactly is the difference between hyp_evolve.yaml and hyp.yaml? At the end of the evolution run, it is instructed to use hyp_evolve.yaml though.
2. In the evolve.png plots, the values given for hyperparameters are different from the values given in hyp_evolve.yaml. I was under the impression that evolve.png provides the best set of hyperparameter values based on the evolution run. Should we use the hyperparameter values from hyp_evolve.yaml or hyp.yaml or evolve.png?
(In my case though, the values of both hyp_evolve.yaml and hyp.yaml are the same.)
@myasser63 Responded to your other issue with more links.
@glenn-jocher the sweeps tutorial is up-to-date. In the second point, the path utils/wandb_logging/sweep.yaml needs to be changed to utils/logging/wandb/sweep.yaml
@AyushExel thanks, I've updated second point now to correct path!
@pranathlcp if you believe you have a reproducible bug please raise a new bug report issue, thank you!
If the evolution process is interrupted, how can I continue evolving?
@yizweithree 👋 Hello! Thanks for asking about resuming evolution.
Resuming YOLOv5 🚀 evolution is a bit different than resuming a normal training run with python train.py --resume. If you started an evolution run which was interrupted, or finished normally, and you would like to continue for additional generations where you left off, then you pass --resume and specify the --name of the evolution you want to resume, i.e.:
Start Evolution
Assume you evolve YOLOv5s on COCO128 for 10 epochs for 3 generations:
python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --evolve 3
If this is your first evolution a new directory runs/evolve/exp will be created to save your results.
# ├── yolov5
# └── runs
#     └── evolve
#         └── exp  ← evolution saved here
Start a Second Evolution
Now assume you want to start a completely separate evolution: YOLOv5s on VOC for 5 epochs for 3 generations. You simply start evolving, and your new evolution will again be logged to a new directory runs/evolve/exp2:
python train.py --epochs 5 --data VOC.yaml --weights yolov5s.pt --evolve 3
You will now have two evolution runs saved:
# ├── yolov5
# └── runs
#     └── evolve
#         ├── exp  ← first evolution (COCO128)
#         └── exp2  ← second evolution (VOC)
Resume an Evolution
If you want to resume the first evolution (COCO128 saved to runs/evolve/exp), then you use the same exact command you started with plus --resume --name exp, passing the additional number of generations you want, i.e. --evolve 30 for 30 more generations:
python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --evolve 30 --resume --name exp
Evolution will run for an additional 30 generations and all new results will be added to the existing runs/evolve/exp/evolve.csv.
Good luck and let us know if you have any other questions!
What is the specific name of the genetic algorithm? Is it differential evolution, the cross-entropy method, or something else?
@thsnhtung we use GA with gaussian mutation and elitism, no crossover, population size 1. I wrote it myself but I didn't name it. This same algorithm is also applied in AutoAnchor for anchor evolution. The details are here: https://github.com/ultralytics/yolov5/blob/7473f0f95dbc9ef9dd1706274906c99eac2ee2f9/train.py#L570-L606
Thanks for your reply, but I have a small problem with how to disable a hyperparameter. I used hyp.scratch.yaml but it automatically disables shear, perspective, flipud...
@thsnhtung you can prevent a hyperparameter from evolving during hyperparameter evolution by updating its key in the meta dictionary in train.py: https://github.com/ultralytics/yolov5/blob/540ef0dd30be9bcf6882c9625c49f61c5c764f52/train.py#L529-L559
I am using the following command: python train.py --img 512 --batch 32 --epochs 10 --data {yolov5_data}/data.yaml --cfg {yolov5_model}/models/custom_yolov5s.yaml --weights yolov5s.pt --name yolov5s_results
As per the documentation, it should use the 'data/hyps/hyp.finetune.yaml' file for the hyperparameters. However, I noticed another hyp.yaml file in the runs/train/yolov5s_results folder which has totally different values from the hyp.finetune.yaml file. Is the model using the 'hyp.yaml' in the results folder?
@mayukhberkeley --hyp argument is here: https://github.com/ultralytics/yolov5/blob/c2523be634a94da2b1b2a43c11b25827a0de990d/train.py#L445
I know we need to change the meta dictionary in train.py. I wonder how to do so, e.g. by setting the upper limit equal to the lower limit...
@thsnhtung setting mutation scale to 0 prevents a value from changing.
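To illustrate why a mutation scale of 0 freezes a value, here is a minimal sketch of a multiplicative Gaussian mutation driven by a meta dictionary of (gain, lower limit, upper limit) tuples. The keys, values, and function below are illustrative assumptions and are not copied from train.py.

```python
import random

# (mutation gain, lower limit, upper limit) per hyperparameter.
# Keys and numbers are illustrative, not taken from any particular train.py.
meta = {
    "lr0": (1, 1e-5, 1e-1),
    "shear": (0, 0.0, 10.0),  # gain 0 -> this value never changes
}


def mutate(hyp, meta, sigma=0.2, rng=None):
    """Scale each value by a Gaussian factor weighted by its gain, then clamp to limits."""
    rng = rng or random.Random()
    out = {}
    for k, value in hyp.items():
        gain, lo, hi = meta[k]
        factor = 1.0 + gain * rng.gauss(0.0, sigma)  # gain 0 gives a factor of exactly 1
        factor = min(max(factor, 0.3), 3.0)  # keep the multiplier in a sane range
        out[k] = min(max(value * factor, lo), hi)  # respect lower and upper limits
    return out


hyp = {"lr0": 0.01, "shear": 2.5}
child = mutate(hyp, meta, rng=random.Random(0))
print(child)
```

A side effect of multiplying rather than adding is that a value starting at exactly 0 stays at 0 in this sketch whatever its gain, which may be what is happening with shear, perspective and flipud above if their initial values are 0.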
@glenn-jocher your message here https://docs.ultralytics.com/yolov5/tutorials/hyperparameter_evolution#issuecomment-680685682 says that
"data/hyp.finetune.yaml will be automatically used by python train.py --weights yolov5s.pt"
My question was since I was using --weights yolov5s.pt, should it not have used data/hyp.finetune.yaml ?
@mayukhberkeley https://docs.ultralytics.com/yolov5/tutorials/hyperparameter_evolution#issuecomment-680685682 was out of date, have updated now.
Hi, I am trying to use evolve for my custom dataset on colab with this line of code: !python train.py --img 416 --batch 16 --epochs 10 --data {dataset.location}/data.yaml --weights yolov5s.pt --cache --evolve 5
And it gives the error on the first epoch: 0% 0/1410 [00:00<?, ?it/s]src/tcmalloc.cc:283] Attempt to free invalid pointer 0x3d5436903d7c68ec
How can I fix this?
📚 This guide explains hyperparameter evolution for YOLOv5 🚀. Hyperparameter evolution is a method of Hyperparameter Optimization using a Genetic Algorithm (GA) for optimization. UPDATED 28 March 2023.
Hyperparameters in ML control various aspects of training, and finding optimal values for them can be a challenge. Traditional methods like grid searches can quickly become intractable due to 1) the high dimensional search space 2) unknown correlations among the dimensions, and 3) expensive nature of evaluating the fitness at each point, making GA a suitable candidate for hyperparameter searches.
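As a rough illustration of the first point, the number of trainings a full grid search needs grows exponentially with the number of hyperparameters. The figures below are assumptions chosen for the example: about 30 hyperparameters, and only 5 candidate values for each.

```python
def grid_size(num_hyperparameters: int, values_per_dim: int) -> int:
    """Number of trainings required to evaluate every point of a full grid."""
    return values_per_dim ** num_hyperparameters


n = grid_size(num_hyperparameters=30, values_per_dim=5)
print(f"{n:.2e} trainings for a 5-value grid over 30 hyperparameters")
print(f"versus {300} trainings for a default 300-generation evolution")
```

Under these assumptions the grid has roughly 9.3 x 10^20 points, which is why a search that only evaluates a few hundred well-chosen points is attractive.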
Before You Start
Clone repo and install requirements.txt in a Python>=3.7.0 environment, including PyTorch>=1.7. Models and datasets download automatically from the latest YOLOv5 release.
1. Initialize Hyperparameters
YOLOv5 has about 30 hyperparameters used for various training settings. These are defined in *.yaml files in the /data/hyps directory. Better initial guesses will produce better final results, so it is important to initialize these values properly before evolving. If in doubt, simply use the default values, which are optimized for YOLOv5 COCO training from scratch.
https://github.com/ultralytics/yolov5/blob/2da2466168116a9fa81f4acab744dc9fe8f90cac/data/hyps/hyp.scratch-low.yaml#L2-L34
2. Define Fitness
Fitness is the value we seek to maximize. In YOLOv5 we define a default fitness function as a weighted combination of metrics: mAP@0.5 contributes 10% of the weight and mAP@0.5:0.95 contributes the remaining 90%, with Precision P and Recall R absent. You may adjust these as you see fit or use the default fitness definition (recommended).
https://github.com/ultralytics/yolov5/blob/4103ce9ad0393cc27f6c80457894ad7be0cb1f0d/utils/metrics.py#L12-L16
3. Evolve
Evolution is performed about a base scenario which we seek to improve upon. The base scenario in this example is finetuning COCO128 for 10 epochs using pretrained YOLOv5s. The base scenario training command is:
To evolve hyperparameters specific to this scenario, starting from our initial values defined in Section 1, and maximizing the fitness defined in Section 2, append --evolve.
The default evolution settings will run the base scenario 300 times, i.e. for 300 generations. You can modify generations via the --evolve argument, i.e. python train.py --evolve 1000.
https://github.com/ultralytics/yolov5/blob/6a3ee7cf03efb17fbffde0e68b1a854e80fe3213/train.py#L608
The main genetic operators are crossover and mutation. In this work mutation is used, with an 80% probability and a 0.04 variance, to create new offspring based on a combination of the best parents from all previous generations. Results are logged to runs/evolve/exp/evolve.csv, and the highest fitness offspring is saved every generation as runs/evolve/hyp_evolved.yaml.
We recommend a minimum of 300 generations of evolution for best results. Note that evolution is generally expensive and time consuming, as the base scenario is trained hundreds of times, possibly requiring hundreds or thousands of GPU hours.
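The loop described in Sections 2 and 3 can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the code in train.py: it always mutates the single best parent found so far (the guide describes a combination of the best parents), fake_train is a stand-in for a real training run, and all function names are made up. The 0.1/0.9 fitness weights, the 80% mutation probability, and the 0.04 variance (sigma 0.2) come from the text above.

```python
import random


def fitness(metrics):
    """Weighted sum of (P, R, mAP@0.5, mAP@0.5:0.95): 10% mAP@0.5, 90% mAP@0.5:0.95."""
    weights = (0.0, 0.0, 0.1, 0.9)
    return sum(w * m for w, m in zip(weights, metrics))


def mutate(parent, rng, prob=0.8, sigma=0.2):
    """Multiply each value by a Gaussian factor; sigma=0.2 corresponds to variance 0.04."""
    child = {}
    for k, v in parent.items():
        factor = 1.0
        if rng.random() < prob:  # mutate this value with 80% probability
            factor = min(max(1.0 + rng.gauss(0.0, sigma), 0.3), 3.0)
        child[k] = v * factor
    return child


def evolve(train_fn, hyp0, generations, seed=0):
    """Run the base scenario once per generation, always mutating the best parent so far."""
    rng = random.Random(seed)
    history = [(fitness(train_fn(hyp0)), hyp0)]
    for _ in range(generations):
        best_hyp = max(history, key=lambda t: t[0])[1]  # elitism (simplified to one parent)
        child = mutate(best_hyp, rng)
        history.append((fitness(train_fn(child)), child))
    return max(history, key=lambda t: t[0])


def fake_train(hyp):
    """Stand-in for a real training run: pretend mAP peaks at lr0 = 0.02."""
    m = max(0.0, 1.0 - abs(hyp["lr0"] - 0.02) * 20)
    return (0.5, 0.5, m, m)  # (P, R, mAP@0.5, mAP@0.5:0.95)


best_fit, best_hyp = evolve(fake_train, {"lr0": 0.01}, generations=50)
print(round(best_fit, 3), best_hyp)
```

Because the best result so far is never discarded, the returned fitness can only match or improve on the starting hyperparameters, which is the practical meaning of elitism here.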
4. Visualize
evolve.csv is plotted as evolve.png by utils.plots.plot_evolve() after evolution finishes, with one subplot per hyperparameter showing fitness (y axis) vs hyperparameter values (x axis). Yellow indicates higher concentrations. Vertical distributions indicate that a parameter has been disabled and does not mutate. This is user selectable in the meta dictionary in train.py, and is useful for fixing parameters and preventing them from evolving.
Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on MacOS, Windows, and Ubuntu every 24 hours and on every commit.