ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

HYPERPARAMETER EVOLUTION #392

Closed glenn-jocher closed 3 years ago

glenn-jocher commented 5 years ago

Training hyperparameters in this repo are defined in train.py, including augmentation settings: https://github.com/ultralytics/yolov3/blob/df4f25e610bc31af3ba458dce4e569bb49174745/train.py#L35-L54

We began with darknet defaults before evolving the values using the result of our hyp evolution code:

python3 train.py --data data/coco.data --weights '' --img-size 320 --epochs 1 --batch-size 64 --accumulate 1 --evolve

The process is simple: for each new generation, the prior generation with the highest fitness (out of all previous generations) is selected for mutation. All parameters are mutated simultaneously based on a normal distribution with about 20% 1-sigma: https://github.com/ultralytics/yolov3/blob/df4f25e610bc31af3ba458dce4e569bb49174745/train.py#L390-L396
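
A minimal sketch of that selection-and-mutation step (the file layout and variable names here are assumptions for illustration, not the exact train.py code):

import numpy as np

# evolve.txt: one row per generation; assume fitness is stored in the first column
x = np.loadtxt('evolve.txt', ndmin=2)
parent = x[x[:, 0].argmax(), 1:]  # hyps of the fittest prior generation

# mutate all hyperparameters simultaneously with ~20% 1-sigma gaussian noise
sigma = 0.2
child = parent * (1 + sigma * np.random.randn(parent.size))
child = child.clip(min=1e-8)  # keep values positive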

Fitness is defined as a weighted mAP and F1 combination at the end of epoch 0, under the assumption that better epoch 0 results correlate to better final results, which may or may not be true. https://github.com/ultralytics/yolov3/blob/bd924576048af29de0a48d4bb55bbe24e09537a6/utils/utils.py#L605-L608
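
In sketch form (the exact weights live in utils/utils.py at the link above; the values below are illustrative):

import numpy as np

def fitness(results):
    # results row: [P, R, mAP@0.5, F1]; weight mAP and F1, ignore P and R
    w = np.array([0.0, 0.0, 0.5, 0.5])
    return (np.array(results) * w).sum()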

An example snapshot of the results is shown below. Fitness is on the y-axis (higher is better). Plotted with: from utils.utils import *; plot_evolution_results(hyp)

YRunner commented 5 years ago

I had this problem: "shape '[16, 3, 85, 13, 13]' is invalid for input of size 56784". The problem is located in this line: p = p.view(bs, self.na, self.nc + 5, self.ny, self.nx).permute(0, 1, 3, 4, 2).contiguous()  # prediction. I'm a green hand and I'd appreciate any advice.

glenn-jocher commented 5 years ago

@YRunner this issue is dedicated only to hyperparameter evolution. Is your post in reference to this topic?

Chida15 commented 5 years ago

I got a result like this (evolve plot attached). Is it normal? The fitness is very low; should I train for more epochs?

glenn-jocher commented 5 years ago

@Chida15 haha, yes, well good job, you've run two different models here, the orange points, and it's showing you the best result highlighted in blue. For this to be effective you want to evolve hundreds of mutations. So I would change the for loop here to at least 200 generations. https://github.com/ultralytics/yolov3/blob/e77ca7e4d969cd2e3d1a741e648934a94575868d/train.py#L371

Chida15 commented 5 years ago


ok, thanks a lot!

sanazss commented 5 years ago

Hi. I am trying to plot the evolution results but get an error that hyp is not defined. I am using the latest version of the repo. Any hint on that? Thanks.

sanazss commented 5 years ago

I solved it.

glenn-jocher commented 5 years ago

@sanazss ah yes, you need to define hyp before running: from utils.utils import *; plot_evolution_results(hyp)
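
For example (hyp abbreviated here; copy the full dict from train.py):

from utils.utils import *  # provides plot_evolution_results

hyp = {'giou': 1.582, 'lr0': 0.002324}  # abbreviated; use the full dict from train.py
plot_evolution_results(hyp)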

varghesealex90 commented 5 years ago
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 120200
policy=steps
steps=70000,100000
scales=.1,.1

I see these params in the cfg file and would like to use the same parameters. In that case, what would the updated hyp look like?

hyp = {'giou': 1.582,  # giou loss gain
       'xy': 4.688,  # xy loss gain
       'wh': 0.1857,  # wh loss gain
       'cls': 27.76,  # cls loss gain  (CE should be around ~1.0)
       'cls_pw': 1.446,  # cls BCELoss positive_weight
       'obj': 21.35,  # obj loss gain
       'obj_pw': 3.941,  # obj BCELoss positive_weight
       'iou_t': 0.2635,  # iou training threshold
       'lr0': 0.002324,  # initial learning rate
       'lrf': -4.,  # final LambdaLR learning rate = lr0 * (10 ** lrf)
       'momentum': 0.97,  # SGD momentum
       'weight_decay': 0.0004569,  # optimizer weight decay
       'hsv_s': 0.5703,  # image HSV-Saturation augmentation (fraction)
       'hsv_v': 0.3174,  # image HSV-Value augmentation (fraction)
       'degrees': 1.113,  # image rotation (+/- deg)
       'translate': 0.06797,  # image translation (+/- fraction)
       'scale': 0.1059,  # image scale (+/- gain)
       'shear': 0.5768}  # image shear (+/- deg)
glenn-jocher commented 5 years ago

@varghesealex90 the hyp dictionary is fairly self-explanatory. In many cases the key names match what you have above, i.e. hyp['momentum'] etc.

The parameters we do not use are angle, hue, and burn_in. The LR scheduler hyps are already set to reduce at 80% and 90% of total epochs with scales of 0.1 and 0.1.

In any case, the hyps have been evolved to their present state because they improve performance over what you have, so I would not change them unless you are experimenting.
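
For reference, the schedule described above is roughly equivalent to this PyTorch MultiStepLR setup (a sketch, assuming the default 273-epoch COCO schedule):

import torch
from torch.optim.lr_scheduler import MultiStepLR

epochs = 273  # assumed full COCO schedule
model_params = [torch.zeros(1, requires_grad=True)]  # stand-in for model.parameters()
optimizer = torch.optim.SGD(model_params, lr=0.002324, momentum=0.97)
# drop LR by 0.1 at 80% and 90% of total epochs
scheduler = MultiStepLR(optimizer, milestones=[round(epochs * 0.8), round(epochs * 0.9)], gamma=0.1)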

DanielChungYi commented 4 years ago

I have a problem here: p = p.view(bs, self.na, self.nc + 5, self.ny, self.nx).permute(0, 1, 3, 4, 2).contiguous()  # prediction raises RuntimeError: shape '[6, 3, 10, 13, 13]' is invalid for input of size 18252. I still can't fix it; can anyone help me?

glenn-jocher commented 4 years ago

@DanielChungYi is your error reproducible in a new git clone?

DanielChungYi commented 4 years ago

@glenn-jocher I cloned the latest version of the code, but the problem is still there. Please help me.

DanielChungYi commented 4 years ago

(screenshot attached)

glenn-jocher commented 4 years ago

@DanielChungYi ok, I see. It's likely an issue with your custom dataset, as we cannot reproduce this on the coco data. Unless you can supply a minimum reproducible example on the coco dataset, there is not much we can do.

glenn-jocher commented 4 years ago

@DanielChungYi also check your cfg and your number of classes, as you might have a mismatch.

millermuttu commented 4 years ago

I am getting the same error too.

millermuttu commented 4 years ago
p = p.view(bs, self.na, self.nc + 5, self.ny, self.nx).permute(0, 1, 3, 4, 2).contiguous()  # prediction

RuntimeError: shape '[64, 3, 8, 10, 10]' is invalid for input of size 19200

millermuttu commented 4 years ago

@DanielChungYi did you solve this problem?

Pari-singh commented 4 years ago

Hi @glenn-jocher, could you explain what "generations" and "mutations" mean? I am a bit confused: I ran train.py with --evolve for 8 epochs (planning to train with 80 epochs) and there was 1 generation to evolve (i.e., in train.py, for _ in range(1)). Is that the correct way to do it? I did get the evolve.txt though.

glenn-jocher commented 4 years ago

@Pari-singh generations is the number of generations to evolve for. For optimal results I recommend at least 100 generations, preferably 300 or more.

Generations is set at 1 due to a memory leak bug. This example shows COCO evolution code using a bash while loop as a workaround:

while true
do
  python3 train.py --weights '' --prebias --img-size 512 --batch-size 16 --accumulate 4 --evolve --epochs 27 --device 4
done

A mutation is a change to the genome of the offspring, what differentiates it from its parent. This repo uses a form of communal asexual evolution, where the best prior example of all possible ancestors is mutated to create the next offspring.

Walstruzz commented 4 years ago

Hi, was the weight yolov3.pt trained using

hyp = {'giou': 1.2,  # giou loss gain 
        'xy': 4.062,  # xy loss gain 
        'wh': 0.1845,  # wh loss gain 
        'cls': 15.7,  # cls loss gain 
        'cls_pw': 3.67,  # cls BCELoss positive_weight 
        'obj': 20.0,  # obj loss gain 
        'obj_pw': 1.36,  # obj BCELoss positive_weight 
        'iou_t': 0.194,  # iou training threshold 
        'lr0': 0.00128,  # initial learning rate 
        'lrf': -4.,  # final LambdaLR learning rate = lr0 * (10 ** lrf) 
        'momentum': 0.95,  # SGD momentum 
        'weight_decay': 0.000201,  # optimizer weight decay 
        'hsv_s': 0.8,  # image HSV-Saturation augmentation (fraction) 
        'hsv_v': 0.388,  # image HSV-Value augmentation (fraction) 
        'degrees': 1.2,  # image rotation (+/- deg) 
        'translate': 0.119,  # image translation (+/- fraction) 
        'scale': 0.0589,  # image scale (+/- gain) 
        'shear': 0.401}  # image shear (+/- deg) 

and --img-size 320? Thanks.

glenn-jocher commented 4 years ago

@Walstruzz yolov3.pt is yolov3.weights, the original darknet weights, converted to pytorch format.

Alchemist77 commented 4 years ago

Hi,

Thank you for your post.

I am trying to find good hyperparameters for my own dataset, but after two repetitions I got this error when running:

python3 train.py --data data/my_own.data --img-size 320 --epochs 1 --batch-size 16 --accumulate 4 --evolve --weights ''
train.py:463: RuntimeWarning: invalid value encountered in true_divide
  x = (x[:n] * w.reshape(n, 1)).sum(0) / w.sum()  # new parent

WARNING: non-finite loss, ending training  tensor([nan, nan, nan, nan], device='cuda:0')
  0%|                                                                                                | 0/56 [00:02<?, ?it/s]

     shear  translate       giou      hsv_s   fl_gamma   momentum        cls      scale        lr0    degrees      hsv_v      iou_t  weight_decay        obj     obj_pw      hsv_h        lrf     cls_pw
       nan        nan        nan        nan        nan        nan        nan        nan        nan        nan        nan        nan           nan        nan          1        nan        nan          1
Evolved fitness:          0         0         0         0         0         0         0

          shear: nan
      translate: nan
           giou: nan
          hsv_s: nan
       fl_gamma: nan
       momentum: nan
            cls: nan
          scale: nan
            lr0: nan
        degrees: nan
          hsv_v: nan
          iou_t: nan
   weight_decay: nan
            obj: nan
         obj_pw: 1
          hsv_h: nan
            lrf: nan
         cls_pw: 1

workspace/pytorch/yolov3/utils/datasets.py:711: RuntimeWarning: invalid value encountered in greater
  i = (w > 4) & (h > 4) & (area / (area0 + 1e-16) > 0.1) & (ar < 10)
workspace/pytorch/yolov3/utils/datasets.py:711: RuntimeWarning: invalid value encountered in less
  i = (w > 4) & (h > 4) & (area / (area0 + 1e-16) > 0.1) & (ar < 10)

I think the problem described above occurs when I put --weights ''. (I also ran python3 train.py --data data/coco_64img.data --img-size 320 --epochs 1 --batch-size 16 --accumulate 1 --evolve --weights '', but it was not working either. However, when I use a pretrained weight, it works.)

Should I first train on my own dataset to get my own weights, and then evolve the hyperparameters with those weights? Or is it a problem with my dataset? (I have already finished training 1000 epochs on my own dataset from scratch.)

glenn-jocher commented 4 years ago

@Alchemist77 you can start your training from prior weights or not, it's up to you. Prior weights generally give you results much sooner, though we get the best results on COCO when training from scratch. If you see this error I would simply restart the evolution. It picks up where it left off, reading evolve.txt to select the best parent, so no work is lost.

rabdulatipoff commented 4 years ago

Excellent job on this repo and the evolution algorithm! I tried it for my weights (transfer learning from yolov3.pt on my dataset of 1200/400 pics, then converting them to darknet format and back in order to reset epochs before training), and I noticed that some of the parameters look a bit off, both on the graphs and in the print_mutation() output. I'm not sure, but it feels like lrf and giou, for instance, have switched places in this example; these should be the results from 22 generations AFAIK. I know I should train for longer, but I'm not sure if there's a flaw in the algorithm in this regard. What do you think? Thanks in advance! (evolve plot and stdout attached)

okanlv commented 4 years ago


@glenn-jocher how could I reproduce the memory leak bug? I could try to find a solution. I tried increasing the range in the following line, but that did not result in a memory leak.

https://github.com/ultralytics/yolov3/blob/b87bfa32c36f582d21ac3da7b21d9d9178d339ba/train.py#L471

glenn-jocher commented 4 years ago

@okanlv ideally line 471 there would read for _ in range(300): to evolve for 300 generations (which roughly seems to be a good point of diminishing returns), but this causes the GPU memory to grow every generation, by maybe 1GB per generation (!), so for example if your training uses 9GB on a 2080Ti then you will get a CUDA out of memory error after only a few generations. Hence the bash while loop workaround.

To reproduce quickly, you could use a full size yolov3-spp.cfg on a tiny dataset like this:

python3 train.py --cfg yolov3-spp.cfg --weights '' --epochs 10 --data coco_64img.data --evolve
glenn-jocher commented 4 years ago

@rabdulatipoff we updated the evolution results plotting about a month ago to fix a bug in the plot labels. You might want to git pull and try plotting again.

okanlv commented 4 years ago

@glenn-jocher hey, I didn't get a CUDA OOM error with that command. I tried some other things as well, but it worked without any problem. I used the latest stable PyTorch in my environment. If you are still getting that error, I could suggest a few things.

glenn-jocher commented 4 years ago

@okanlv oh! Maybe a PyTorch bug has been fixed in the latest releases. Thanks for looking, I'll try running it over here.

DeepLearning723 commented 4 years ago


Hi, thanks for your work. I see the result of your hyp evolution like this (screenshot attached). For example, the value of lr0 is 0.00146 in your image. Is this the optimal value found by plot_evolution_results(hyp)?

glenn-jocher commented 4 years ago

@DeepLearning723 yes, the blue point on the x-axis.

tanujkamde commented 4 years ago

@glenn-jocher I trained my model for student detection in a classroom using a top-view camera, and it works fine. But when I changed the camera position the detections are not like before. I am using YOLOv3-tiny here. (Screenshots attached: results for the training camera position and for the changed camera position.)

Please suggest how I can improve detection for the changed camera position. I trained my YOLOv3-tiny model for 500,000 steps; I also checked the 2k and 3k step checkpoints, but it didn't work.

Ownmarc commented 4 years ago

@tanujkamde you should include training images from any position the camera may have during inference; otherwise you are overfitting to a certain camera angle and you will always get bad results like you did.

@glenn-jocher, do you know if the memory leak is fixed for evolve?

@okanlv, what was your torch version when you didn't get the memory leak? Also, what was the size of the dataset you were training on?

Edit: I started evolving and monitored GPU VRAM on every training run, and everything seems to be going well (I modified the code in train.py and didn't use the bash loop):

evolution 0: 8994MiB
evolution 1: 9160MiB
evolution 2: 9146MiB
evolution 3: 9146MiB
evolution 4: 9130MiB
tanujkamde commented 4 years ago

@Ownmarc thanks for the reply. I will also try training with images from different camera angles. No, I don't have any idea about the memory leak. Will you please help me out with this?

glenn-jocher commented 4 years ago

@Ownmarc the memory leak remains a mystery. It's still present on coco2014.data training with yolov3-spp.cfg, but on smaller datasets like coco64.data I can't reproduce it. So perhaps you're good to go. In any case the two options (which are functionally identical) are in https://github.com/ultralytics/yolov3/issues/392#issuecomment-565475680. If you have no memory leak you can of course keep using the python for loop in train.py. The evolution code has been updated recently to a lower mutation probability, which is more in line with common practices I believe.

For our own work we are evolving partially trained datasets. So for example if full training is 1000 epochs, we evolve based on the fitness after 100 or 200 epochs. And remember you need to ideally run several hundred generations for best results. If you supply a GCP bucket folder under --bucket you can direct multiple VMs to evolve in parallel based on the same gs://bucket/evolve.txt file, as all VMs will read from and write to the same file.
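
For example, each VM could run something like this (bucket name illustrative):

python3 train.py --data data/coco.data --epochs 27 --evolve --bucket yourbucket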

@tanujkamde your training data needs to span the entire parameter space you expect to run inference on. This means that if you expect to run inference at location A or with camera X you need to include labelled examples of those in the training data, otherwise your dataset lacks variety, regardless of the quantity.

yoga-0125 commented 4 years ago

I ran train.py with --evolve using the default hyp values in the file. However, evolve.png only shows one point in each figure (the default parameter value). How do I solve this? (evolve plot attached)

glenn-jocher commented 4 years ago

@yoga-0125 evolve is currently only set to run for 1 generation, but you can modify this here. Ideally you want to run for several hundred generations, which should take some time. See https://github.com/ultralytics/yolov3/issues/392#issuecomment-576919931 for details. https://github.com/ultralytics/yolov3/blob/578e7f9500bb94d36e5a7d72b2402d6933189969/train.py#L445-L450

kossolax commented 4 years ago

Hello,

How do you handle the non-determinism of the training? Do you just use more epochs and generations?

Thanks

glenn-jocher commented 4 years ago

@kossolax see https://pytorch.org/docs/stable/notes/randomness.html

The repo is intended for full reproducibility, so in theory if you train twice you should end up with identical models, but in practice this rarely happens, especially across different hardware and with the different random number generators spread across numpy, python, and pytorch.
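
A typical seeding routine along those lines (a sketch; even with all of this, bitwise reproducibility across hardware is not guaranteed):

import random
import numpy as np
import torch

def init_seeds(seed=0):
    random.seed(seed)          # python RNG
    np.random.seed(seed)       # numpy RNG
    torch.manual_seed(seed)    # pytorch CPU and CUDA RNGs
    # trade speed for determinism in cuDNN
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False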

hackobi commented 4 years ago

If I use the --evolve flag the model doesn't save the weights. Is that intended or am I missing something?

Ringhu commented 4 years ago

Hi @glenn-jocher, I have a question about the --evolve flag. I set

for _ in range(1):  # generations to evolve
    if os.path.exists('evolve.txt'):  # if evolve.txt exists: select best hyps and mutate
        # Select parent(s)
        x = np.loadtxt('evolve.txt', ndmin=2)
        parent = 'single'  # parent selection method: 'single' or 'weighted'

and changed generations from 1 to 10, but still got only one point in the figure. Did I miss something? (evolve plot attached)

glenn-jocher commented 4 years ago

@hackobi no, the weights are not saved. The purpose of --evolve is to train hundreds of times with many different variations of the hyperparameters. The fitness and hyperparameters are saved in evolve.txt, from best to worst. Once you've evolved long enough you can use the best hyperparameters to train a model normally, which will be saved.

@Ringhu the plots visualize the results in evolve.txt, so if you look there you should see 10 rows. If all of the rows are the same you are doing something wrong though.
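
For instance, something like this recovers the best row after evolving (a sketch, assuming rows are sorted best to worst as described):

import numpy as np

x = np.loadtxt('evolve.txt', ndmin=2)
best = x[0]  # first row is the fittest generation
print('best results + hyps:', best)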

hackobi commented 4 years ago

that makes sense @glenn-jocher thanks for the quick reply!

Ringhu commented 4 years ago

@glenn-jocher thank you for your reply. I set epochs to 2 and it works; after every 2 epochs it generates a row. But I don't know how many epochs each generation needs. Are 2 epochs enough, or does each generation need 273 or more?

glenn-jocher commented 4 years ago

@Ringhu there are no hard rules here. For smaller datasets (and in an ideal world) you might evolve using the full number of epochs.

For larger datasets like COCO you might evolve based on a shorter number of epochs to save time, though beware that the best hyps for short term results will not correlate 100% with the best long term hyps.

In practice for COCO we use 10% of full training to evolve, 27 epochs.

qwe3208620 commented 4 years ago

@glenn-jocher 27 epochs (10% of full training) × 300 generations means we will need to spend about 30× the COCO training time? Or will the training be faster here? Thanks.

glenn-jocher commented 4 years ago

@qwe3208620 yes, that is correct. Of course, you can also evolve for fewer generations to arrive at a slightly reduced-fitness solution.

A genetic algorithm by definition requires multiple evaluations of the cost function.
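
(As a quick check on the 30× figure: 300 generations × 27 epochs = 8,100 epochs, and 8,100 / 273 epochs for one full COCO training ≈ 30.)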

logic03 commented 4 years ago

Hello, I don't understand how the hyperparameter evolution mechanism here implements a genetic algorithm; could you please explain? Also, my dataset only has 1128 pedestrians. Is it reasonable to set the number of generations to 10 for hyperparameter evolution?