sherpa-ai / sherpa

Hyperparameter optimization that enables researchers to experiment, visualize, and scale quickly.
http://parameter-sherpa.readthedocs.io/
GNU General Public License v3.0

Why does Sherpa produce repeated hyperparameter configurations? #42

Closed AlexFuster closed 5 years ago

AlexFuster commented 5 years ago

Hey guys,

I'm running mnist_mlp.ipynb from examples/ and, in my case, at a certain point the algorithm starts repeating hyperparameter combinations:

Trial 6: {'num_units': 111, 'activation': 'relu'}
Trial 7: {'num_units': 111, 'activation': 'relu'}
Trial 10: {'num_units': 111, 'activation': 'relu'}
Trial 14: {'num_units': 111, 'activation': 'relu'}

I don't understand how this could be useful. Shouldn't it keep track of past trials and avoid re-evaluating combinations?

After spamming {'num_units': 111, 'activation': 'relu'} (which is the best configuration found so far), it converges to non-optimal configurations:

Trial 26: {'num_units': 116, 'activation': 'sigmoid'}
... many trials of similar configurations ...
Trial 72: {'num_units': 115, 'activation': 'sigmoid'}

I know you are just wrapping this algorithm from another repo, but since I'm getting this while running one of your examples, I guess you have probably faced these questions as well.

thanks

AlexFuster commented 5 years ago

The repeated combinations don't seem to be specific to GPyOpt. I tried sherpa.algorithms.Genetic and it also tries repeated combinations:

Trial 42: {'hidden_size': 264, 'n_layers': 2, 'activation': <function relu at 0x7f1ffe869268>, 'lr': 0.009493107893987409, 'dropout': 0.13462578033866046}

Trial 64: {'hidden_size': 264, 'n_layers': 2, 'activation': <function relu at 0x7f1ffe869268>, 'lr': 0.009493107893987409, 'dropout': 0.13462578033866046}

Can anyone explain why this is happening? I am confused because I know this has been tested many times, so I'm probably just missing something conceptually evident, but I can't figure out what.

LarsHH commented 5 years ago

Hi Alex,

I had to dig a little to figure this out.

1) Bayesian Optimization/GPyOpt repeats trials: I am working on fixing this. Essentially, to model a Sherpa Discrete variable we used a GPyOpt continuous variable and discretized it. Therefore, if GPyOpt tried 111.4 ( => 111) and then wants to try 111.1 ( => 111), you get a repeat. It turns out that we don't actually have to do that, since GPyOpt has functionality to take care of this (see the first sketch below).

2) Genetic algorithm: this is a separate issue from the first and an artifact of the algorithm. If the best parameter setting does not change for a while, then eventually you're bound to get a case where all five parameters are copied from that best parameter setting. @colladou do you have any comments? I don't know exactly what the bounds look like ( https://github.com/sherpa-ai/sherpa/blob/14bc6aa7975e876bc9a60a33105fa3774cf4a403/sherpa/algorithms/core.py#L678 ), but it seems to me that the chance of getting a duplicate may not be that low (see the second sketch below).
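
For anyone following along, here is a minimal sketch of the difference in how the search space can be declared in GPyOpt (the parameter name and range are made up for illustration):

```python
# Rounding a continuous proposal to an int is what produced the repeats:
# 111.4 and 111.1 are different points to the optimizer, but both map to 111.
int(round(111.4)) == int(round(111.1))  # True

# A continuous GPyOpt variable that Sherpa then discretizes itself:
continuous_spec = {'name': 'num_units', 'type': 'continuous', 'domain': (100, 300)}

# GPyOpt's own discrete variable type: proposals come directly from the
# integer grid, so no rounding step is needed on Sherpa's side.
discrete_spec = {'name': 'num_units', 'type': 'discrete',
                 'domain': tuple(range(100, 301))}
```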
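
And a toy back-of-the-envelope estimate for the Genetic case (this is not Sherpa's actual implementation; the per-parameter copy probability p is purely hypothetical):

```python
# If each of the 5 parameters is copied unchanged from the current best
# configuration with probability p, an exact duplicate of that configuration
# becomes quite likely over the course of a study.
p = 0.5                            # hypothetical per-parameter copy probability
n_params = 5
per_trial = p ** n_params          # chance a single trial is an exact duplicate
n_trials = 50
at_least_one = 1 - (1 - per_trial) ** n_trials
print(f"P(duplicate in one trial)     = {per_trial:.3f}")      # ~0.031
print(f"P(duplicate within {n_trials} trials) = {at_least_one:.2f}")  # ~0.80
```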

AlexFuster commented 5 years ago

Nice, that was a tricky one. I'm gonna try all the algorithms available in Sherpa to see if they repeat combinations.

AlexFuster commented 5 years ago

By the way, the Genetic algorithm is not in the docs: https://parameter-sherpa.readthedocs.io/en/latest/algorithms/algorithms.html# I could only find it in the code.

AlexFuster commented 5 years ago

I tested the population-based algorithm and it doesn't seem to produce repeated configurations, so I will use it instead of the Genetic algorithm. I think we should focus on the bug with GPyOpt.

AlexFuster commented 5 years ago

Is the issue with GPyOpt on the roadmap?

AlexFuster commented 5 years ago

I see this issue is closed in #45

AlexFuster commented 5 years ago

It seems that the fix for #48 has caused this to happen again. Using mnist_mlp I get tons of repeated combinations. Could you please run that example and see if you can reproduce the error?

LarsHH commented 5 years ago

That is strange. GPyOpt should now be producing integer values (though stored as floats), and Sherpa just casts them to int afterwards, so there shouldn't be any repetition due to rounding. I'm looking into it.

LarsHH commented 5 years ago

So after 18 trials I also get some repetition (screenshot of the trial results attached). You said that it doesn't converge to the global optimum? Do you remember what the global optimum was? I think one issue might also be that there is more variation within the hyperparameter settings (for the good ones) than between them.

AlexFuster commented 5 years ago

It seems to reach the global optimum, which is around:

{'Trial-ID': 40, 'Iteration': 6, 'activation': 'relu', 'num_units': 119, 'Objective': 0.06359722506762482}

But in my case, it ended up converging to a suboptimal configuration:

Trial 57: {'num_units': 122, 'activation': 'sigmoid', 'Objective': 0.0702}
...
Trial 168: {'num_units': 122, 'activation': 'sigmoid', 'Objective': 0.0770}

I think (and this is a task for the user, not for Sherpa) that for neural networks the seed used to initialize the network should be kept constant across trials, since a configuration can obtain a lower loss than a better one just by chance, due to its initialization. I am going to implement it in the Keras example and I'll tell you if that fixes the convergence issue.
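
For concreteness, a minimal sketch of what I mean, assuming a TensorFlow 2.x Keras setup (older versions use tf.set_random_seed instead of tf.random.set_seed):

```python
import random

import numpy as np
import tensorflow as tf

def seed_everything(seed=0):
    # Call this at the start of every trial so each configuration starts
    # from the same weight initialization and data shuffling; differences
    # in the objective then reflect the hyperparameters, not random luck.
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
```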

About the repetitions, I just want to know whether they are an artifact produced by GPyOpt or whether there is a bug in Sherpa that produces them. In case it is GPyOpt's fault, it would be nice for the sherpa Study to keep a record of the tried configurations and their objectives, so that when the algorithm requests a repeated configuration, the study just returns the saved objective instead of training a model.
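
Something along these lines (just a sketch of the idea, not Sherpa's actual API; the evaluate/train_fn names are made up):

```python
# Hypothetical cache keyed by the hyperparameter combination: if the
# algorithm proposes a configuration that was already evaluated, reuse
# the stored objective instead of retraining a model.
seen = {}

def evaluate(parameters, train_fn):
    key = tuple(sorted(parameters.items()))
    if key in seen:
        return seen[key]              # repeated configuration: skip training
    objective = train_fn(parameters)  # expensive: train and evaluate the model
    seen[key] = objective
    return objective
```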

LarsHH commented 5 years ago

Hi Alex, I'm just now revisiting this and thinking about what you said. One option would be to have a general "unique values" flag somewhere that assures that no values are repeated. I think the user should be able to specify that since people may think differently about whether they want to have repeated values or not. It would also keep the interface cleaner since repetition could in theory happen with many other algorithms too (e.g. Random Search with discrete parameter options).

AlexFuster commented 5 years ago

Hi Lars,

I think that is a good and general solution. The little overhead added by the search is nothing compared to the cost of retraining a combination.

My only concern is that if you have something like this:

```python
if unique_values:
    repeated = True
    while repeated:
        parameters = generate_hyperparameter_combination()
        repeated = already_tried(parameters)
    return parameters
```

you will enter an infinite loop when the algorithm converges, so make sure to detect that convergence when it happens in order to finish the study.
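
For example, bounding the number of attempts would avoid the hang (generate_hyperparameter_combination and already_tried are the same hypothetical helpers as above):

```python
def next_unique_parameters(max_attempts=100):
    # Give up after a fixed number of attempts instead of spinning forever
    # once the algorithm keeps proposing already-tried configurations.
    for _ in range(max_attempts):
        parameters = generate_hyperparameter_combination()
        if not already_tried(parameters):
            return parameters
    return None  # signal the study that the search has effectively converged
```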

LarsHH commented 5 years ago

Closing this issue: this is now a task: https://github.com/sherpa-ai/sherpa/projects/1#card-28465658