msmbuilder / osprey

🦅Hyperparameter optimization for machine learning pipelines 🦅
http://msmbuilder.org/osprey
Apache License 2.0
74 stars · 26 forks

Gaussian Processes optimization doesn't always converge #226

Closed · RobertArbon closed this 7 years ago

RobertArbon commented 7 years ago

Just tried this on the current development release.

strategy:
  name: gp
  params:
    seeds: 5

search_space:
  C:
    min: 0.1
    max: 10
    warp: log
    type: float

  gamma:
    min: 1e-5
    max: 1
    warp: log
    type: float

cv: 5

dataset_loader:
  name: sklearn_dataset
  params:
    method: load_digits

trials:
    uri: sqlite:///osprey-trials.db

random_seed: 42
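
(For reference, I run this config with the worker command, roughly as below; the exact flags may differ between versions, so treat this as approximate.)

osprey worker config.yaml -n 20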

After running for 20 iterations:

  1. It sometimes crashes (numpy.linalg.linalg.LinAlgError: not positive definite, even with jitter.)
  2. When it doesn't crash, the results do seem better than grid search, but they don't show the smooth convergence you'd expect from a Gaussian Process optimization:

Grid Search:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Best Current Model = 0.969393 +- 0.017132
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
                 C       10.0
                 gamma   0.00017782794100389227

Gaussian Processes:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Best Current Model = 0.972732 +- 0.015372
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
                 C       7.969454818643936
                 gamma   0.0007459343285726547

[plot: results]

I have some questions regarding your implementation:

  1. The acquisition function seems odd - it's not the expected improvement a la Snoek. Why add the mean to the variance?
  2. The bounds used to define the maximisation of the acquisition function don't reflect the bounds of the variables - they're all 0 - 1.
  3. See my other question.

To investigate these issues I looked at how this code implements the optimization. It lets you use either the SciPy minimize() function or a random sampling algorithm to maximise the acquisition function. After having a play, I found that minimize() did not converge on a correct answer and did not show increasing/converging score values, whereas the random sampling method did converge. Spearmint uses an MCMC algorithm as well.

I've implemented the expected improvement a la Snoek, with a Matern52 kernel and a random sampling maximisation algorithm, in a new Osprey branch. I get the following results.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Best Current Model = 0.974958 +- 0.013960
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
                 C       10.0
                 gamma   0.0004987205335704034

[plot: results-rea]

They seem marginally better but maybe it's just this case.
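
For anyone curious, here is a minimal, self-contained sketch of what I mean (illustrative only, not the code in my branch): expected improvement a la Snoek, a Matern 5/2 surrogate, and random-sampling maximisation over the unit hypercube. The helper names and the use of scikit-learn's GaussianProcessRegressor are my own assumptions, not Osprey's API.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, best_y, xi=0.01):
    # EI for maximisation: E[max(f(x) - best_y - xi, 0)] under the GP posterior
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best_y - xi) / sigma
    return (mu - best_y - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def suggest_next(X, y, n_candidates=10000, seed=0):
    # X: previously evaluated points scaled to the unit hypercube; y: their CV scores
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)
    rng = np.random.RandomState(seed)
    candidates = rng.uniform(0.0, 1.0, size=(n_candidates, X.shape[1]))
    mu, sigma = gp.predict(candidates, return_std=True)
    ei = expected_improvement(mu, sigma, y.max())
    return candidates[np.argmax(ei)]  # best candidate, still in unit-cube coordinates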

Thanks for your time.

Rob

rmcgibbo commented 7 years ago

IIRC I originally wrote it to use the 68% CI upper bound as the acquisition function. I'm not sure why I chose that, but I assume it was related to the fact that it was easy to implement. I'm not sure about the development that's gone on more recently since others have been maintaining the code.

cxhernandez commented 7 years ago

The acquisition function seems odd - it's not the expected improvement a la Snoek. Why add the mean to the variance?

I think the idea of adding the variance to the mean in the acquisition function was to try to search for the largest possible gain in performance and reduce uncertainty in hyperparameter space. We weren't necessarily concerned with smooth convergence, as sampling similar values in variable-space might lead to redundancy in integer-space. There also wasn't necessarily any literature we were following.
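
In other words it is closer to an upper-confidence-bound style rule than to expected improvement. Roughly (a sketch of the idea, not the exact Osprey code):

def ucb(mu, sigma, kappa=1.0):
    # favour points whose optimistic estimate (mean plus a multiple of the
    # predictive standard deviation) is highest
    return mu + kappa * sigma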

The bounds used to define the maximisation of the acquisition function don't reflect the bounds of the variables - they're all 0 - 1.

Not sure I follow here. The optimization problem solved by scipy.optimize should have bounds [0, 1] for all input variables.
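
The search space is normalised onto [0, 1] before the GP and the acquisition maximiser see it (for log-warped variables this happens in log space), and suggestions get mapped back afterwards. As a rough illustration for the gamma variable above (a sketch, not the actual search-space code):

import numpy as np

lo, hi = np.log(1e-5), np.log(1.0)   # gamma bounds in log space

def to_unit(gamma):
    # map a log-warped value into [0, 1] for the GP / acquisition maximiser
    return (np.log(gamma) - lo) / (hi - lo)

def from_unit(u):
    # map a point suggested in [0, 1] back to the original gamma scale
    return np.exp(u * (hi - lo) + lo)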

See my other question

It's been a while, but IIRC I was following a tutorial from the GPy website (don't remember which). The idea behind the Fixed kernel was to incorporate the uncertainties in the score inferred from cross-validation into the GP model. It's probably unnecessary, and I'm open to changing this if it improves performance.
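
As a rough illustration of that idea (not the GPy tutorial code): the per-point cross-validation variances can be folded into the surrogate as heteroscedastic observation noise, for example via the alpha argument of scikit-learn's GaussianProcessRegressor:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def fit_noisy_gp(X, y_mean, y_var):
    # X: evaluated hyperparameters (unit cube); y_mean / y_var: mean and
    # variance of the CV score at each point. alpha adds y_var to the kernel
    # diagonal, which tells the GP how noisy each observation is.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  alpha=y_var, normalize_y=True)
    return gp.fit(X, y_mean)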

RobertArbon commented 7 years ago

RE: Acquisition function: OK, I got confused because the website said Expected Improvement, so I took that literally.
RE: Optimisation: OK, I didn't look properly in the searchspace class for the normalisation. My apologies (it was late on this side of the Atlantic).
RE: Kernels: OK, I'm comparing them now; if there's anything interesting to say I'll make a PR.

Thanks for your time. This code (along with MSMBuilder & OpenMM) is a delight to work with so keep up the good work.

rmcgibbo commented 7 years ago

Dev. notes: It would be awesome to add expected improvement and make the kernels more configurable.

cxhernandez commented 7 years ago

Thanks for your time. This code (along with MSMBuilder & OpenMM) is a delight to work with so keep up the good work.

Glad to hear it! Thank you for being an active contributor to our software!

RobertArbon commented 7 years ago

I'm writing configurable Kernels and Expected Improvement at the minute. Will make a PR in the next couple of days.

The Kernels can be added as a list of GPy entry points.
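
Something like this, as a hypothetical sketch of the config (the key names may well change before the PR lands):

strategy:
  name: gp
  params:
    seeds: 5
    acquisition: ei
    kernels:
      - GPy.kern.Matern52
      - GPy.kern.White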

cxhernandez commented 7 years ago

I'm writing configurable Kernels and Expected Improvement at the minute. Will make a PR in the next couple of days!

One thing to note though is that I'm interested in switching to GPFlow for this calculation, as it's more actively maintained.

RobertArbon commented 7 years ago

OK, cool. I'm writing it for my own benefit, but it looks like only minor changes would be needed to switch to GPFlow, as the interfaces look similar.

cxhernandez commented 7 years ago

done in #229