markchil / gptools

Gaussian processes with arbitrary derivative constraints and predictions.
GNU General Public License v3.0

Unconstrained Hyperparameter Optimization #6

Open altaetran opened 9 years ago

altaetran commented 9 years ago

Hey,

I noticed that the GaussianProcess.optimize_hyperparameters() method can move the parameters into regions where the covariance matrix is not invertible, and I think that loss of invertibility is what causes the failure. As a result, I am not able to get a correct optimization out of this method. I am testing this on the introductory example with a few random data points. Do you think there is a way to safeguard against this? Thanks!

Best,

Han

markchil commented 9 years ago

Han:

Yes, the default bounds are left rather wide so as to be uninformative, but with such wide bounds the optimizer can wander into regions where evaluating the covariance matrix runs into numerical problems.

I just pushed an update which attempts to get the optimizer to continue in such cases (by returning a log-likelihood of -infinity when an error occurs). But, since the initial guesses for the optimizer are distributed according to the hyperprior, this can end up simply getting the optimizer stuck in an area where it can’t make progress. (The same goes for applying MCMC to marginalize over the hyperparameters.)

You can use an explicit initial guess by setting random_starts=0 when calling optimize_hyperparameters. In this case, the starting point for the optimizer is whatever you set initial_params to when creating the Kernel (or have subsequently set the parameters to manually).
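
For reference, here is a minimal sketch of that usage; the data, kernel choice, and hyperparameter values are made up for illustration, and only the keywords named above are taken from this thread:

```python
import numpy as np
import gptools

# Toy data standing in for the introductory example:
x = np.linspace(0, 5, 10)
y = np.sin(x) + 0.1 * np.random.randn(10)

# initial_params fixes the starting point of the hyperparameters; the values
# and bounds below are arbitrary placeholders, not recommendations.
k = gptools.SquaredExponentialKernel(initial_params=[1.0, 1.0],
                                     param_bounds=[(0.01, 10.0), (0.01, 5.0)])
gp = gptools.GaussianProcess(k)
gp.add_data(x, y, err_y=0.1)

# random_starts=0: start the optimizer from initial_params (or whatever the
# parameters have since been set to) instead of random draws from the hyperprior.
gp.optimize_hyperparameters(random_starts=0)
```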

In any case, the best solution in this situation is to make your hyperprior more informative about the expected ranges of the hyperparameters: either use the param_bounds keyword when creating the Kernel, or build a more informative hyperprior in the form of a JointPrior instance and pass it in through the hyperprior keyword when creating the Kernel.
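
A minimal sketch of the first option is below; the bounds shown are illustrative only, and the comment simply restates the second option without constructing a specific JointPrior, since the right subclass depends on the problem:

```python
import gptools

# Tighter param_bounds make the implicit uniform hyperprior more informative
# (these ranges are placeholders, not recommendations for any real problem):
k = gptools.SquaredExponentialKernel(param_bounds=[(0.1, 5.0), (0.1, 2.0)])

# Alternatively, construct a JointPrior instance that encodes what you know
# about the hyperparameters and pass it as hyperprior=... when creating the
# Kernel (construction not shown here).
```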

-Mark

altaetran commented 9 years ago

I am currently playing around with different optimization techniques for the hyperparameter estimation problem, since I often run into problems where the log likelihood ends up orders of magnitude lower than previously. I will let you know what works out for me. Thanks.

markchil commented 9 years ago

I am not sure what you mean by “orders of magnitude lower than previously”. If you clarify, I can take a look at it and offer some advice.

If your model isn’t too expensive (i.e., you don’t have too many points), you can also use GaussianProcess.sample_hyperparameter_posterior() to get a picture of the parameter space to see if there are multiple modes, etc. You can set plot_posterior=True to make a scatterplot matrix. You will need the packages emcee and triangle (triangle is listed as triangle_plot on PyPI).
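
For reference, a minimal sketch of that diagnostic is below; it assumes a GaussianProcess named gp that already has data added (as in the earlier sketch), and plot_posterior is the only keyword shown because it is the one discussed here:

```python
# Requires the emcee and triangle packages to be installed.
# gp is assumed to be a GaussianProcess with data already added.
gp.sample_hyperparameter_posterior(plot_posterior=True)
```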

-Mark

altaetran commented 9 years ago

Sure. For instance, the updated log likelihood may be ~ -20000 when the log likelihood with the initial parameters is -50. I think the way to deal with this is to include the initial parameters as a seed point. Furthermore, I think there should be a difference between optimizing hyperparameters from scratch and optimizing hyperparameters after the data has been augmented. In the latter case, it makes more sense to start only from the previously found parameters and then to optimize around that point using a scheme that is not sensitive to small local minima.

markchil commented 9 years ago

Ah, got it. There are two things to be done about this:

1.) If you set random_starts=0 when calling optimize_hyperparameters, then the current state of the hyperparameters will be used for the initial guess. This should solve your case where you augment the data after having already optimized the hyperparameters (see the sketch after this list). Let me know if you find an optimizer that behaves better than SLSQP for this case: from what I’ve found, SLSQP does the best job while obeying the bounds (I kept having issues with unbounded optimizers getting trapped in unphysical regions), but I’ll be curious to hear about your experience on a different problem.

2.) You can improve the performance by setting a more informative hyperprior: either set param_bounds when creating the Kernel, or pass an explicit JointPrior instance via the hyperprior keyword. (The hyperprior is taken to be uniform over param_bounds if none is specified.)
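
To make point 1 concrete, here is a minimal sketch of re-optimizing after augmenting the data; gp is assumed to be a GaussianProcess that has already been optimized once (as in the earlier sketch), and the new data values are made up:

```python
import numpy as np

# New observations arriving after the first optimization:
x_new = np.array([5.5, 6.0, 6.5])
y_new = np.sin(x_new) + 0.1 * np.random.randn(3)
gp.add_data(x_new, y_new, err_y=0.1)

# With random_starts=0 the optimizer starts from the current (previously
# optimized) hyperparameters instead of random draws from the hyperprior.
gp.optimize_hyperparameters(random_starts=0)
```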

Basically, what happens if random_starts != 0 is that the optimizer is started from points drawn at random from the hyperprior. This was an attempt at making a simple type of global optimizer, since I found that the results with the optimizers in scipy.optimize.minimize() seemed to depend a little too much on the initial guess for my taste.

Also, I’m sure you’ve noticed this, but bear in mind that the objective function passed to the minimizer is -1*(log likelihood), so the value the minimizer reports should become lower than at the starting point even as the log likelihood itself increases. The correct way to access the log likelihood is through the attribute GaussianProcess.ll.
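
As a small check of that convention (continuing from the sketch above; whether gp.ll is populated before the hyperparameters have been updated at least once is an assumption I have not verified, so it is only read after the call here):

```python
gp.optimize_hyperparameters(random_starts=0)
# gp.ll is the log likelihood itself (not -1 * log likelihood), so a larger
# (less negative) value means a better fit.
print(gp.ll)
```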

-Mark