pytorch / botorch

Bayesian optimization in PyTorch
https://botorch.org/
MIT License

Questions about priors in SingleTaskGP model #616

Closed grmaier closed 3 years ago

grmaier commented 3 years ago

I am trying to understand how the SingleTaskGP model works. If I read the docs and the code correctly, the assumptions on the priors of the kernel in SingleTaskGP are as follows: Let f be the black-box function to be optimized, and x a point in the domain. Then f(x) has a normal distribution with mean 0 and variance k(x,x), where k is the Matern(5/2) kernel, i.e. k(x,x') = \theta_0^2 \exp(-\sqrt{5}r) (1 + \sqrt{5}r + \frac{5}{3}r^2) with r = \frac{|x - x'|}{\theta_1}, with output scale parameter \theta_0^2 distributed \Gamma(2.0, 0.15) and length scale parameter \theta_1 distributed \Gamma(3.0, 6.0). Moreover, we assume a homoskedastic noise level, so we obtain observations y = f(x) + \epsilon with \epsilon normally distributed with mean 0 and variance \sigma^2, where \sigma^2 is distributed \Gamma(1.1, 0.05).
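For concreteness, here is the kernel formula above as a minimal plain-Python sketch. The function name and default parameter values are mine, not BoTorch's; in BoTorch/GPyTorch this is implemented via a MaternKernel wrapped in a ScaleKernel, and `outputscale` below plays the role of \theta_0^2.

```python
import math

def matern52(x, x_prime, outputscale=1.0, lengthscale=1.0):
    """Matern-5/2 kernel as written above.

    outputscale corresponds to theta_0^2, lengthscale to theta_1
    (illustrative scalar version; the real kernel handles vectors).
    """
    r = abs(x - x_prime) / lengthscale
    s = math.sqrt(5.0) * r
    return outputscale * math.exp(-s) * (1.0 + s + (5.0 / 3.0) * r ** 2)

# At x == x' the kernel equals the outputscale, i.e. the prior variance of f(x):
print(matern52(0.3, 0.3, outputscale=2.0))  # → 2.0
```

Note that the kernel decays monotonically as |x - x'| grows, with the lengthscale controlling how quickly correlation falls off.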

This leads me to two questions: (1) Why do we assume that the above hyperparameters are gamma distributed and what explains the choice of the parameters in the gamma distributions? (2) In the documentation it's said that SingleTaskGP assumes noiseless observations. I guess that means that we measure f(x) directly instead of the corrupted observation y. But what is the reason then why we assume a homoskedastic noise level \epsilon in the first place? Wouldn't it make more sense to assume no noise level at all in this case? Maybe I misunderstand how the homoskedastic noise comes into play.

Balandat commented 3 years ago
  1. Putting a prior on the hyperparameters is a standard approach in Bayesian statistics. The prior is our belief (I won't get into the Frequentist vs. Bayesian discussion here) about the distribution of the hyperparameters before seeing any data. If we were fully Bayesian we would in fact estimate the posterior hyperparameter distribution given the data we've seen (typically done via MCMC sampling), but that's computationally expensive and in many situations not needed. Instead we optimize the marginal log likelihood, resulting in what's called a MAP estimate for the hyperparameters. You can view the chosen priors as rather mundane practical choices that work reasonably well in practice. One thing to note is that we usually assume the data is normalized to the unit hypercube, so lengthscales >> 1 don't make much sense, and you'll note that the chosen prior for the lengthscales has very little probability mass at large values. So basically, these priors have been chosen empirically since they have worked quite well for our applications (modeling results of large-scale A/B tests). However, depending on the application, other priors may be more appropriate (especially the one on the noise). In an ideal world, you would have some understanding of the particular problem you're trying to model and choose priors accordingly (this is kind of an art).
  2. Would you mind pointing me to where exactly in the documentation this is said? As you correctly suggest, SingleTaskGP in the canned version that we have in BoTorch assumes an unknown homoskedastic noise level that we infer together with the other model parameters.
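To make the lengthscale remark in point 1 concrete: GPyTorch's GammaPrior uses the (concentration, rate) parametrization, so the Gamma(3.0, 6.0) lengthscale prior has mean 0.5 and only about 6% of its mass above 1, consistent with inputs normalized to the unit cube. A quick sanity check in plain Python (the helper names are mine):

```python
import math

def gamma_mean(concentration, rate):
    # mean of a Gamma(concentration, rate) distribution
    return concentration / rate

def gamma_tail(shape, rate, x):
    # P(X > x) for an integer shape parameter, via the Erlang survival function:
    # exp(-rate*x) * sum_{k < shape} (rate*x)^k / k!
    rx = rate * x
    return math.exp(-rx) * sum(rx ** k / math.factorial(k) for k in range(shape))

print(gamma_mean(3.0, 6.0))   # mean lengthscale under the prior: 0.5
print(gamma_tail(3, 6, 1.0))  # prior mass above 1: ~0.062
```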
grmaier commented 3 years ago

Thanks very much for your detailed answer concerning my first question!

Regarding 2.: I was referring to what is said here https://botorch.org/docs/models under "Single-Task GPs". I guess what I don't understand is when to use SingleTaskGP and when FixedNoiseGP. If I know that my data is corrupted with some homoskedastic (known) noise \epsilon, I would use FixedNoiseGP, as I can only make corrupted observations y = f(x) + \epsilon. If I know that I can make exact observations, i.e. measure f(x) exactly, then I would use SingleTaskGP but without a prior homoskedastic noise level. So I don't understand why the provided version of SingleTaskGP seems to be a hybrid between these two approaches.

Balandat commented 3 years ago

Ah, I see.

So for the noiseless case, where you do know that there is no observation noise, we typically still use a very small noise level for numerical stability reasons. To achieve this you have two options (well, you have more if you want to write your own model, but let's stick with the canned models):

grmaier commented 3 years ago

Got it! Thanks for your quick and detailed response!

grmaier commented 3 years ago

@Balandat I have one further question: Can you explain when to use the model HeteroskedasticSingleTaskGP? If I know the noise corrupting my data, I would use the model FixedNoiseGP. In HeteroskedasticSingleTaskGP the noise is modeled by another SingleTaskGP model. In the documentation (https://botorch.org/api/_modules/botorch/models/gp_regression.html#HeteroskedasticSingleTaskGP) it is said that "this allows the likelihood to make out-of-sample predictions for the observation noise levels." I am not entirely sure what is meant by that. I guess that HeteroskedasticSingleTaskGP should be used when I know that my observations are corrupted by heteroskedastic noise, but the noise is unknown? On the other hand, HeteroskedasticSingleTaskGP requires an additional argument train_Yvar of observed measurement noise, so, as far as I understand, the noise does in fact have to be known?

Balandat commented 3 years ago

HeteroskedasticSingleTaskGP employs another GP model to model the noise. As you observed, it relies on noise observations (train_Yvar) to fit this model. There are two benefits to this:

  1. if your noise observations themselves are noisy, the model is able to regularize them rather than just using the noisy observations verbatim. This can result in a better model in such situations.
  2. as you found in the docs, it "allows the likelihood to make out-of-sample predictions for the observation noise levels". This matters if you need to reason about the observation noise you would face at new points that you have not yet evaluated. That is useful for lookahead Bayesian optimization approaches (such as the Knowledge Gradient), where one has to reason about the amount of information provided by new observations (and thus their observation uncertainty, rather than just the uncertainty in the latent function estimate).

If you don't need either of these, it's perfectly fine to use FixedNoiseGP. In fact, it's preferable, since the model has fewer parameters to estimate, which simplifies inference.