pytorch / botorch

Bayesian optimization in PyTorch
https://botorch.org/
MIT License

[Bug] Parameter Constraints Do Not Work #2542

Open kayween opened 2 months ago

kayween commented 2 months ago

šŸ› Bug

botorch.models.utils.gpytorch_modules implements a few utility functions that return kernels and likelihoods. Those functions should enforce constraints on kernel.lengthscale and likelihood.noise to make sure they are always positive.

However, the constraints do not work in some cases.

To reproduce

The following is a minimal working example showing that the parameter constraint does not work properly.

import torch
from botorch.models.utils.gpytorch_modules import get_gaussian_likelihood_with_gamma_prior

# By default the noise has a constraint GreaterThan(1.000E-04)
likelihood = get_gaussian_likelihood_with_gamma_prior()

# Suppose the gradient of the raw noise is 5 at some point during hyperparameter optimization.
# In general, the gradient could be any value.
likelihood.raw_noise.grad = torch.tensor([5.])

# Do a single step gradient descent on the noise
with torch.no_grad():
    for param in likelihood.parameters():
        param -= param.grad

# tensor([-3.0000], requires_grad=True) violates the constraint!
print(likelihood.noise)

# Let's evaluate the log prior of the likelihood noise as in gpytorch.mlls._approximate_mll
# https://github.com/cornellius-gp/gpytorch/blob/8825cdd7abd1db7dea5803265067d598f21d6962/gpytorch/mlls/_approximate_mll.py#L70-L71
name, module, prior, closure, _ = next(likelihood.named_priors())
log_prior = prior.log_prob(closure(module))

print("name {:s}, log prior {:f}".format(name, log_prior.item()))

The following is the output. The log prior is NaN because the noise is outside the support of the Gamma prior.

Parameter containing:
tensor([-3.0000], requires_grad=True)
noise_covar.noise_prior, log prior nan

Expected Behavior

  1. likelihood.noise is supposed to be greater than 1e-4.
  2. log_prior should not be NaN.

The above two should hold for any gradient value.

Additional context

I am working on aepsych (which relies heavily on botorch), where we use similar outputscale/lengthscale priors. I was fitting a GP model on a synthetic dataset and had NaN issues during hyperparameter optimization (I was using Adam). Those NaN issues might break LBFGS as well, e.g., via line search failures.

Those NaN issues are due to the argument transform=None.

https://github.com/pytorch/botorch/blob/85364681a84ac2276b423fe575f33db1e1df311b/botorch/models/utils/gpytorch_modules.py#L63-L71

GPyTorch enforces the constraint via a softplus transformation. However, if we override the argument with transform=None, the constraint's internal transform is None and enforced is False. As a result, no transformation is applied and the constraint is not enforced.

https://github.com/cornellius-gp/gpytorch/blob/8825cdd7abd1db7dea5803265067d598f21d6962/gpytorch/constraints/constraints.py#L173-L175
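The mechanics can be sketched without GPyTorch. Below is a stdlib-only stand-in for a GreaterThan constraint (the names `softplus` and `constrained_value` are illustrative, not the library's API), assuming the library computes `value = softplus(raw) + lower_bound` when a transform is set and returns the raw value unchanged when transform is None:

```python
import math

LOWER_BOUND = 1e-4  # default noise lower bound in botorch

def softplus(x):
    # numerically stable log(1 + exp(x))
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def constrained_value(raw, transform=softplus):
    # mimics the constraint logic: apply the transform plus the lower
    # bound when a transform is set; pass the raw value through when
    # transform is None
    if transform is None:
        return raw
    return transform(raw) + LOWER_BOUND

raw = -3.0  # raw noise after an unconstrained gradient step
print(constrained_value(raw))        # ~0.0487, above the 1e-4 bound
print(constrained_value(raw, None))  # -3.0, the constraint is violated
```

With the default transform, any raw value maps into (1e-4, inf); with transform=None, the raw value is the parameter value, so nothing stops it from going negative.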

In most cases, the prior pushes the hyperparameter towards the mode of the prior distribution. Thus, this NaN issue does not happen very often. However, I believe this could lead to unintended behavior, e.g., line search failure and early termination of LBFGS.

saitcakmak commented 2 months ago

Hi @kayween. Thanks for reporting. This issue would be better to raise on the GPyTorch side since the underlying logic is purely controlled by GPyTorch. We just package these modules together for convenience.

However, I believe this could lead to unintended behavior, e.g., line search failure and early termination of LBFGS.

We do see occasional line search failures during model fitting. If this is the root cause, it'd be great to see it fixed :)

saitcakmak commented 2 months ago

I am also not sure if your gradient step is the correct way to update the parameter values. When you do

with torch.no_grad():
    for param in likelihood.parameters():
        param -= param.grad

you're updating the raw_noise tensor in-place, skipping any constraint logic that is enforced by the noise setter in GPyTorch. If I replace this with a non-in-place operation,

likelihood.noise = likelihood.noise - likelihood.noise.grad

it will trigger the setter of HomoskedasticNoise.noise, which leads to this error:

ValueError: Invalid input value for prior noise_prior. Error:
Expected value argument (Parameter of shape (1,)) to be within the support (GreaterThanEq(lower_bound=0.0)) of the distribution GammaPrior(), but found invalid values:
Parameter containing:
tensor([-3.0000], requires_grad=True)
kayween commented 2 months ago

Hi @saitcakmak

you're updating the raw_noise tensor in-place, skipping any constraint logic that is enforced by the noise setter in GPyTorch

PyTorch leaf tensors that require gradients must be updated in place during gradient steps. So param -= param.grad is the standard way to do gradient descent.

Also, we do want to directly update the raw noise, as opposed to updating likelihood.noise. The way that gpytorch enforces positivity constraint is through a softplus transformation:

noise = torch.nn.functional.softplus(raw_noise) + 1e-4

So raw_noise is unconstrained and we can apply arbitrary gradient updates to it. After the softplus transformation, noise is always positive, satisfying the constraint.
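To illustrate the reparameterization, here is a stdlib-only sketch: gradient descent runs on the unconstrained raw value, and the transformed noise stays above the bound at every step (the target value 0.5 and the learning rate are arbitrary choices for the demo):

```python
import math

def softplus(x):
    # numerically stable log(1 + exp(x))
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def sigmoid(x):
    # derivative of softplus
    return 1.0 / (1.0 + math.exp(-x))

# minimize (noise - 0.5)^2 by gradient descent on the raw value
raw, lr = 5.0, 0.5
for _ in range(200):
    noise = softplus(raw) + 1e-4
    grad = 2.0 * (noise - 0.5) * sigmoid(raw)  # chain rule through softplus
    raw -= lr * grad  # unconstrained update; any sign or magnitude is safe
    assert softplus(raw) + 1e-4 > 1e-4  # constraint holds at every step

print(softplus(raw) + 1e-4)  # converges to ~0.5
```

No matter how aggressive the step on the raw value, the transformed noise can never violate the bound.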

kayween commented 2 months ago

@saitcakmak

The reason why the constraint is not enforced is due to this line in BoTorch. https://github.com/pytorch/botorch/blob/85364681a84ac2276b423fe575f33db1e1df311b/botorch/models/utils/gpytorch_modules.py#L68

The argument transform=None overrides the default gpytorch transformation, which disables the constraint. Removing this line does fix the NaN issues in my problem. So it is a BoTorch side issue.

We do see occasional line search failures during model fitting. If this is the root cause, it'd be great to see it fixed :)

I can attach a minimal working example later showing that removing this argument fixes NaN issues. And then we can test if this also resolves occasional line search failures.

saitcakmak commented 2 months ago

Ah, you're right. I always find this parameter / raw parameter setup confusing.

Going back through the code history, it looks like the noise constraint on the likelihood was added by @Balandat a very long time ago (https://github.com/pytorch/botorch/commit/5333d5b53f6f9d6fd32c5d844cf7d383b227d977), along with transform=None from the start. I wonder if he remembers why it was set up this way.

saitcakmak commented 2 months ago

https://github.com/pytorch/botorch/pull/2544 would remove transform=None across the board. I'll check with others to see if there are any concerns with removing this.

kayween commented 2 months ago

Going through the code history (https://github.com/pytorch/botorch/commit/5333d5b53f6f9d6fd32c5d844cf7d383b227d977), it looks like there were some discussions on this previously: https://github.com/cornellius-gp/gpytorch/issues/629

Note that constraints that have a transform that is not None automatically enforces the constraint by using a transform. This can be an issue for quasi 2nd order optimizers though b/c the objective becomes flat when overshooting past the effective constraint in the line search. Hence not doing the transform and imposing an explicit constraint is preferred. It may also be beneficial to use the transform in conjunction with an explicit bound - will have to evaluate that more.

Based on the commit message, it seems that transform=None is intentional so that the landscape is nicer for L-BFGS-B. Then, we rely on L-BFGS-B (not LBFGS) to handle the constraint explicitly. The catch is that we cannot use vanilla Adam/SGD anymore, since those do not handle the constraint explicitly. (I was using Adam to train the model, which is why I encountered the NaN issue.) So it might be a feature instead of a bug?
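For context, if transform=None stays, a first-order optimizer has to handle the bound itself. One common workaround (not something BoTorch does for you) is to project back into the feasible region after each step; a minimal sketch with illustrative names:

```python
LOWER = 1e-4  # default noise lower bound

def projected_step(value, grad, lr=1.0):
    # plain gradient step followed by projection onto [LOWER, inf)
    return max(value - lr * grad, LOWER)

noise = 2.0
noise = projected_step(noise, 5.0)  # unclamped result would be -3.0
print(noise)  # 1e-4: clamped to the bound instead of going negative
```

This is essentially what L-BFGS-B does internally with its explicit box constraints, and what vanilla Adam/SGD do not do.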

In any case, @Balandat probably has more comments on this.

Nevertheless, it is probably good to have consistency? Some modules have transform=None while others do not, e.g., https://github.com/pytorch/botorch/blob/85364681a84ac2276b423fe575f33db1e1df311b/botorch/models/utils/gpytorch_modules.py#L41-L50

Balandat commented 2 months ago

Based on the commit message, it seems that transform=None is intentional so that the landscape is nicer to LBFGS-B. Then, we rely on LBFGS-B (not LBFGS) to handle the constraint explicitly. The catch is that we cannot use (vanilla) Adam/SGD anymore (which do not handle the constraint explicitly). (I was using Adam to train the model and that's why I encountered the NaN issue..). So it might be a feature instead of a bug?

Yes, the idea was indeed that when using L-BFGS-B we want to impose explicit constraints on the parameters. The easiest way to do this is to not use a transform, in which case we can just constrain the "raw" value directly. If GPyTorch applies a transformation to enforce the constraint, then you can end up with issues as mentioned above.

Now this decision was made a long time ago and so we could potentially revisit this. If always using a transform works fine with L-BFGS-B then great. We may consider also using an explicit constraint in addition to the transform as mentioned above, but then we'd have to somehow apply the constraint in the transformed space in the optimizer. Yet another option could be to adaptively change the nature of the constraint (set the transform to None dynamically based on the optimizer used), but that seems somewhat challenging and also quite brittle.

saitcakmak commented 2 weeks ago

I came across this again yesterday and looked into the GPyTorch constraints a bit more. Seems like we could disable the application of constraints by mocking a simple property:

from unittest.mock import patch

with patch("gpytorch.constraints.constraints.Interval.enforced", False):
    ...

All constraints I could find subclass Interval and gatekeep constraint application behind self.enforced.
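The class-level patch works because enforced is looked up on the class, so every instance sees the replacement while the patch is active. A stdlib-only toy demonstrating the mechanism (the Constraint class below is hypothetical, standing in for GPyTorch's Interval):

```python
from unittest.mock import patch

class Constraint:
    # stand-in for gpytorch's Interval; `enforced` gates the transform
    @property
    def enforced(self):
        return True

    def transform(self, raw):
        # apply the "constraint" only when enforcement is on
        return abs(raw) if self.enforced else raw

c = Constraint()
print(c.transform(-3.0))  # 3.0: constraint applied

with patch.object(Constraint, "enforced", False):
    print(c.transform(-3.0))  # -3.0: enforcement disabled for every instance

print(c.transform(-3.0))  # 3.0 again: the property is restored on exit
```

Since the patch replaces the property on the class itself, it also covers constraints created inside the `with` block, not just pre-existing ones.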