kayween opened this issue 2 months ago
Hi @kayween. Thanks for reporting. This issue would be better raised on the GPyTorch side, since the underlying logic is controlled entirely by GPyTorch; we just package these modules together for convenience.

> However, I believe this could lead to unintended behavior, e.g., line search failure and early termination of LBFGS.
We do see occasional line search failures during model fitting. If this is the root cause, it'd be great to see it fixed :)
I am also not sure if your gradient step is the correct way to update the parameter values. When you do

```python
with torch.no_grad():
    for param in likelihood.parameters():
        param -= param.grad
```

you're updating the `raw_noise` tensor in-place, skipping any constraint logic that is enforced by the noise setter in GPyTorch. If I replace this with a non-in-place operation,

```python
likelihood.noise = likelihood.noise - likelihood.noise.grad
```

it will trigger the setter of `HomoskedasticNoise.noise`, which leads to this error:
```
ValueError: Invalid input value for prior noise_prior. Error:
Expected value argument (Parameter of shape (1,)) to be within the support (GreaterThanEq(lower_bound=0.0)) of the distribution GammaPrior(), but found invalid values:
Parameter containing:
tensor([-3.0000], requires_grad=True)
```
Hi @saitcakmak
> you're updating the `raw_noise` tensor in-place, skipping any constraint logic that is enforced by the noise setter in GPyTorch

PyTorch tensors that require gradients do need in-place updates during gradient steps, so `param -= param.grad` is the right way to do gradient descent.
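As a minimal illustration (my own sketch, not from this thread), an in-place update under `torch.no_grad()` is the standard manual gradient step and matches what `torch.optim.SGD` does with `lr=1.0`:

```python
# Minimal sketch: an in-place update under torch.no_grad() is the standard
# manual gradient step (equivalent to an SGD step with lr=1.0).
import torch

w = torch.nn.Parameter(torch.tensor([2.0]))
loss = (w ** 2).sum()  # toy objective; gradient is 2 * w
loss.backward()

with torch.no_grad():
    w -= w.grad        # in-place update keeps w a leaf parameter with grad tracking
print(w)               # Parameter with value tensor([-2.]), as expected
```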
Also, we do want to directly update the raw noise, as opposed to updating `likelihood.noise`. The way that gpytorch enforces the positivity constraint is through a softplus transformation:

```python
noise = torch.nn.functional.softplus(raw_noise) + 1e-4
```

So `raw_noise` is unconstrained and we can do any gradient updates on it. After the softplus transformation, `noise` is always positive, satisfying the constraint.
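For illustration, a small sketch (assuming GPyTorch's `GreaterThan` constraint with its default softplus transform) of how an unconstrained raw value maps to a positive noise value:

```python
# Illustrative sketch: a GreaterThan constraint with the default softplus
# transform maps any real-valued raw_noise to a value above the lower bound.
import torch
from gpytorch.constraints import GreaterThan

constraint = GreaterThan(1e-4)            # default transform is softplus
raw_noise = torch.tensor([-3.0])          # unconstrained raw value
noise = constraint.transform(raw_noise)   # softplus(raw_noise) + 1e-4
manual = torch.nn.functional.softplus(raw_noise) + 1e-4
print(noise, manual)                      # both ~0.0487, i.e. always positive
```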
@saitcakmak
The reason the constraint is not enforced is this line in BoTorch: https://github.com/pytorch/botorch/blob/85364681a84ac2276b423fe575f33db1e1df311b/botorch/models/utils/gpytorch_modules.py#L68

The argument `transform=None` overrides the default gpytorch transformation, which disables the constraint. Removing this argument does fix the NaN issues in my problem, so it is a BoTorch-side issue.
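A quick sketch of the difference, using GPyTorch's `GreaterThan` directly (which, as far as I can tell, is what the linked BoTorch line constructs for the noise constraint):

```python
# Sketch: with the default transform a negative raw value maps to a positive
# one; with transform=None the raw value passes through unchanged.
import torch
from gpytorch.constraints import GreaterThan

raw = torch.tensor([-3.0])
print(GreaterThan(1e-4).transform(raw))                  # positive (softplus applied)
print(GreaterThan(1e-4, transform=None).transform(raw))  # tensor([-3.]), not constrained
```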
> We do see occasional line search failures during model fitting. If this is the root cause, it'd be great to see it fixed :)
I can attach a minimal working example later showing that removing this argument fixes the NaN issues. Then we can test whether this also resolves the occasional line search failures.
Ah, you're right. I always find this parameter / raw parameter setup confusing.
Going back through the code history, it looks like the noise constraint on the likelihood was added by @Balandat a very long time ago (https://github.com/pytorch/botorch/commit/5333d5b53f6f9d6fd32c5d844cf7d383b227d977), with `transform=None` from the start. I wonder if he remembers why it was set up this way.
https://github.com/pytorch/botorch/pull/2544 would remove `transform=None` across the board. I'll check with others to see if there are any concerns with removing this.
Going through the code history (https://github.com/pytorch/botorch/commit/5333d5b53f6f9d6fd32c5d844cf7d383b227d977), it looks like there were some discussions on this previously: https://github.com/cornellius-gp/gpytorch/issues/629
> Note that constraints that have a transform that is not `None` automatically enforces the constraint by using a transform. This can be an issue for quasi 2nd order optimizers though b/c the objective becomes flat when overshooting past the effective constraint in the line search. Hence not doing the transform and imposing an explicit constraint is preferred. It may also be beneficial to use the transform in conjunction with an explicit bound - will have to evaluate that more.
Based on the commit message, it seems that `transform=None` is intentional so that the landscape is nicer for L-BFGS-B. We then rely on L-BFGS-B (not plain L-BFGS) to handle the constraint explicitly. The catch is that we cannot use vanilla Adam/SGD anymore, since they do not handle the constraint explicitly. (I was using Adam to train the model, which is why I encountered the NaN issue.) So it might be a feature instead of a bug?
In any case, @Balandat probably has more comments on this.
Nevertheless, it is probably good to have consistency? Some modules have `transform=None` while others do not, e.g.,
https://github.com/pytorch/botorch/blob/85364681a84ac2276b423fe575f33db1e1df311b/botorch/models/utils/gpytorch_modules.py#L41-L50
> Based on the commit message, it seems that `transform=None` is intentional so that the landscape is nicer for L-BFGS-B. We then rely on L-BFGS-B (not plain L-BFGS) to handle the constraint explicitly. The catch is that we cannot use vanilla Adam/SGD anymore, since they do not handle the constraint explicitly. (I was using Adam to train the model, which is why I encountered the NaN issue.) So it might be a feature instead of a bug?
Yes, the idea was indeed that when using L-BFGS-B we want to impose explicit constraints on the parameters. The easiest way to do this is to not use a transform, in which case we can just constrain the "raw_x" value directly. If gpytorch applies a transformation to enforce the constraint, then you can end up with the issues mentioned above.
Now, this decision was made a long time ago, so we could potentially revisit it. If always using a transform works fine with L-BFGS-B, then great. We may also consider using an explicit constraint in addition to the transform, as mentioned above, but then we'd have to somehow apply the constraint in the transformed space in the optimizer. Yet another option could be to adaptively change the nature of the constraint (set the transform to `None` dynamically based on the optimizer used), but that seems somewhat challenging and also quite brittle.
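On the "explicit constraint in addition to the transform" option, a hedged sketch of how an explicit bound on the constrained value could be mapped into the raw (optimizer) space, assuming GPyTorch's `inverse_transform`:

```python
# Hedged sketch, assuming GreaterThan.inverse_transform behaves as the inverse
# of transform: an explicit bound on the constrained noise can be translated
# into a bound on raw_noise, which is what a bounded optimizer like L-BFGS-B
# would operate on.
import torch
from gpytorch.constraints import GreaterThan

constraint = GreaterThan(1e-4)               # softplus transform enabled
noise_bound = torch.tensor(1e-3)             # desired explicit lower bound on noise
raw_bound = constraint.inverse_transform(noise_bound)
print(raw_bound)                             # lower bound to impose on raw_noise
print(constraint.transform(raw_bound))       # ~1e-3, round-trips back to the noise bound
```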
I came across this again yesterday and looked into the GPyTorch constraints a bit more. Seems like we could disable the application of constraints by mocking a simple property:
```python
from unittest.mock import patch

with patch("gpytorch.constraints.constraints.Interval.enforced", False):
    ...
```
All constraints I could find subclass `Interval` and gatekeep constraint application behind `self.enforced`.
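A sketch of what that mock does in practice (assuming, per the above, that `transform()` is gated on `self.enforced`):

```python
# Sketch: patching Interval.enforced to False turns transform() into a
# pass-through, effectively disabling constraint application for all
# Interval subclasses.
from unittest.mock import patch

import torch
from gpytorch.constraints import GreaterThan

constraint = GreaterThan(1e-4)        # default softplus transform
raw = torch.tensor([-3.0])

print(constraint.transform(raw))      # positive value, constraint applied
with patch("gpytorch.constraints.constraints.Interval.enforced", False):
    print(constraint.transform(raw))  # tensor([-3.]), constraint skipped
```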
🐛 Bug
`botorch.models.utils.gpytorch_modules` implements a few utility functions that return kernels and likelihoods. Those functions should enforce constraints on `kernel.lengthscale` and `likelihood.noise` to make sure they are always positive. However, the constraints do not work in some cases.
To reproduce
The following is a minimal working example showing that the parameter constraint does not work properly.
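A rough sketch of such an example, assuming the `get_gaussian_likelihood_with_gamma_prior` helper from `botorch.models.utils.gpytorch_modules` and that the Gamma prior is registered on the noise module as `noise_prior`:

```python
# Rough sketch (not the original snippet), assuming the
# get_gaussian_likelihood_with_gamma_prior helper from
# botorch.models.utils.gpytorch_modules.
import torch
from botorch.models.utils.gpytorch_modules import get_gaussian_likelihood_with_gamma_prior

likelihood = get_gaussian_likelihood_with_gamma_prior()

# Mimic a gradient step that drives the raw noise negative.
with torch.no_grad():
    likelihood.noise_covar.raw_noise.fill_(-3.0)

# The noise constraint is built with transform=None, so no softplus is applied
# and the "constrained" noise is simply the raw value.
print(likelihood.noise)  # roughly tensor([-3.]) -- violates the 1e-4 lower bound

# Evaluating the Gamma noise prior at a negative value is ill-defined; during
# fitting this surfaces as a NaN log prior (or a validation error).
print(likelihood.noise_covar.noise_prior.log_prob(likelihood.noise))
```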
In the resulting output, the log prior is NaN, because the noise is outside the support of the Gamma distribution.
Expected Behavior
- `likelihood.noise` is supposed to be greater than `1e-4`.
- `log_prior` should not be NaN.
- The above two should hold in all cases, for any gradient values.
System information
Additional context
I am working on aepsych (which heavily relies on botorch), where we use similar outputscale/lengthscale priors. I was fitting a GP model on a synthetic dataset and had NaN issues during hyperparameter optimization (I was using Adam). But those NaN issues might break LBFGS as well, e.g., via line search failures.
Those NaN issues are due to the argument `transform=None`:
https://github.com/pytorch/botorch/blob/85364681a84ac2276b423fe575f33db1e1df311b/botorch/models/utils/gpytorch_modules.py#L63-L71
GPyTorch implements the constraint by a softplus transformation. However, if we override this with `transform=None`, then `self.enforced` is `False`. As a result, no transformation is applied and the constraint is not enforced:
https://github.com/cornellius-gp/gpytorch/blob/8825cdd7abd1db7dea5803265067d598f21d6962/gpytorch/constraints/constraints.py#L173-L175
In most cases, the prior pushes the hyperparameter towards the mode of the prior distribution. Thus, this NaN issue does not happen very often. However, I believe this could lead to unintended behavior, e.g., line search failure and early termination of LBFGS.