pytorch / botorch

Bayesian optimization in PyTorch
https://botorch.org/
MIT License

[Bug] GP posterior samples return NaN grad #567

Closed saitcakmak closed 3 years ago

saitcakmak commented 3 years ago

πŸ› Bug

With certain combinations of GP model and posterior sampling points, model.posterior(X).rsample(...) returns proper samples; however, the gradient w.r.t. X is NaN. The error does not get caught until much later, when the gradient is actually used for optimization and the updated solution becomes all NaN.

We traced this issue down to the root decomposition of the covariance matrix. model.posterior(X).mvn.lazy_covariance_matrix has a proper gradient; however, model.posterior(X).mvn.lazy_covariance_matrix.root_decomposition().root has a NaN gradient.

More testing shows that (at least for the example below) this does not happen when using pytorch 1.6.0, but does happen with the nightly release, e.g. 1.7.0.dev202. This points to an upstream issue; however, since the code is based on botorch / gpytorch, this seemed like a better place to raise it.

Update: The issue does not seem to be the PyTorch version. I am still getting NaN gradients with different GP / sampling point combinations using torch 1.6.0.

To reproduce

The data to construct the GP model and X: NaN_error.zip

Code snippet to reproduce

import torch
from botorch.models import SingleTaskGP
from botorch.models.transforms import Standardize
from gpytorch.lazy import delazify

data = torch.load("error_data.pt")
train_inputs = data["train_inputs"]
train_targets = data["train_targets"]
state_dict = data["state_dict"]
fant_X = data["fant_X"]
sampler = data["sampler"]
X = data["X"]
# works with torch 1.6.0, fails with torch nightly release - 1.7.0.dev202

model = SingleTaskGP(train_inputs, train_targets, outcome_transform=Standardize(m=1))
model.load_state_dict(state_dict)
fantasy_model = model.fantasize(fant_X, sampler)
posterior = fantasy_model.posterior(X)
samples = posterior.rsample(sample_shape=torch.Size([5]))
grad = torch.autograd.grad(torch.sum(samples), X, retain_graph=True)
# samples are generated, however, the grad is NaN
print("NaN samples: %s" % torch.any(torch.isnan(samples)))
print("NaN grad: %s" % torch.any(torch.isnan(grad[0])))

# we can trace it down to root decomposition
covar = posterior.mvn.lazy_covariance_matrix
print("NaN covariance grad: %s" % torch.any(torch.isnan(
    torch.autograd.grad(torch.sum(delazify(covar)), X, retain_graph=True)[0]
)))
root = covar.root_decomposition().root
print("NaN covariance root grad: %s" % torch.any(torch.isnan(
    torch.autograd.grad(torch.sum(delazify(root)), X, retain_graph=True)[0]
)))

Stack trace/error message

With torch 1.6.0, this prints False for all four print statements. With the torch nightly release, it prints False, True, False, True, meaning that the samples have a NaN gradient and that this is due to the root decomposition of the covariance matrix returning a NaN gradient.

Update: different GP / sampling combinations produce the same error with torch 1.6.0 as well.

Expected Behavior

The gradient should not be NaN.

System information

Please complete the following information:

- BoTorch: tested with both 0.3.1 and master
- GPyTorch: tested with both 1.2.0 and master
- PyTorch: works with 1.6.0, fails with the nightly release 1.7.0.dev202
- Python: tested with both 3.7 and 3.8
- Computer OS: tested with Ubuntu 16.04 and 20.04

cc @RaulAstudillo06

saitcakmak commented 3 years ago

Update: The issue does not seem to be the PyTorch version. I am still getting NaN gradients with different GP / sampling point combinations with torch 1.6.0.

Balandat commented 3 years ago

Thanks for flagging this. I'll try to take a look at this later today. In the meantime, do you have a sense whether the gradient should exist at the test points? I.e. it would not if there were repeated test points (in which case the posterior covariance would be singular). The other explanation would be that there is some numerical mayhem happening in the grad computation. Did you try running this in double dtype to see whether it reproduces there? Do you know whether this happens on both CPU and GPU?
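In case it helps, here is a minimal sketch of the double-precision check, reusing the repro data; it assumes the saved tensors can simply be cast and that the pickled sampler regenerates its base samples for the new dtype:

import torch
from botorch.models import SingleTaskGP
from botorch.models.transforms import Standardize

# Cast the saved data to double and re-run the same repro; if the NaN
# gradient disappears, the problem is numerical rather than structural.
data = torch.load("error_data.pt")
train_inputs = data["train_inputs"].to(torch.double)
train_targets = data["train_targets"].to(torch.double)
X = data["X"].detach().to(torch.double).requires_grad_(True)

model = SingleTaskGP(train_inputs, train_targets, outcome_transform=Standardize(m=1))
model.load_state_dict(data["state_dict"])
fantasy_model = model.fantasize(data["fant_X"].to(torch.double), data["sampler"])
posterior = fantasy_model.posterior(X)
samples = posterior.rsample(sample_shape=torch.Size([5]))
grad = torch.autograd.grad(torch.sum(samples), X)[0]
print("NaN grad: %s" % torch.any(torch.isnan(grad)))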

saitcakmak commented 3 years ago

Thanks for looking into this.

do you have a sense whether the gradient should exist at the test points

In the example above, and in some others where I observed this issue, the test points are a subset of the training points. However, the test points are all unique within any given batch. It is also worth noting that I have observed this issue when the test points were not a subset of the training points. In either scenario, I also have successful runs where the gradient is never NaN. Since the covariance matrix is not singular and has a gradient, I expect the gradient of the root matrix to also exist (it is possible that I am missing something here).

Did you try running this in double dtype to see whether it reproduces there? Do you know whether this happens on both CPU and GPU?

The example above works with double (i.e. no NaN), though I don't know if this generalizes. I get NaN when I run the example above on GPU as well.

FWIW, the experiments that produce this error are a few months old. Back then, I do not recall this error being a major issue. Any error I ran into back then would be resolved by restarting the experiment (using the stored GP train data). This is not the case with this NaN error, and the experiment raises the same error after restarting. So, it is possible that this was introduced sometime over the summer.

Balandat commented 3 years ago

I see. The fact that this works in double suggests that these may be numerical issues in the gradient computation, likely because of ill-conditioned matrices. I'll double check what exactly is going on, thanks for the nice repro.

Balandat commented 3 years ago

I finally got to take a look at this. It seems to be a numerical issue stemming from very large standardization factors in the Standardize OutcomeTransform that is part of the model.

Excerpt from the state dict:

 'outcome_transform.means': tensor([[2790.7327]], dtype=torch.float64),
 'outcome_transform.stdvs': tensor([[3894.9331]], dtype=torch.float64),
 'outcome_transform._stdvs_sq': tensor([[15170504.]], dtype=torch.float64)}

The mean and stdv are huge, so there is likely some overflow happening in the backward pass through the un-standardization while computing the posterior. If I scale these back to reasonable levels, then there aren't any gradient issues.

FWIW, the un-standardizing happens in https://github.com/pytorch/botorch/blob/72d81871f90a3baa206d1de836b01869e55fb24b/botorch/models/transforms/outcome.py#L275-L299. There are probably some ways we can improve the numerical stability of this; right now the implementation is not particularly sophisticated. But this should be improved after I get to finish up https://github.com/cornellius-gp/gpytorch/pull/1083
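As a toy illustration of the precision loss at this magnitude (the number just reuses the _stdvs_sq entry from the state dict above; this is not BoTorch code):

import torch

# At ~1.5e7 the gap between adjacent float32 values is about 1.0, so
# perturbations smaller than that are rounded away entirely, while
# float64 keeps them. The same loss hits the rescaled covariance entries
# that the backward pass through the root decomposition relies on.
x32 = torch.tensor(15170504.0, dtype=torch.float32)
x64 = torch.tensor(15170504.0, dtype=torch.float64)
print(x32 + 0.5 == x32)  # tensor(True): the 0.5 is lost in float32
print(x64 + 0.5 == x64)  # tensor(False): double still resolves it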

saitcakmak commented 3 years ago

Thanks for looking into this. It is nice to know the exact cause of the issue. That also explains why it doesn't happen when using dtype=torch.double.

The example above is from a modified Branin function (it's a wild one), which ranges from 0 to 75k at the extremes. I am guessing the overflow is due to 'outcome_transform._stdvs_sq': tensor([[15170504.]], dtype=torch.float64). This makes me wonder whether a simple scaling before Standardize() would help with this. If we had a simple transform Y = Y / scale where, e.g., scale = torch.pow(torch.tensor([10]), Y.mean().log10().floor()), then 'outcome_transform._stdvs_sq' would be upper bounded by 100, which may prevent such overflow.
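To put rough numbers on that, using the values from the state dict excerpt above (just a back-of-the-envelope check, not BoTorch code):

import torch

# With scale chosen as the nearest lower power of ten of the outcome
# mean, the squared standardization factor drops to a modest value.
mean = torch.tensor(2790.7327)
stdv = torch.tensor(3894.9331)
scale = torch.pow(torch.tensor(10.0), mean.log10().floor())
print(scale)                # tensor(1000.)
print((stdv / scale) ** 2)  # ~15.17 instead of ~1.5e7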

Balandat commented 3 years ago

Do you mean making that an actual OutcomeTransform, or doing this in an ad-hoc fashion before consuming the data? The latter is probably easiest and should be just fine if you're interested in finding the optimum and don't necessarily need the model to operate on unscaled outcomes. The built-in Standardize outcome transform is primarily there to make sure that the priors on the noise and the outputscale in the GP model are appropriate.

saitcakmak commented 3 years ago

I was thinking of this as a modification to Standardize, so that a person naively using Standardize across the board without any pre-processing (e.g. me before opening this issue) would not run into the same overflow issue. I guess it could also be implemented as a separate OutcomeTransform and used via ChainedOutcomeTransform, but this would require the user to explicitly specify it.

I took a look at the implementation of Standardize. It shouldn't be too much work to implement this. I could open a PR if you think this is a good fix.

saitcakmak commented 3 years ago

On second thought, I am not sure whether implementing such a scaling as part of Standardize would fix this. If the root_decomposition is applied to the untransformed matrix, then this may not have any effect. I don't know how it works internally with lazy tensors. I will run some tests and update this accordingly.

saitcakmak commented 3 years ago

As I suspected, modifying Standardize to implement such a scaling did not change the behavior. Scaling as a pre-processing step appears to work though.
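For reference, the pre-processing version looks roughly like this; it is only a sketch that refits the model from scratch on the rescaled outcomes (rather than loading the saved state dict) and assumes positive outcomes, as with the modified Branin example above:

import torch
from botorch.fit import fit_gpytorch_model
from botorch.models import SingleTaskGP
from botorch.models.transforms import Standardize
from gpytorch.mlls import ExactMarginalLogLikelihood

data = torch.load("error_data.pt")
train_inputs = data["train_inputs"]
train_targets = data["train_targets"]

# Divide the raw outcomes by a power of ten derived from their mean so
# that Standardize only ever sees values of moderate magnitude.
scale = torch.pow(torch.tensor(10.0), train_targets.mean().log10().floor())
model = SingleTaskGP(
    train_inputs,
    train_targets / scale,
    outcome_transform=Standardize(m=1),
)
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_model(mll)

# Posterior means / samples are then on the rescaled outcome scale;
# multiply by `scale` to recover the original units if needed.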

Balandat commented 3 years ago

Yeah, I'm not sure there is an elegant solution here: if the scaling factors are this large, we will run into this with float32. As I said, there may be a numerically more robust way to perform the scaling in the transform, but I'd like to get the alternative representation for the MTMVNs in first.

I'll close this issue for now.

ruhanaazam commented 10 months ago

As I suspected, modifying Standardize to implement such a scaling did not change the behavior. Scaling as a pre-processing step appears to work though.

I'm wondering if there have been any updates or easy workarounds to improve numerical stability?

I'm having a similar issue where the standardization factors are blowing up. Currently, I'm normalizing the inputs and applying Standardize as the outcome transform.

saitcakmak commented 10 months ago

Hi @RuhanaAzam. Can you open a new issue with a reproducible code example for us to investigate this? This issue has been closed for a bit too long for us to resurrect it here.