pytorch / botorch

Bayesian optimization in PyTorch
https://botorch.org/
MIT License

[Bug] Cholesky is NaN while fitting gpytorch model. #337

Closed saitcakmak closed 4 years ago

saitcakmak commented 4 years ago

🐛 Bug

While fitting a gpytorch model to some training data (using fit_gpytorch_model()), I get the runtime error shown in the stack trace below (see facebook/Ax#99 and #179 for related issues). Strangely, this happens when running the code on my Mac but not when running it on an Ubuntu machine, despite using the same seed, the exact same training data, and the same software versions. It is worth noting that I observe the same behavior across a range of input data, e.g. when using other botorch test functions. In almost every case, the code runs on the Ubuntu machine without any issues.

To reproduce

Here is the relevant code snippet. As mentioned above, I cannot reproduce this on my Ubuntu machine.


import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_model
from gpytorch.mlls import ExactMarginalLogLikelihood
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.constraints.constraints import GreaterThan
from gpytorch.priors.torch_priors import GammaPrior
from botorch.test_functions import Powell
from standardized_function import StandardizedFunction  # local module, see note below

# fix the seed for testing
torch.manual_seed(0)

# Initialize the test function
noise_std = 0.1  # observation noise level
function = StandardizedFunction(Powell(noise_std=noise_std))

d = function.dim  # dimension of train_X
dim_w = 1  # dimension of w component
n = 2 * d + 2  # training samples
dim_x = d - dim_w  # dimension of the x component
train_X = torch.rand((n, d))
train_Y = function(train_X)

"""
StandardizedFunction is a class that transforms the domain of the SyntheticTestFunction to the unit hypercube. Here is the exact training data, taken from the debugger. I verified that it is the same across the two machines as well.

train_X
tensor([[0.4963, 0.7682, 0.0885, 0.1320],
        [0.3074, 0.6341, 0.4901, 0.8964],
        [0.4556, 0.6323, 0.3489, 0.4017],
        [0.0223, 0.1689, 0.2939, 0.5185],
        [0.6977, 0.8000, 0.1610, 0.2823],
        [0.6816, 0.9152, 0.3971, 0.8742],
        [0.4194, 0.5529, 0.9527, 0.0362],
        [0.1852, 0.3734, 0.3051, 0.9320],
        [0.1759, 0.2698, 0.1507, 0.0317],
        [0.2081, 0.9298, 0.7231, 0.7423]])
train_Y
tensor([[ 9581.6172],
        [ 8215.9570],
        [  426.3773],
        [ 4815.8296],
        [ 7884.1069],
        [ 2833.5762],
        [ 6308.6538],
        [20651.4551],
        [  553.5855],
        [ 7070.3433]])
"""

# construct and fit the GP
# use a prior that places a meaningful lower bound on the noise; this significantly speeds up fitting
noise_prior = GammaPrior(1.1, 0.5)
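# the mode of a Gamma(concentration, rate) distribution is (concentration - 1) / rate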
noise_prior_mode = (noise_prior.concentration - 1) / noise_prior.rate
likelihood = GaussianLikelihood(
    noise_prior=noise_prior,
    batch_shape=[],
    noise_constraint=GreaterThan(
        0.05,  # minimum observation noise assumed in the GP model
        transform=None,
        initial_value=noise_prior_mode,
    ),
)
gp = SingleTaskGP(train_X, train_Y, likelihood)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_model(mll)
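
Note that standardized_function is a local module, not a published package. A minimal, hypothetical stand-in (an assumption for illustration, not the class actually used above) that rescales the unit hypercube to the test function's native bounds could look like:

import torch

class StandardizedFunction:
    """Hypothetical stand-in: evaluates a botorch SyntheticTestFunction on
    inputs rescaled from the unit hypercube to the function's own bounds
    (exposed as a 2 x d tensor on botorch test functions)."""

    def __init__(self, function):
        self.function = function
        self.dim = function.dim
        self.lb = function.bounds[0]  # lower bounds, shape (d,)
        self.ub = function.bounds[1]  # upper bounds, shape (d,)

    def __call__(self, X):
        # map X in [0, 1]^d onto [lb, ub] and evaluate the (noisy) function
        return self.function(self.lb + (self.ub - self.lb) * X).unsqueeze(-1)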

Stack trace/error message

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 24, in psd_safe_cholesky
    L = torch.cholesky(A, upper=upper, out=out)
RuntimeError: cholesky_cpu: U(2,2) is zero, singular U.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botorch/fit.py", line 101, in fit_gpytorch_model
    mll, _ = optimizer(mll, track_iterations=False, **kwargs)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botorch/optim/fit.py", line 225, in fit_gpytorch_scipy
    callback=cb,
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scipy/optimize/_minimize.py", line 600, in minimize
    callback=callback, **options)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scipy/optimize/lbfgsb.py", line 335, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scipy/optimize/lbfgsb.py", line 285, in func_and_grad
    f = fun(x, *args)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scipy/optimize/optimize.py", line 326, in function_wrapper
    return function(*(wrapper_args + args))
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scipy/optimize/optimize.py", line 64, in __call__
    fg = self.fun(x, *args)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botorch/optim/fit.py", line 283, in _scipy_objective_and_grad
    raise e  # pragma: nocover
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botorch/optim/fit.py", line 278, in _scipy_objective_and_grad
    loss = -mll(*args).sum()
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gpytorch/module.py", line 24, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 51, in forward
    res = output.log_prob(target)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gpytorch/distributions/multivariate_normal.py", line 107, in log_prob
    return super().log_prob(value)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/distributions/multivariate_normal.py", line 208, in log_prob
    M = _batch_mahalanobis(self._unbroadcasted_scale_tril, diff)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gpytorch/distributions/multivariate_normal.py", line 50, in _unbroadcasted_scale_tril
    ust = delazify(self.lazy_covariance_matrix.cholesky())
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 738, in cholesky
    res = self._cholesky()
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 34, in g
    add_to_cache(self, cache_name, method(self, *args, **kwargs))
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 413, in _cholesky
    cholesky = psd_safe_cholesky(evaluated_mat).contiguous()
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 30, in psd_safe_cholesky
    f"cholesky_cpu: {isnan.sum().item()} of {A.numel()} elements of the {A.shape} tensor are NaN."
gpytorch.utils.errors.NanError: cholesky_cpu: 90 of 100 elements of the torch.Size([10, 10]) tensor are NaN.

Expected Behavior

fit_gpytorch_model() fits the GP hyperparameters and the code continues execution, as it does on Ubuntu.

Balandat commented 4 years ago

Interesting. Your example works on my Mac with a conda setup.

To clarify: from the stack trace it seems you are running this on the gpytorch master branch (rather than the 0.3.6 release), is that correct? Just making sure the linux/mac comparison is actually apples to apples (the same holds for the botorch version, which should be master; the 0.1.4 release is quite old, and we're planning to put out a new one with a bunch of stability fixes soon).
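
For instance, printing the versions in each environment is a quick sanity check (a generic snippet; it assumes each package exposes __version__, which recent releases do):

import torch
import gpytorch
import botorch

# confirm that both machines import the same versions
print(torch.__version__, gpytorch.__version__, botorch.__version__)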

If it is running the exact same code, the fact that this works on linux but fails on your mac (though it runs fine on my mac) makes me wonder if somewhere deep in pytorch different linear algebra methods are being used. It appears you are using a homebrew python install here, which I'm not super familiar with; your code works fine on my conda install.

If you just pip install torch on mac, it doesn't properly link against MKL (at least that used to be the case), so maybe that's the issue you're seeing. To verify this, could you install anaconda (or miniconda) and then install botorch/gpytorch as follows:

conda install pytorch -c pytorch
conda install scipy
pip install --upgrade git+https://github.com/cornellius-gp/gpytorch.git
pip install --upgrade git+https://github.com/pytorch/botorch.git

where the pip should be the conda's pip (i.e. the pip of the active conda env).
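
As a quick way to check the linear algebra backend of a given PyTorch build (a generic diagnostic; torch.__config__.show() is available in recent PyTorch versions):

import torch

# True if this PyTorch build was compiled against Intel MKL
print(torch.backends.mkl.is_available())
# full build configuration, including the BLAS/LAPACK backend in use
print(torch.__config__.show())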

saitcakmak commented 4 years ago

Thank you Max! I set up a Conda environment and it seems to have resolved the issue. It probably was some issue with my previous pip installation, as you mentioned.

Balandat commented 4 years ago

Glad to hear. Though this makes me wonder if the correctness (rather than just performance) depends on the particular linear algebra implementation under the hood. I'll see if I can repro this issue when using the pip pytorch install.

saitcakmak commented 4 years ago

I actually observed some strange behavior after switching to the Conda environment, which relates to your thought about "correctness (rather than just performance) depending on the particular linear algebra implementation".

This behavior results from using condition_on_observations and fit_gpytorch_model to update and re-fit the GP model. Below are contour plots of the resulting GP's mean and covariance (code is given below). "Mac" here uses the Conda environment I set up following your comment above, and "Linux" uses a pip installation from the master branches of both botorch and gpytorch. As you can see, the fit obtained on the Mac does not really make sense.

[Attached contour plots: Mac initial fit (mac_start), Mac final fit (mac_final); Linux initial fit (linux_start), Linux final fit (linux_end).]

Code:

import torch
from torch import Tensor
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_model
from gpytorch.mlls import ExactMarginalLogLikelihood
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.constraints.constraints import GreaterThan
from gpytorch.priors.torch_priors import GammaPrior
import numpy as np
import matplotlib.pyplot as plt
from time import sleep

def contour_plotter(model):
    fig, ax = plt.subplots(ncols=2, figsize=(8, 4))
    # fig.tight_layout()
    ax[0].set_title("$\\mu_n$")
    ax[1].set_title("$\\Sigma_n$")
    for x in ax:
        x.scatter(model.train_inputs[0].numpy()[:, 0], model.train_inputs[0].numpy()[:, 1], marker='x', color='black')
        x.set_aspect('equal')
        x.set_xlabel("x")
        x.set_xlim(0, 1)
        x.set_ylim(0, 1)
    plt.show(block=False)

    # evaluate the posterior on a k x k grid over the unit square
    k = 100
    x = torch.linspace(0, 1, k)
    xx, yy = np.meshgrid(x, x)
    xy = torch.cat([Tensor(xx).unsqueeze(-1), Tensor(yy).unsqueeze(-1)], -1)
    posterior = model.posterior(xy)

    # plot the posterior mean
    means = posterior.mean.squeeze().detach().numpy()
    c = ax[0].contourf(xx, yy, means, alpha=0.8)
    plt.colorbar(c, ax=ax[0])

    # plot the posterior standard deviation
    stds = posterior.variance.pow(1 / 2).squeeze().detach().numpy()
    c = ax[1].contourf(xx, yy, stds, alpha=0.8)
    plt.colorbar(c, ax=ax[1])

    plt.show(block=False)
    plt.pause(0.01)

tx = Tensor([[0.4963, 0.7682],
             [0.0885, 0.1320],
             [0.3074, 0.6341],
             [0.4901, 0.8964],
             [0.4556, 0.6323],
             [0.3489, 0.4017],
             [0.4716, 0.0649],
             [0.4607, 0.1010],
             [0.8607, 0.8123],
             [0.8981, 0.0747],
             [0.6985, 0.2838],
             [0.4153, 0.2880]])

ty = Tensor([[77.3542],
             [136.5441],
             [27.0687],
             [112.9234],
             [43.0725],
             [19.5122],
             [10.5993],
             [10.6371],
             [123.6821],
             [4.9352],
             [26.4374],
             [13.3470]])

# fix the seed for testing
torch.manual_seed(0)

d = 2  # dimension of train_X
dim_w = 1  # dimension of w component
n = 2 * d + 2  # training samples
dim_x = d - dim_w  # dimension of the x component

# construct and fit the GP
# use a prior that places a meaningful lower bound on the noise; this significantly speeds up fitting
noise_prior = GammaPrior(1.1, 0.5)
noise_prior_mode = (noise_prior.concentration - 1) / noise_prior.rate
likelihood = GaussianLikelihood(
    noise_prior=noise_prior,
    batch_shape=[],
    noise_constraint=GreaterThan(
        0.05,  # minimum observation noise assumed in the GP model
        transform=None,
        initial_value=noise_prior_mode,
    ),
)
gp = SingleTaskGP(tx[0: n], ty[0: n], likelihood)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_model(mll)

contour_plotter(gp)
sleep(1)
plt.show()

for i in range(6):
    candidate_point = tx[n + i, :].reshape(1, -1)
    observation = ty[n + i, :].reshape(1, -1)
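    # condition_on_observations returns a new model that incorporates the new
    # point without refitting the hyperparameters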
    gp = gp.condition_on_observations(candidate_point, observation)
    # refit the model
    mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
    fit_gpytorch_model(mll)
    plt.close('all')
    contour_plotter(gp)
    sleep(1)

plt.show()

Balandat commented 4 years ago

Hmm, interesting. For whatever reason the contours don't plot on my setup - what version of matplotlib are you using?

saitcakmak commented 4 years ago

I am using Matplotlib 3.1.1 with TkAgg backend.

Balandat commented 4 years ago

Thanks, Tk did it. I can see this behavior on my conda setup as well; the reason seems to be that the inferred noise level is humongous after the first re-fit. Let me figure out why this could be happening here.
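
(For reference, the fitted noise level can be inspected after each refit, a minimal diagnostic assuming the GaussianLikelihood constructed above:)

# print the inferred observation noise after fitting
print(gp.likelihood.noise.item())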

Balandat commented 4 years ago

I think the Mac issue here is a red herring; this is really an issue with the assumptions behind the default priors on the outputscale. Specifically, the default priors assume that the inputs are normalized to [0,1]^d and that the outputs are standardized to zero mean and unit variance.

You can do that manually or use the new OutcomeTransform feature that I landed in #321/#327:

from botorch.models.transforms import Standardize
gp = SingleTaskGP(tx[0: n], ty[0: n], likelihood, outcome_transform=Standardize(m=1))

This works for me without other code changes.
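
For reference, standardizing manually would amount to something like the following minimal sketch (unlike the Standardize transform, it does not map posterior predictions back to the original scale):

# standardize outcomes to zero mean, unit variance before constructing the GP
Y = ty[0: n]
Y_stdized = (Y - Y.mean(dim=0)) / Y.std(dim=0)
gp = SingleTaskGP(tx[0: n], Y_stdized, likelihood)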

saitcakmak commented 4 years ago

That makes sense. I remember thinking about doing the transformation manually, but I couldn't find a good way of doing it for an arbitrary function. I looked through the diff, and the OutcomeTransform feature seems very useful, so thank you for that!

I am still curious why I could not reproduce this issue on Linux. Do you have any ideas why that might be? In case it gives any clues: I also don't run into this issue on Mac if I simply refit the GP from scratch with the updated training data (I'm guessing the difference is due to warm-starting).

Balandat commented 4 years ago

> In case it gives any clues: I also don't run into this issue on Mac if I simply refit the GP from scratch with the updated training data (I'm guessing the difference is due to warm-starting).

Do you mean that you don't use condition_on_observations and manually construct a new model instead? I haven't actually tested the resulting model much; we typically use these as temporary objects.

saitcakmak commented 4 years ago

Yes, exactly. Simply construct a new SingleTaskGP using the updated data, then use fit_gpytorch_model() to fit it.

So, in the above code, if I fit the GP using

gp = SingleTaskGP(tx[0: n], ty[0: n], likelihood)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_model(mll)

with n=6, then update it using

candidate_point = tx[n, :].reshape(1, -1)
observation = ty[n, :].reshape(1, -1)
gp = gp.condition_on_observations(candidate_point, observation)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_model(mll)

I get that weird fit. But if I simply do

gp = SingleTaskGP(tx[0: n], ty[0: n], likelihood)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_model(mll)

with n=7, I get a properly fit GP.

Balandat commented 4 years ago

Hmm, I see. The condition_on_observations functionality really exists to incorporate new data while avoiding refitting the model (i.e. when one expects the hyperparameters not to change much). So I don't think this is a huge practical issue, but I nevertheless don't see why it shouldn't work. Maybe there is some inadvertent graph-cutting going on; I'll try to take a closer look some time next week.

Balandat commented 4 years ago

Opened a new issue for this behavior: #356.