pytorch / botorch

Bayesian optimization in PyTorch
https://botorch.org/
MIT License

[Bug] Exaggerated Lengthscale #1745

Closed yyexela closed 1 year ago

yyexela commented 1 year ago

🐛 Bug

When training a simple, custom GPyTorch model, the fitted lengthscale of the kernel is sometimes incredibly small (below the order of 1e-100). The Cholesky decomposition then fails and the symeig fallback is used. Sometimes nothing breaks, but other times the decomposition yields extremely negative eigenvalues, which later causes a runtime error when optimizing the qNEI acquisition function because NaN values are produced. Adding a constraint on the lengthscale of the kernel resolves this, but instead I'm seeing that the lengthscale after optimization with fit_gpytorch_mll bounces back and forth between my bounds (1e-3 to 1e3) most of the time.
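The extremely negative eigenvalues can be checked directly by eigendecomposing the training covariance matrix after fitting. A minimal diagnostic sketch (using the model and init_X defined in the repro snippet below):

with torch.no_grad():
    # Dense training covariance matrix from the fitted kernel; for a valid
    # kernel matrix all eigenvalues should be non-negative (up to numerical error).
    K = model.covar_module(init_X).to_dense()
    print(torch.linalg.eigvalsh(K).min())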

I'm considering this a BoTorch bug since it only occurs when using fit_gpytorch_mll. When I use GPyTorch fitting only (via use_gpytorch_fit_only = True in my code below), the lengthscales are reasonable values.

So, I have two concerns with fit_gpytorch_mll:

1) Why is it that the lengthscale being found is sometimes so small? (it also gets very big but this doesn't cause issues)
2) Why, when I add the lengthscale constraint, does fit_gpytorch_mll bounce between my bounds?

In the below code, running with

use_lengthscale_constraint = False
use_custom_lengthscale_prior = False
use_gpytorch_fit_only = False

reproduces this error and

use_lengthscale_constraint = True
use_custom_lengthscale_prior = True
use_gpytorch_fit_only = False

fixes it.

Code snippet to reproduce

# Imports
import gpytorch
import torch
from botorch import fit_gpytorch_mll
from botorch.acquisition.monte_carlo import qNoisyExpectedImprovement
from botorch.models.gpytorch import GPyTorchModel
from botorch.sampling.normal import SobolQMCNormalSampler
from botorch.settings import debug as botorch_debug
from gpytorch.constraints import Interval
from botorch.optim import optimize_acqf
import numpy as np
botorch_debug._set_state(True)

# Set default data type to avoid numerical issues
torch.set_default_dtype(torch.double)

# Set the dimension we're working with
dim = 3

# Set the bounds for each dimension
lb, ub = -2., 2.

# Noise added to outputs of Rosenbrock
noise = 0.05

# Options
use_lengthscale_constraint = False
use_custom_lengthscale_prior = False
use_gpytorch_fit_only = False
seed = 0

# Set up seed
torch.manual_seed(seed)
np.random.seed(seed)
print(f"Set seed {seed}\n")

# GP model we're using
class ExactGPModel(GPyTorchModel, gpytorch.models.ExactGP):
    def __init__(self, train_X, train_Y):
        likelihood = gpytorch.likelihoods.GaussianLikelihood()
        super().__init__(train_X, train_Y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        if use_lengthscale_constraint:
            self.base_kernel = gpytorch.kernels.RBFKernel(
                lengthscale_constraint=Interval(1e-3, 1e3)
            )
        else:
            self.base_kernel = gpytorch.kernels.RBFKernel()
        self.covar_module = gpytorch.kernels.ScaleKernel(self.base_kernel)
        if use_custom_lengthscale_prior:
            # Note: despite the flag name, this initializes the lengthscale to 1.0
            # rather than placing a prior on it.
            self.covar_module.base_kernel.initialize(lengthscale=1.0)
        self._num_outputs = 1

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# Initialize the model based on input data
def initialize_model(train_X, train_Y):
    model = ExactGPModel(train_X, train_Y) # Define GP model
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(model.likelihood, model) # Define our MLL
    return mll, model

# Train the model
def train_model(model, mll, train_X, train_Y):
    print("\nLengthscale before: ", model.covar_module.base_kernel.lengthscale.to_dense())
    print("Noise before:       ", model.likelihood.noise.item())
    if use_gpytorch_fit_only:
        # GPyTorch fitting
        fit_gpytorch(model, mll, train_X, train_Y)
    else:
        try:
            # BoTorch fitting
            fit_gpytorch_mll(mll, max_attempts=100)
        except Exception:
            # GPyTorch fitting
            print("train_model: Failed to fit model with fit_gpytorch_mll, attempting GPyTorch fitting.")
            fit_gpytorch(model, mll, train_X, train_Y)
    print("\nLengthscale after: ", model.covar_module.base_kernel.lengthscale.to_dense())
    print("Noise after:       ", model.likelihood.noise.item())
    return None

# Fit model using GPyTorch tutorial method
def fit_gpytorch(model, mll, train_X, train_Y):
    # Number of initial training steps
    training_iter = 500

    # Find optimal model hyperparameters in training
    model.train()
    model.likelihood.train()

    # Use the adam optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # Includes GaussianLikelihood parameters

    for _ in range(training_iter):
        # Zero gradients from previous iteration
        optimizer.zero_grad()
        # Output from model
        output = model(train_X)
        # Calc loss and backprop gradients
        loss = -mll(output, train_Y)
        loss.backward()
        optimizer.step()

    # Set model back to evaluation mode
    model.eval()
    model.likelihood.eval()
    return None

# Generate n initial data points from Rosenbrock randomly over the search space
def get_initial_data(n):
    # Get points to evaluate at
    vecs = generate_search_space(n)

    # Evaluate Rosenbrock function
    objective_vals = torch.empty(n)
    derivative_vals = torch.empty((n, dim))
    for i in range(n):
        vec = vecs[i]
        obj, deriv = Rosenbrock(vec)
        objective_vals[i] = obj
        derivative_vals[i] = deriv

    return vecs, objective_vals, derivative_vals

# Generate a grid of initial points
def generate_search_space(n):
    vecs = torch.rand((n,dim))*(ub-lb)+lb
    return vecs

# Rosenbrock function multiplied by -1 to make this a maximization problem (minimum is 0 at 1 for each coordinate)
# Note: This only evaluates a single X at a time (not a vectorized function)
def Rosenbrock(X):
    # Cast to calculate gradient
    X.requires_grad_(True)

    # Get dimensions
    N = X.shape[0]

    # Vector containing individual values in summation
    cur_sum = torch.zeros(N-1)

    # Evaluate the generalized N-dimensional Rosenbrock function
    for i in range(0,N-1):
        cur_sum[i] = 100*(X[i+1]-X[i]**2)**2+(1-X[i])**2

    # Compute sum
    Y = torch.sum(cur_sum)

    # Make this a maximization problem!
    Y = -1*Y

    # Calculate gradient
    Y.backward()
    grads = X.grad

    # Add noise
    grads += torch.normal(0,1,grads.shape)*noise
    Y += torch.normal(0,1,Y.shape)*noise

    # Convert to doubles
    grads = grads.to(torch.double)
    Y = Y.to(torch.double)

    # Return function value and the gradient
    return Y.detach(), grads

# Gets maximum of acquisition function
def get_candidate(model, train_X):
    #scal_transf = ScalarizedPosteriorTransform(weights=self.scale_tensor)

    # Define qNEI acquisition function
    # I got the sampler from here:
    #   https://botorch.org/api/_modules/botorch/acquisition/monte_carlo.html#qNoisyExpectedImprovement
    sampler = SobolQMCNormalSampler(sample_shape=torch.Size([1024]))
    qNEI = qNoisyExpectedImprovement(model,
                train_X,
                sampler,
                #posterior_transform=scal_transf,
            )

    # Set bounds for optimization
    bounds = torch.tensor([[lb]*dim, [ub]*dim])

    # Get candidate values
    candidates, _ = optimize_acqf(
        acq_function=qNEI,
        bounds=bounds,
        q=1,
        num_restarts=100,
        raw_samples=512,  # used for initialization heuristic
        options={"batch_limit": 1, "maxiter": 1000},
    )

    return candidates

def run_BO_loop():
    # Get initial data
    init_X, init_Y, _ = get_initial_data(5)
    print("Initial X:")
    print(init_X)
    print("Initial Y:")
    print(init_Y)
    print()

    # Run BO loop
    BO_iters = 100
    for i in range(BO_iters):
        print(f"<------ BO iteration {i+1} of {BO_iters} ------>")
        mll, model = initialize_model(init_X, init_Y)
        train_model(model, mll, init_X, init_Y)
        cd = get_candidate(model, init_X)
        print("\nCandidate is:")
        print(cd)

        # Evaluate the objective and add it to the list of data for the model
        Y1, Y2 = Rosenbrock(cd[0])
        Y = torch.cat([Y1.unsqueeze(0),Y2.squeeze()])
        Y = Y[0]
        print("\nObjective is:")
        print(Y)

        # Add to training data
        init_X = torch.vstack((init_X, cd))
        init_Y = torch.hstack((init_Y, Y))

        print()

run_BO_loop()

Stack trace/error message (Note: the failure depends on the seed and the package versions used; after updating all my packages it failed on iteration 49, so try a different seed if you don't encounter the error.)

<------ BO iteration 9 of 100 ------>

Lengthscale before:  tensor([[0.6931]], grad_fn=<SoftplusBackward0>)
Noise before:        0.6932471805574191

Lengthscale after:  tensor([[1.6933e-117]], grad_fn=<SoftplusBackward0>)
Noise after:        97250.44851795433
/home/alexey/.local/lib/python3.10/site-packages/linear_operator/utils/cholesky.py:40: NumericalWarning: A not p.d., added jitter of 1.0e-08 to the diagonal
  warnings.warn(
/home/alexey/.local/lib/python3.10/site-packages/linear_operator/utils/cholesky.py:40: NumericalWarning: A not p.d., added jitter of 1.0e-07 to the diagonal
  warnings.warn(
/home/alexey/.local/lib/python3.10/site-packages/linear_operator/utils/cholesky.py:40: NumericalWarning: A not p.d., added jitter of 1.0e-06 to the diagonal
  warnings.warn(
/home/alexey/.local/lib/python3.10/site-packages/linear_operator/utils/cholesky.py:40: NumericalWarning: A not p.d., added jitter of 1.0e-05 to the diagonal
  warnings.warn(
/home/alexey/.local/lib/python3.10/site-packages/linear_operator/utils/cholesky.py:40: NumericalWarning: A not p.d., added jitter of 1.0e-04 to the diagonal
  warnings.warn(
/home/alexey/.local/lib/python3.10/site-packages/linear_operator/utils/cholesky.py:40: NumericalWarning: A not p.d., added jitter of 1.0e-03 to the diagonal
  warnings.warn(
/home/alexey/.local/lib/python3.10/site-packages/linear_operator/operators/_linear_operator.py:2032: NumericalWarning: Runtime Error when computing Cholesky decomposition: Matrix not positive definite after repeatedly adding jitter up to 1.0e-03.. Using symeig method.
  warnings.warn(
Traceback (most recent call last):
  File "/home/alexey/School/Research Project/Code/BOTorchDerivative/qNEI/Rosenbrock/bug.py", line 227, in <module>
    run_BO_loop()
  File "/home/alexey/School/Research Project/Code/BOTorchDerivative/qNEI/Rosenbrock/bug.py", line 210, in run_BO_loop
    cd = get_candidate(model, init_X)
  File "/home/alexey/School/Research Project/Code/BOTorchDerivative/qNEI/Rosenbrock/bug.py", line 184, in get_candidate
    candidates, _ = optimize_acqf(
  File "/home/alexey/.conda/envs/thesis/lib/python3.10/site-packages/botorch/optim/optimize.py", line 305, in optimize_acqf
    batch_candidates, batch_acq_values, ws = _optimize_batch_candidates(timeout_sec)
  File "/home/alexey/.conda/envs/thesis/lib/python3.10/site-packages/botorch/optim/optimize.py", line 293, in _optimize_batch_candidates
    batch_candidates_curr, batch_acq_values_curr = gen_candidates_scipy(
  File "/home/alexey/.conda/envs/thesis/lib/python3.10/site-packages/botorch/generation/gen.py", line 230, in gen_candidates_scipy
    res = minimize_with_timeout(
  File "/home/alexey/.conda/envs/thesis/lib/python3.10/site-packages/botorch/optim/utils/timeout.py", line 82, in minimize_with_timeout
    return optimize.minimize(
  File "/home/alexey/.local/lib/python3.10/site-packages/scipy/optimize/_minimize.py", line 623, in minimize
    return _minimize_lbfgsb(fun, x0, args, jac, bounds,
  File "/home/alexey/.local/lib/python3.10/site-packages/scipy/optimize/lbfgsb.py", line 306, in _minimize_lbfgsb
    sf = _prepare_scalar_function(fun, x0, jac=jac, args=args, epsilon=eps,
  File "/home/alexey/.local/lib/python3.10/site-packages/scipy/optimize/optimize.py", line 261, in _prepare_scalar_function
    sf = ScalarFunction(fun, x0, args, grad, hess,
  File "/home/alexey/.local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 140, in __init__
    self._update_fun()
  File "/home/alexey/.local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 233, in _update_fun
    self._update_fun_impl()
  File "/home/alexey/.local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 137, in update_fun
    self.f = fun_wrapped(self.x)
  File "/home/alexey/.local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 134, in fun_wrapped
    return fun(np.copy(x), *args)
  File "/home/alexey/.local/lib/python3.10/site-packages/scipy/optimize/optimize.py", line 74, in __call__
    self._compute_if_needed(x, *args)
  File "/home/alexey/.local/lib/python3.10/site-packages/scipy/optimize/optimize.py", line 68, in _compute_if_needed
    fg = self.fun(x, *args)
  File "/home/alexey/.conda/envs/thesis/lib/python3.10/site-packages/botorch/generation/gen.py", line 199, in f_np_wrapper
    raise RuntimeError(msg)
RuntimeError: 3 elements of the 3 element gradient array `gradf` are NaN. This often indicates numerical issues.

Expected Behavior

I expect to be able to obtain a candidate value from the qNEI acquisition function and continue iterating through my BO loop. Something like this:

<------ BO iteration 9 of 100 ------>

Lengthscale before:  tensor([[0.6931]], grad_fn=<SoftplusBackward0>)
Noise before:        0.6932471805574191

Lengthscale after:  tensor([[1.6933e-117]], grad_fn=<SoftplusBackward0>)
Noise after:        97250.44851795433

Candidate is:
tensor([[-0.4615,  1.4989,  0.6497]])

Objective is:
tensor(-422.8845)

System information

Please complete the following information:

>>> print(botorch.__version__)
0.8.2
>>> print(gpytorch.__version__)
1.9.1
>>> print(torch.__version__)
2.0.0
$ uname -r
6.1.8-arch1-1

Additional context

I ran into this issue when comparing derivative-enabled GPs with non-derivative-enabled ones. The derivative-enabled GP doesn't run into the NaN issue even though its lengthscales are sometimes exaggerated as well.

Also, see here for a relevant TODO I found while debugging the covariance matrix and seeing a very negative eigenvalue for what should be at minimum a positive semi-definite matrix.

Balandat commented 1 year ago

Why is it that the lengthscale being found is sometimes so small? (it also gets very big but this doesn't cause issues)

Good question. A fundamental difference between fit_gpytorch_mll and the standard GPyTorch fitting loop is that fit_gpytorch_mll by default uses L-BFGS to fit the model. So it's not a standard first-order gradient descent method that, with a limited learning rate, behaves rather docilely - instead it can run off to very large (and numerically problematic) values, especially since there aren't priors and/or constraints on many of the model parameters. We'd have to take a closer look at what exactly is going on there (not ruling out a bug in your [or our] code, but we haven't looked through this in complete detail).
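For reference, one way to keep the hyperparameters in a reasonable range is to attach priors to the kernel, along the lines of what BoTorch's SingleTaskGP does by default. A minimal sketch (assuming GammaPrior parameters matching those defaults):

from gpytorch.kernels import RBFKernel, ScaleKernel
from gpytorch.priors import GammaPrior

# RBF kernel with priors similar to BoTorch's SingleTaskGP defaults:
# GammaPrior(3.0, 6.0) on the lengthscale, GammaPrior(2.0, 0.15) on the outputscale.
covar_module = ScaleKernel(
    RBFKernel(lengthscale_prior=GammaPrior(3.0, 6.0)),
    outputscale_prior=GammaPrior(2.0, 0.15),
)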

Why, when I add the lengthscale constraint, does fit_gpytorch_mll bounce between my bounds?

This is intriguing. It could be that the MLL is multi-modal and both of the extreme value solutions can (in conjunction with the other parameters) explain the data well (or at least better than other solutions on the interior).
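One way to check this is to profile the exact marginal log likelihood as a function of the lengthscale with the other hyperparameters held fixed. A minimal sketch (assuming the mll and model objects from the repro script above):

# Evaluate the MLL on a log-spaced grid of lengthscales, keeping all other
# hyperparameters at their current (fitted) values.
model.train()
mll_values = []
lengthscales = torch.logspace(-3, 3, 200)
with torch.no_grad():
    for ls in lengthscales:
        model.covar_module.base_kernel.lengthscale = ls
        output = model(*model.train_inputs)
        mll_values.append(mll(output, model.train_targets).item())
# Multiple peaks (e.g. near both bounds) would be consistent with the bouncing behavior.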

@j-wilson, @dme65 any other thoughts / intuition on this?

j-wilson commented 1 year ago

@yyexela I don't see an obvious reason why this is happening. It may simply be a model initialization issue (as Max suggested). I recommend visualizing 1-2 dimensional loss landscapes to help diagnose. The problem seems to go away upon standardizing the outcome values in run_BO_loop as:

for i in range(BO_iters):
    print(f"<------ BO iteration {i+1} of {BO_iters} ------>")
    train_Y = (init_Y - init_Y.mean())/init_Y.std()
    mll, model = initialize_model(init_X, train_Y)
    train_model(model, mll, init_X, train_Y)
    cd = get_candidate(model, init_X)
    ... 
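If you were using one of BoTorch's built-in models such as SingleTaskGP instead of the custom model, the same effect is typically achieved with the Standardize outcome transform. A minimal sketch:

from botorch.models import SingleTaskGP
from botorch.models.transforms.outcome import Standardize

# SingleTaskGP standardizes train_Y internally via the outcome transform and
# un-transforms the posterior automatically.
model = SingleTaskGP(
    init_X,
    init_Y.unsqueeze(-1),  # SingleTaskGP expects an explicit output dimension
    outcome_transform=Standardize(m=1),
)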

p.s. You may also want to consider using ARD lengthscales.
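A minimal sketch of the ARD variant for the 3-dimensional model above (one lengthscale per input dimension):

# ARD: a separate lengthscale per input dimension instead of a single shared one.
covar_module = gpytorch.kernels.ScaleKernel(
    gpytorch.kernels.RBFKernel(ard_num_dims=dim)  # dim = 3 in the repro script
)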

yyexela commented 1 year ago

Thanks @j-wilson and @Balandat! Standardizing my training data fixed the issue.