pytorch / botorch

Bayesian optimization in PyTorch
https://botorch.org/
MIT License

Numerical issue with cholesky decomposition (even with normalization) #179

Closed michaelyli closed 5 years ago

michaelyli commented 5 years ago

Issue description

I am consistently running into numerical issues when running fit_gpytorch_model(). I am normalizing the inputs and standardizing the outputs (as described in issue #160).

Code example

fit_gpytorch_model(mll)
EI = ExpectedImprovement(gp, best_f=0.1)

## optimize acquisition function
candidates = joint_optimize(
    acq_function=EI,
    q = 1, 
    bounds = bounds,
    num_restarts=10,
    raw_samples=500,  # used for initialization heuristic
)

new_x = candidates.detach()
exact_obj = neg_eggholder(new_x)

train_x_ei = torch.cat([train_x_ei, candidates])
train_y_ei = torch.cat([train_y_ei, exact_obj])

gp = SingleTaskGP(
    normalize(train_x_ei, bounds=bounds), 
    standardize(train_y_ei)
    )
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)

Here is the error message:

RuntimeError Traceback (most recent call last)

<ipython-input> in <module>
      2 
      3 
----> 4 fit_gpytorch_model(mll)
      5 EI = ExpectedImprovement(gp, best_f=0.1)
      6 

C:\Anaconda3\envs\py36_new\lib\site-packages\botorch\fit.py in fit_gpytorch_model(mll, optimizer, **kwargs)
---> 35     mll, _ = optimizer(mll, track_iterations=False, **kwargs)

C:\Anaconda3\envs\py36_new\lib\site-packages\botorch\optim\fit.py in fit_gpytorch_scipy(mll, bounds, method, options, track_iterations)
--> 188         callback=cb,

C:\Anaconda3\envs\py36_new\lib\site-packages\scipy\optimize\_minimize.py in minimize(fun, x0, args, method, jac, hess, hessp, bounds, constraints, tol, callback, options)
--> 601         return _minimize_lbfgsb(fun, x0, args, jac, bounds, callback=callback, **options)

C:\Anaconda3\envs\py36_new\lib\site-packages\scipy\optimize\lbfgsb.py in _minimize_lbfgsb(fun, x0, args, jac, bounds, disp, maxcor, ftol, gtol, eps, maxfun, maxiter, iprint, callback, maxls, **unknown_options)
--> 335         f, g = func_and_grad(x)

C:\Anaconda3\envs\py36_new\lib\site-packages\scipy\optimize\lbfgsb.py in func_and_grad(x)
--> 285         f = fun(x, *args)

C:\Anaconda3\envs\py36_new\lib\site-packages\scipy\optimize\optimize.py in function_wrapper(*wrapper_args)
--> 300         return function(*(wrapper_args + args))

C:\Anaconda3\envs\py36_new\lib\site-packages\scipy\optimize\optimize.py in __call__(self, x, *args)
---> 63         fg = self.fun(x, *args)

C:\Anaconda3\envs\py36_new\lib\site-packages\botorch\optim\fit.py in _scipy_objective_and_grad(x, mll, property_dict)
--> 223         loss = -mll(*args).sum()

C:\Anaconda3\envs\py36_new\lib\site-packages\gpytorch\module.py in __call__(self, *inputs, **kwargs)
---> 22         outputs = self.forward(*inputs, **kwargs)

C:\Anaconda3\envs\py36_new\lib\site-packages\gpytorch\mlls\exact_marginal_log_likelihood.py in forward(self, output, target, *params)
---> 28         res = output.log_prob(target)

C:\Anaconda3\envs\py36_new\lib\site-packages\gpytorch\distributions\multivariate_normal.py in log_prob(self, value)
--> 129         inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)

C:\Anaconda3\envs\py36_new\lib\site-packages\gpytorch\lazy\lazy_tensor.py in inv_quad_logdet(self, inv_quad_rhs, logdet, reduce_inv_quad)
--> 992         cholesky = CholLazyTensor(self.cholesky())

C:\Anaconda3\envs\py36_new\lib\site-packages\gpytorch\lazy\lazy_tensor.py in cholesky(self, upper)
--> 718         res = self._cholesky()

C:\Anaconda3\envs\py36_new\lib\site-packages\gpytorch\utils\memoize.py in g(self, *args, **kwargs)
---> 34         add_to_cache(self, cache_name, method(self, *args, **kwargs))

C:\Anaconda3\envs\py36_new\lib\site-packages\gpytorch\lazy\lazy_tensor.py in _cholesky(self)
--> 403         cholesky = psd_safe_cholesky(evaluated_mat.double()).to(self.dtype)

C:\Anaconda3\envs\py36_new\lib\site-packages\gpytorch\utils\cholesky.py in psd_safe_cholesky(A, upper, out, jitter)
---> 47         raise e

C:\Anaconda3\envs\py36_new\lib\site-packages\gpytorch\utils\cholesky.py in psd_safe_cholesky(A, upper, out, jitter)
---> 21         L = torch.cholesky(A, upper=upper, out=out)

RuntimeError: cholesky_cpu: U(2,2) is zero, singular U.

System Info

BoTorch 0.1.0
GPyTorch 0.3.2
Torch 1.1.0
Windows 10
Balandat commented 5 years ago

Hi, thanks for flagging this. We’re aware of this issue and are actively working on improving the robustness of the fitting. The underlying cause is that the line search in the L-BFGS algorithm we use by default may, in some situations, end up taking very large steps, which in turn causes numerical issues in the solves in the underlying gpytorch model.

As a band-aid, for now you can either just catch this exception and refit from a different initialization, or use an optimizer from torch.optim to fit the model. We hope to get a proper solution for this in soon.
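A minimal sketch of these two band-aids (train_x and train_y stand in for the normalized/standardized training data above; this is a workaround, not the eventual built-in fix):

from botorch.fit import fit_gpytorch_model
from botorch.models import SingleTaskGP
from botorch.optim.fit import fit_gpytorch_torch
from gpytorch.mlls import ExactMarginalLogLikelihood

try:
    # default path: scipy L-BFGS-B, which occasionally hits the Cholesky error
    fit_gpytorch_model(mll)
except RuntimeError:
    # band-aid 1: rebuild the model so fitting restarts from its initial hyperparameters
    # (optionally tweak those initial values, as discussed further down this thread)
    gp = SingleTaskGP(train_x, train_y)
    mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
    # band-aid 2: fit with a torch.optim-based optimizer, which performs no line search
    fit_gpytorch_model(mll, optimizer=fit_gpytorch_torch)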

eytan commented 5 years ago

cc @jacobrgardner. We will see what we can do to temper the optimizer, but I also wonder if we can clamp some of the GP hyperparameters.

michaelyli commented 5 years ago

Hey, thanks! Could you explain what you mean by these two band-aids: "refit from a different initialization" and "use an optimizer from torch.optim to fit the model"?

Sorry, I'm new to BoTorch!

jacobrgardner commented 5 years ago

@eytan Given that you have the machinery to use scipy optimizers in place already, maybe consider using L-BFGS-B instead of L-BFGS by default in fit? I'm assuming the line search is stepping to pretty extreme, unreasonable hyperparameter settings, so even fairly loose box constraints with L-BFGS-B would probably solve the problem entirely.

edit: If we're already using box constraints by default, maybe they could just be made tighter by default? The only real way the kernel would become not positive definite on normalized data is either (1) the noise becoming too low or (2) the lengthscale becoming too large (or both).

Balandat commented 5 years ago

No worries @youngapsmikes, I was on mobile just then, hence the short reply.

refit from a different initialization

When instantiating the SingleTaskGP, the hyperparameters get initialized to some initial values. These values affect the path the optimizer takes, and the optimization may work fine when starting from a different set of initial conditions. You can see this explicitly here, where the noise parameter of the likelihood is explicitly initialized to the mode of its prior. The other parameters are, by gpytorch's default, initialized to zero in the untransformed space, which in the real space means a value of softplus(0.0) (this isn't exactly obvious, but if you trace down the instantiation of, say, the kernel here, you will eventually find that in the gpytorch code). To change the parameters, you can call initialize on the respective module. This requires some understanding of the parameter transformations though, so it may not be the easiest starting point. There is an open issue on the gpytorch side for making it easier to sample from the parameter priors, which would simplify this significantly.
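For illustration, a minimal sketch of overriding the initial values before fitting via the hyperparameter setters (the numbers are arbitrary, and this assumes the default SingleTaskGP covariance module, i.e. a ScaleKernel wrapping a Matern kernel):

from botorch.fit import fit_gpytorch_model
from botorch.models import SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood

gp = SingleTaskGP(train_x, train_y)
gp.likelihood.noise = 1e-2                     # observation noise (value in the transformed/"real" space)
gp.covar_module.base_kernel.lengthscale = 0.5  # kernel lengthscale(s)
gp.covar_module.outputscale = 1.0              # ScaleKernel output scale
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_model(mll)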

use an optimizer from torch.optim to fit the model

By default fit_gpytorch_model will use the scipy L-BFGS-B optimizer to maximize the MLL of the model (with the occasional numerical issues documented above). Instead, you can use a different optimization algorithm that doesn't perform a line search and hence should be more robust.

You can do this in a convenient way by passing the fit_gpytorch_torch callable into fit_gpytorch_model like so:

from botorch.optim.fit import fit_gpytorch_torch
...
 fit_gpytorch_model(mll, optimizer=fit_gpytorch_torch, **kwargs)

where kwargs are keyword arguments that will be passed into fit_gpytorch_torch internally. Note though that the check_convergence function used internally is very rudimentary at this point and basically just respects the maxiter kwarg, which specifies the number of optimization steps.

You can also do this fully manually by following this tutorial. You'll have to adjust the learning rate and the number of iterations so that the loss converges.
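For reference, a minimal sketch of that manual route with torch.optim.Adam, loosely following the tutorial (train_x and train_y are placeholders; the learning rate and iteration count are arbitrary and will need tuning):

import torch
from botorch.models import SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood

gp = SingleTaskGP(train_x, train_y)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
mll.train()

optimizer = torch.optim.Adam(gp.parameters(), lr=0.1)
for i in range(150):
    optimizer.zero_grad()
    output = gp(*gp.train_inputs)              # model output at the training inputs
    loss = -mll(output, gp.train_targets)      # negative marginal log likelihood
    loss.backward()
    optimizer.step()
mll.eval()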

Balandat commented 5 years ago

Given that you have the machinery to use scipy optimizers in place already, maybe consider using L-BFGS-B instead of L-BFGS by default in fit? I'm assuming the line search is stepping to pretty extreme, unreasonable hyperparameter settings, so even fairly loose box constraints with L-BFGS-B would probably solve the problem entirely.

@jacobrgardner, we are already using L-BFGS-B by default, but don't impose explicit bounds for all hyperparameters in our packaged models. We should probably change this. There is a slight inconvenience in the gpytorch constraints setup, where interval (incl. one-sided) constraints can either be enforced through a transform, or be "not enforced", in which case we can easily pass the bound into L-BFGS-B. But that means L-BFGS-B works in the raw parameter space, which may not be exactly what one wants - I have some anecdotal evidence that using a transform also helps with robustness (in addition to having an explicit bound in the raw space). Do you have any thoughts on what an interface tweak could look like that would allow doing both? That is, (i) using a transform on the raw parameters, and (ii) providing a bound to be honored by L-BFGS-B (ideally with the ability to set/read it in both the raw and the transformed space).

sdaulton commented 5 years ago

As @jacobrgardner noted,

The only real way the kernel would become not positive definite on normalized data is either (1) the noise becoming too low or (2) the lengthscale becoming too large (or both).

Anecdotally, I have found that the band-aid approach of increasing MIN_INFERRED_NOISE_LEVEL (https://github.com/pytorch/botorch/blob/master/botorch/models/gp_regression.py#L35) to 1e-5 or 1e-4 typically increases stability. @youngapsmikes, one way you can adjust the noise constraint for the SingleTaskGP is by using a custom likelihood with a tighter noise_constraint and passing that likelihood to the SingleTaskGP. E.g.

from botorch.models import SingleTaskGP
from gpytorch.constraints import GreaterThan
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.priors import GammaPrior

noise_prior = GammaPrior(1.1, 0.05)
noise_prior_mode = (noise_prior.concentration - 1) / noise_prior.rate
MIN_INFERRED_NOISE_LEVEL = 1e-5
likelihood = GaussianLikelihood(
    noise_prior=noise_prior,
    noise_constraint=GreaterThan(
        MIN_INFERRED_NOISE_LEVEL,
        transform=None,
        initial_value=noise_prior_mode,
    ),
)
model = SingleTaskGP(train_X=train_X, train_Y=train_Y, likelihood=likelihood)

Balandat commented 5 years ago

FWIW, we increased the default MIN_INFERRED_NOISE_LEVEL to 1e-4 in e2c64fef1e76d526a33951c5eb75ac38d5581257 (which is also the gpytorch default).

We are also working on adding error handling to the fitting procedure.

michaelyli commented 5 years ago

@Balandat Thanks! One quick follow-up so I can make sure I understand the numerical issues and the proposed changes. When optimizing the hyperparameters, the line search sometimes takes large step sizes, which can cause convergence issues; however, playing around with the initial values of the hyperparameters may circumvent this. Is your first proposal basically to change the initial set of parameters so that we start from more "favorable" initial conditions? I guess an intelligent approach to doing this is to use the priors on the parameters to inform the initialization? Also, out of curiosity, is there any justification for why increasing MIN_INFERRED_NOISE_LEVEL tends to improve stability?

Balandat commented 5 years ago

Is your first proposal to basically change the initial set of parameters so that we might start from more "favorable" initial conditions?

That is correct.

I guess an intelligent approach to doing this is to use the priors on the parameters to inform this initialization?

Indeed. @danielrjiang has some code that automatically re-samples values from the prior if the optimization fails. We should be able to get this into master within the next few days.
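To give a flavor of the idea, here is a rough sketch (not the actual PR) that re-samples initial values from priors mirroring the SingleTaskGP defaults and retries the fit on failure; the prior parameters and retry count are illustrative only:

from botorch.fit import fit_gpytorch_model
from botorch.models import SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood
from gpytorch.priors import GammaPrior

def fit_with_prior_restarts(train_x, train_y, max_retries=5):
    # illustrative priors; check the actual defaults in your botorch/gpytorch version
    lengthscale_prior = GammaPrior(3.0, 6.0)
    outputscale_prior = GammaPrior(2.0, 0.15)
    noise_prior = GammaPrior(1.1, 0.05)
    for _ in range(max_retries):
        gp = SingleTaskGP(train_x, train_y)
        gp.covar_module.base_kernel.lengthscale = float(lengthscale_prior.sample())
        gp.covar_module.outputscale = float(outputscale_prior.sample())
        gp.likelihood.noise = max(float(noise_prior.sample()), 1e-4)
        mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
        try:
            fit_gpytorch_model(mll)
            return gp
        except RuntimeError:
            continue  # draw a new set of initial values and try again
    raise RuntimeError("Model fitting failed from all sampled initializations")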

Also, out of curiosity, any justification for why increasing the MIN_INFERRED_NOISE_LEVEL tends to improve stability?

Yes. Basically, in GP inference you have to solve a square linear system Ax = b, where A = K_xx + sigma * I, with K_xx the kernel matrix and sigma the noise variance. As sigma goes to zero, A approaches the kernel matrix K_xx, which is often near-singular (e.g. when training points lie very close together), and that causes numerical issues.
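A toy illustration of this (plain torch, not BoTorch code): two numerically identical inputs make K_xx singular, and the factorization only goes through once a noise term sigma * I is added.

import torch

x = torch.tensor([[0.0], [1e-9], [1.0]], dtype=torch.double)
K = torch.exp(-torch.cdist(x, x) ** 2)  # RBF kernel, lengthscale 1; rows 0 and 1 coincide numerically

try:
    torch.cholesky(K)                   # torch.linalg.cholesky in newer PyTorch
except RuntimeError as e:
    print("sigma = 0:", e)              # e.g. "U(2,2) is zero, singular U"

sigma = 1e-4
torch.cholesky(K + sigma * torch.eye(3, dtype=torch.double))  # succeeds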

Balandat commented 5 years ago

#184 should improve the robustness of the fitting (especially in conjunction with the change increasing the default MIN_INFERRED_NOISE_LEVEL to 1e-4).

We also have a PR in the works for resampling from the prior and re-starting the fitting, which should improve robustness even further.

Balandat commented 5 years ago

This is it: #188

Balandat commented 5 years ago

These recent changes should make the model fitting quite a bit more robust on master. We're planning to release a new version tomorrow.

I'm going to close this out for now, please feel free to re-open if you still run into issues.

Tenoke commented 5 years ago

I still get this cholesky issue all the time on master. Increasing MIN_INFERRED_NOISE_LEVEL even more to 1e-3 fixes it for me.

Specifically, I am using:

from gpytorch.priors.torch_priors import GammaPrior
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.constraints.constraints import GreaterThan

noise_prior = GammaPrior(1.1, 0.05)
noise_prior_mode = (noise_prior.concentration - 1) / noise_prior.rate
MIN_INFERRED_NOISE_LEVEL = 1e-3 # increase from default due to cholesky issue
likelihood = GaussianLikelihood(
    noise_prior=noise_prior,
    noise_constraint=GreaterThan(
        MIN_INFERRED_NOISE_LEVEL,
        transform=None,
        initial_value=noise_prior_mode,
    ),
)
Balandat commented 5 years ago

Are you sure you are running this on master and not some older version? You shouldn't actually see that cholesky error anymore since we're catching it here.

Also, would it be possible to share the data that causes these issues for you?

Tenoke commented 5 years ago

I am installing it directly from GitHub. I set up a small case on Colab where you can reliably reproduce the issue, but where increasing MIN_INFERRED_NOISE_LEVEL fixes it.

Colab link

Balandat commented 5 years ago

Ah ok I see from your example that the error appears when evaluating the mll outside fit_gpytorch_model, when you're optimizing the mll over the input z. No wonder it's not being caught by our error handling.

I'm not surprised that you run into issues like this; the posterior at z likely becomes degenerate as your z values start getting clumped together during the optimization process. As that posterior becomes degenerate, numerical issues in computing the likelihood are to be expected.

Taking a step back, what's your higher level goal in this example? Why are you maximizing the MLL over the inputs z here?

Tenoke commented 5 years ago

That makes sense.

I am optimizing so I can produce the input most likely to give me the output I desire - basically as an acquisition method, similar to how you can use e.g. UCB to get candidates (this performs about as well as, but slower than, joint_optimize(UCB...), so it kind of works even like that), but this is just the first step. (The real evaluation function might, for example, be a human deciding whether the inputs are a 0 or a 1, so I want the best candidate to generate only outputs which evaluate to 1 in the end.)

Ideally, I'd want to use something like Thompson sampling to get the next candidate, but in gpytorch you can only get a sample of the model at a point X (via model(X).sample()) rather than a sample over your whole input space, so I am testing alternative approaches.

Balandat commented 5 years ago

I see. Yeah if you're doing that kind of optimization without any kind of regularization then I'm not surprised you end up with degenerate posteriors.

Not quite sure what the right kind of regularization would be here (I'd have to think about this more). But you can handle these issues more gracefully by catching the specific error as we do here and then dealing with that situation in another way.
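Something along these lines (a user-side sketch; evaluate_mll_at is a stand-in for whatever evaluation is failing, and in newer gpytorch versions the error is a NotPSDError, which subclasses RuntimeError):

try:
    loss = evaluate_mll_at(z)
except RuntimeError as e:
    if "singular" in str(e) or "positive definite" in str(e):
        # degenerate posterior at this z -- e.g. skip it, perturb it, or add jitter and retry
        loss = None
    else:
        raise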

Ideally, I'd want to use something like Thompson sampling

Yeah that's an interesting approach but as you said not straightforward to do. To do proper TS you'd basically have to draw a latent function and then optimize that while re-conditioning on what you "observe" during that optimization. Lmk if you figure out a way to do that...

michaelyli commented 5 years ago

So I'm still able to see this error even though I installed the latest version from master: RuntimeError: cholesky_cpu: For batch 0: U(10,10) is zero, singular

My botorch version is 0.1.1 and my gpytorch version is 0.3.3.

This time the error seems to be triggered by joint_optimize not the fitting though:

    candidates = joint_optimize(
        acq_function=acq_function,
        q = 1, 
        bounds = bound,
        num_restarts=10,
        raw_samples=500,  # used for initialization heuristic
    )

c:\anaconda3\envs\botorch\lib\site-packages\botorch\optim\optimize.py in joint_optimize(acq_function, bounds, q, num_restarts, raw_samples, options, inequality_constraints, equality_constraints, fixed_features, post_processing_func)
--> 152         options=options,

c:\anaconda3\envs\botorch\lib\site-packages\botorch\optim\optimize.py in gen_batch_initial_conditions(acq_function, bounds, q, num_restarts, raw_samples, options)
--> 249         Y_rnd_curr = acq_function(X_rnd[start_idx:end_idx])

c:\anaconda3\envs\botorch\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
--> 493         result = self.forward(*input, **kwargs)

c:\anaconda3\envs\botorch\lib\site-packages\botorch\utils\transforms.py in decorated(cls, X)
--> 128         return method(cls, X)

c:\anaconda3\envs\botorch\lib\site-packages\botorch\acquisition\analytic.py in forward(self, X)
---> 98         posterior = self.model.posterior(X)

c:\anaconda3\envs\botorch\lib\site-packages\botorch\models\gpytorch.py in posterior(self, X, output_indices, observation_noise, **kwargs)
--> 187         mvn = self(X)

c:\anaconda3\envs\botorch\lib\site-packages\gpytorch\models\exact_gp.py in __call__(self, *args, **kwargs)
--> 277         full_output = super(ExactGP, self).__call__(*full_inputs, **kwargs)

c:\anaconda3\envs\botorch\lib\site-packages\gpytorch\module.py in __call__(self, *inputs, **kwargs)
---> 22         outputs = self.forward(*inputs, **kwargs)

c:\anaconda3\envs\botorch\lib\site-packages\botorch\models\gp_regression.py in forward(self, x)
--> 117         return MultivariateNormal(mean_x, covar_x)

c:\anaconda3\envs\botorch\lib\site-packages\gpytorch\distributions\multivariate_normal.py in __init__(self, mean, covariance_matrix, validate_args)
---> 44         super().__init__(loc=mean, covariance_matrix=covariance_matrix, validate_args=validate_args)

c:\anaconda3\envs\botorch\lib\site-packages\torch\distributions\multivariate_normal.py in __init__(self, loc, covariance_matrix, precision_matrix, scale_tril, validate_args)
--> 142         self._unbroadcasted_scale_tril = torch.cholesky(self.covariance_matrix)

Balandat commented 5 years ago

@youngapsmikes does this also happen when you're running everything using the torch.double dtype? What's the model you are using? Is this an issue early on, or does it happen in later iterations?

Balandat commented 5 years ago

Also, are you able to share the model data? We're running benchmarks on a bunch of standard test functions and are not seeing these issues - it would be very helpful to have the data to see whether you're running into some kind of edge case.

michaelyli commented 5 years ago

Hey, I am not running everything with torch.double dtype, I'm using float.

Sure, here's the for loop that's triggering the error. I'm using all library functions.

fit_gpytorch_model(mll)
EI = ExpectedImprovement(gp, best_f=0.1)

candidates = ec.getPerturbCandidate(EI, 1.0)

new_x = candidates.detach()
exact_obj = neg_branin(new_x)

train_x_ei = torch.cat([train_x_ei, candidates])
train_y_ei = torch.cat([train_y_ei, exact_obj])

gp = SingleTaskGP(
    normalize(train_x_ei, bounds=bounds), 
    standardize(train_y_ei)
    )
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
michaelyli commented 5 years ago

A follow-up on why I'm using float instead of double. I'm casting a numpy array to a tensor, and when I cast it to a double I get this error:

bound = torch.from_numpy(bound)
bound = bound.double()
bound = bound.t()

"Expected object of scalar type Float but got scalar type Double for sequence element 1 in sequence argument at position #1 'tensors'"

Balandat commented 5 years ago

Sorry, what's getPerturbCandidate here?

How many iterations in do you see the error? The Branin function is pretty smooth and only two-dimensional, so if you do a ton of runs in a noiseless setting, the inferred noise level could end up being extremely small, which can lead to numerical issues. To see if that's what's going on here, you could just add some small noise to the objective.

So I'm casting a numpy array to a tensor. When I cast it to a double, I get this error

This is likely b/c you're only changing the bounds - you'll have to make sure that all tensors involved use the same torch.dtype. So the input tensors should all be doubles. You can put all parameters of the model to double by using model.to(dtype=torch.double).
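A minimal sketch of keeping everything in double precision (the *_np names are placeholder numpy arrays, not from the code above):

import torch
from botorch.models import SingleTaskGP
from botorch.utils.transforms import normalize, standardize
from gpytorch.mlls import ExactMarginalLogLikelihood

train_x = torch.from_numpy(train_x_np).double()
train_y = torch.from_numpy(train_y_np).double()
bound = torch.from_numpy(bound_np).double().t()

gp = SingleTaskGP(normalize(train_x, bounds=bound), standardize(train_y)).to(dtype=torch.double)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp).to(dtype=torch.double)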

michaelyli commented 5 years ago

getPerturbCandidates() is a helper function I wrote basically to constrain the optimization of the acquisition function subject to standard box constraints.

I was worried that my constraints might be causing some of the numerical issues, so I experimented with the example below (a standard BoTorch use case). I tried adding some noise to the Branin function as well as increasing MIN_INFERRED_NOISE_LEVEL, but I still run into the numerical issue occasionally.

Here is the code:

NOISE_SE = 30.0
for exp in range(0, 5):

    print("Experiment number" + str(exp))
    train_x_ei = torch.stack([-5 + torch.rand(2,)*10, 0 + torch.rand(2,)*15]).t().to(torch.double)
    train_y_ei = neg_branin(train_x_ei).to(torch.double)

    gp = SingleTaskGP(
        normalize(train_x_ei, bounds=bounds),
        standardize(train_y_ei)
        ).to(dtype=torch.double)
    mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
    mll.to(dtype=torch.double)

    for i in range(0, 10):

        print(i)

        ## fit models
        fit_gpytorch_model(mll)

        ## update acquisition function using
        ## noisy values as approximation of best_f
        best_f = train_x_ei.min()
        EI = ExpectedImprovement(gp, best_f=best_f)

        ## optimize subject to coordinate descent constraint
        candidates = joint_optimize(
            acq_function=EI,
            q=1,
            bounds=bounds,
            num_restarts=10,
            raw_samples=500,  # used for initialization heuristic
        )
        new_x = candidates.detach()

        ## observe function at new point
        exact_obj = neg_branin(new_x).to(torch.double)
        exact_obj = exact_obj + torch.normal(mean=0.0, std=torch.arange(1., 2.)*NOISE_SE).to(torch.double)

        train_x_ei = torch.cat([train_x_ei, candidates])
        train_y_ei = torch.cat([train_y_ei, exact_obj])

        gp = SingleTaskGP(
            normalize(train_x_ei, bounds=bounds),
            standardize(train_y_ei)
            )
        gp.to(dtype=torch.double)
        mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
        mll.to(dtype=torch.double)

        print(train_x_ei)

Here is the error output: RuntimeError: cholesky_cpu: For batch 0: U(10,10) is zero, singular U.

If you look at the train_x_ei matrix right before the runtime error, you can observe that some rows are essentially identical which kind of accounts for the numerical issues.

tensor([[ 1.2249e+00,  1.2013e+01],
        [ 5.1114e-01,  1.1419e+01],
        [ 0.0000e+00,  7.3238e-01],
        [ 0.0000e+00,  6.9859e-01],
        [-2.8271e-13,  6.5770e-02],
        [-4.6318e-17,  2.2204e-16],
        [-1.2128e-15,  1.7722e-01],
        [-1.6227e-15,  1.2054e-01],
        [-4.8572e-17,  7.9413e-03],
        [ 0.0000e+00,  8.1456e-03]], dtype=torch.float64)

I will try playing with some higher dimensional, higher noise benchmark problems.

michaelyli commented 5 years ago

Also, as a follow-up, how do you recommend dealing with this in practice? Sometimes I may be running intensive benchmarks remotely on a cluster. Should I just catch this error in a try/except and restart the experiments?

sdaulton commented 5 years ago

@youngapsmikes, I think some of these issues are still coming from normalization/standardization.

train_x_ei = torch.stack([-5 + torch.rand(2,)*10, 0 + torch.rand(2,)*15]).t().to(torch.double) creates a tensor of 2 random points in [-5, 10] x [0, 15]. Assuming bounds is set to [0, 1]^2, candidates = joint_optimize(...) will return candidates in [0, 1]^2. These candidates need to be unnormalized back to [-5, 10] x [0, 15] before concatenating them with train_x_ei. You can see in the train_x_ei you printed out that the first two points are in [-5, 10] x [0, 15] and the rest are in [0, 1]^2, which will cause issues when you normalize the whole tensor, because the points originally in [0, 1]^2 will end up very close together.

Using Ax helps avoid the pains of standardization and normalization.

Also, best_f = train_x_ei.min() -> best_f = standardize(train_y_ei).max()
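A rough sketch of those two fixes applied to the loop above (search_bounds for the original [-5, 10] x [0, 15] domain and standard_bounds for [0, 1]^2 are illustrative names, not from the original snippet; other imports and neg_branin as in that snippet):

from botorch.utils.transforms import normalize, standardize, unnormalize

best_f = standardize(train_y_ei).max()              # instead of train_x_ei.min()
EI = ExpectedImprovement(gp, best_f=best_f)

candidates = joint_optimize(
    acq_function=EI,
    q=1,
    bounds=standard_bounds,                         # the model was trained on normalized inputs
    num_restarts=10,
    raw_samples=500,
)
new_x = unnormalize(candidates.detach(), bounds=search_bounds)  # back to the original domain

train_x_ei = torch.cat([train_x_ei, new_x])         # training data stays in the original domain
train_y_ei = torch.cat([train_y_ei, neg_branin(new_x)])
gp = SingleTaskGP(normalize(train_x_ei, bounds=search_bounds), standardize(train_y_ei))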

michaelyli commented 5 years ago

Thanks so much (and for catching my bugs)! I ran a few experiments and haven't noticed any numerical issues.

Balandat commented 5 years ago

glad to hear

michaelyli commented 5 years ago

Sorry to comment on this again but do you have any suggestions for best practice for dealing with this error when it happens in case it does happen infrequently? Once you catch this error in a try and except, is it best practice to throw out that trial?

Balandat commented 5 years ago

So ideally this shouldn't happen - if it happens it might be a sign that something with either the normalization or the evaluation is problematic (maybe it returns outliers that are easy to detect as such and should be removed?). Maybe the underlying function is discontinuous and just not well described by a GP? Or, on the other extreme, the function is deterministic and constant, which would cause obvious issues with inferring the noise level. One would have to take a closer look at the data to see what exactly is going on.

The issue with just dropping these trials is that you likely end up with a biased model - if the errors happen infrequently but systematically for, say, large observed values, and you drop these trials, then your model will be biased towards lower values.

Jimbo994 commented 3 years ago

Hi, I have a similar but different issue, and this thread seems like the right place for it. Instead of getting the NotPSD error when fitting the GP(s), I get the error during optimization of the acquisition function.

Issue description

So I have pretty much one-to-one implemented this tutorial: https://botorch.org/tutorials/multi_objective_bo, but applied it to my problem. The results are quite good, which I am happy about. However, I am now at the stage where I am running many trials with different random seeds and different random initial data points, and this is where I run into the problem. At some random seeds, and at differing iterations, I sometimes get an error when optimizing the acquisition functions (see the error below). It happens both for qParEGO and qEHVI, and both on CPU and GPU. I have also tried different reference points. I am quite puzzled about why this is. I have eyeballed my code for normalization, standardization, etc. and can't spot what is going wrong... Do you know what could be the cause of this? Many thanks in advance.

Code example

In order to execute this you would need more of my code; please let me know if you want me to share it.

def generate_random_data(path, search_bounds, bounds, n, N1, N2, dwell_vol=[0.196, 0.073], dead_vol=[0.40, 0.624]):
    p = pd.read_csv(path)
    D1_d = p[['lnk0_1', 'S_11', 'S_21']].values
    D2_d = p[['lnk0_2', 'S_12', 'S_22']].values

    experiments, scores, res_scores, ask_scores, time_scores = [], [], [], [], []
    for i in range(n):
        exp_params = np.random.uniform(low=search_bounds[0].cpu().numpy(), high=search_bounds[1].cpu().numpy())
        experiments.append(exp_params)
        # This is where I call my own objective function
        bad_pred, score, res_score, ask_score, time_score, min_res = objective_function_reduced_NK_NK(exp_params[0:4], exp_params[4:], D1_d, D2_d, dwell_vol, dead_vol, search_bounds,N1,N2, flowrate[0], flowrate[1], normalize_output=False)
        if bad_pred.count(0) == 2: 
            scores.append(score), res_scores.append(min_res*1.5), time_scores.append(time_score)
        else:  #Failsafe if objective function returns NaN or Inf values, rarely happens
            scores.append(0), res_scores.append(0), time_scores.append(0)
    train_x = normalize(torch.tensor(experiments, **tkwargs), search_bounds).to(**tkwargs)
    train_obj = torch.stack((torch.tensor(res_scores), torch.tensor(time_scores)), dim=1).to(**tkwargs)

    return train_x, train_obj, res_score, ask_score, time_score

def initialize_model(train_x, train_obj):
    # define models for objective and constraint Using default priors
    model = SingleTaskGP(train_x, train_obj, outcome_transform=Standardize(m=train_obj.shape[-1]))
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    return mll, model

def optimize_qehvi_and_get_observation(model, train_obj, sampler, search_bounds, D1_d, D2_d, dwell_vol, dead_vol, standard_bounds, N1, N2):
    """Optimizes the qEHVI acquisition function, and returns a new candidate and observation."""
    print('optimizing QEHVI')
    # partition non-dominated space into disjoint rectangles
    partitioning = NondominatedPartitioning(num_outcomes=2, Y=standardize(train_obj))
    acq_func = qExpectedHypervolumeImprovement(
        model=model,
        ref_point=ref_point,  # use known reference point
        partitioning=partitioning,
        sampler=sampler,
    )
    # optimize
    candidates, _ = optimize_acqf(
        acq_function=acq_func,
        bounds=standard_bounds,
        q=BATCH_SIZE,
        num_restarts=20,
        raw_samples=1024,  # used for initialization heuristic
        options={"batch_limit": 5, "maxiter": 200, "nonnegative": True},
        sequential=True,
    )
    # observe new values
    new_x = candidates.detach()
    new_x_norm = unnormalize(new_x, bounds=search_bounds)

    scores, rs_scores, ask_scores, exp_time_scores = [], [], [], []
    for i in range(BATCH_SIZE):
        bad_pred, score, rs_score_sum, ask_score, exp_time_score, min_res = objective_function_reduced_NK_NK(
            new_x_norm.cpu().numpy()[i][0:4], new_x_norm.cpu().numpy()[i][4:], D1_d, D2_d, dwell_vol, dead_vol,
            search_bounds, N1, N2, flowrate[0], flowrate[1], normalize_output=False)

        if bad_pred.count(0) == 2:
            scores.append(score), rs_scores.append(min_res*1.5), ask_scores.append(ask_score), exp_time_scores.append(
                exp_time_score)
        else:
            scores.append(0), rs_scores.append(0), ask_scores.append(0), exp_time_scores.append(0)

        new_obj = torch.stack((torch.tensor(rs_scores), torch.tensor(exp_time_scores)), dim=1).to(**tkwargs)
    return new_x, new_obj, scores, rs_scores, ask_scores, exp_time_scores

standard_bounds = torch.tensor([[0, 0, 0, 0,    0, 0, 0, 0],
                               [1, 1, 1, 1,    1, 1, 1, 1]], **tkwargs)

search_bounds = torch.tensor([[10., 0.1, 0.6, 40.,  0., 0.1, 0.6, 1.],
                              [20, 0.5, 1, 200.,  0.2, 0.5, 1, 3.]], **tkwargs)

BATCH_SIZE = 1
N_START = 3
N_TRIALS = 30
N_BATCH = 40
MC_SAMPLES = 128

ref_point = torch.tensor([1.3, 0.4], **tkwargs)
hv = Hypervolume(ref_point=ref_point)
maxhv = 1.5 #objectives are max 1.5 and 1 respectively.

for trial in range(1, N_TRIALS + 1):

    print(f"\nTrial {trial:>2} of {N_TRIALS} ", end="")
    torch.manual_seed(trial)
    hvs_qparego, hvs_qehvi, hvs_random = [], [], []

    # call helper functions to generate initial training data and initialize model
    train_x_qparego, train_obj_qparego, _, _, _ = generate_random_data(f, search_bounds, standard_bounds, N_START, N1, N2, dwell_vol, dead_vol)
    mll_qparego, model_qparego = initialize_model(train_x_qparego, train_obj_qparego)

    train_x_qehvi, train_obj_qehvi = train_x_qparego, train_obj_qparego
    train_x_random, train_obj_random = train_x_qparego, train_obj_qparego
    # compute hypervolume
    mll_qehvi, model_qehvi = initialize_model(train_x_qehvi, train_obj_qehvi)

    #compute pareto front
    pareto_mask = is_non_dominated(train_obj_qparego)
    pareto_y = train_obj_qparego[pareto_mask]

   #compute hypervolume
    volume = hv.compute(pareto_y)

    hvs_qparego.append(volume)
    hvs_qehvi.append(volume)
    hvs_random.append(volume)

    outputscales_qehvi, noise_qehvi, lengthscales_qehvi = [], [], []
    outputscales_qparego, noise_qparego, lengthscales_qparego = [], [], []

    # run N_BATCH rounds of BayesOpt after the initial random batch
    for iteration in range(1, N_BATCH + 1):

        t0 = time.time()

        # fit the models
        fit_gpytorch_model(mll_qparego)
        fit_gpytorch_model(mll_qehvi)

        # store GP parameters
        outputscales_qehvi.append(model_qehvi.covar_module.outputscale.cpu().detach().numpy())
        noise_qehvi.append(model_qehvi.likelihood.noise.cpu().detach().numpy())
        lengthscales_qehvi.append(model_qehvi.covar_module.base_kernel.lengthscale.cpu().detach().numpy())

        outputscales_qparego.append(model_qparego.covar_module.outputscale.cpu().detach().numpy())
        noise_qparego.append(model_qparego.likelihood.noise.cpu().detach().numpy())
        lengthscales_qparego.append(model_qparego.covar_module.base_kernel.lengthscale.cpu().detach().numpy())

        # define the qEI and qNEI acquisition modules using a QMC sampler
        qparego_sampler = SobolQMCNormalSampler(num_samples=MC_SAMPLES)
        qehvi_sampler = SobolQMCNormalSampler(num_samples=MC_SAMPLES)

        # optimize acquisition functions and get new observations
        new_x_qparego, new_obj_qparego, _, _, _, _ = optimize_qparego_and_get_observation(
            model_qparego, train_obj_qparego, qparego_sampler, search_bounds, D1_d, D2_d, standard_bounds, N1, N2
        )
        new_x_qehvi, new_obj_qehvi, _, _, _, _ = optimize_qehvi_and_get_observation(
            model_qehvi, train_obj_qehvi, qehvi_sampler, search_bounds, D1_d, D2_d, dwell_vol, dead_vol, standard_bounds, N1, N2
        )
        new_x_random, new_obj_random, _, _, _ = generate_random_data(f, search_bounds, standard_bounds, BATCH_SIZE, N1, N2, dwell_vol, dead_vol)

        # update training points
        train_x_qparego = torch.cat([train_x_qparego, new_x_qparego])
        train_obj_qparego = torch.cat([train_obj_qparego, new_obj_qparego])

        train_x_qehvi = torch.cat([train_x_qehvi, new_x_qehvi])
        train_obj_qehvi = torch.cat([train_obj_qehvi, new_obj_qehvi])

        train_x_random = torch.cat([train_x_random, new_x_random])
        train_obj_random = torch.cat([train_obj_random, new_obj_random])

        # update progress
        for hvs_list, train_obj in zip(
                (hvs_random, hvs_qparego, hvs_qehvi),
                (train_obj_random, train_obj_qparego, train_obj_qehvi),
        ):
            # compute pareto front
            pareto_mask = is_non_dominated(train_obj)
            pareto_y = train_obj[pareto_mask]
            # compute hypervolume
            volume = hv.compute(pareto_y)
            hvs_list.append(volume)

        # reinitialize the models so they are ready for fitting on next iteration
        # Note: we find improved performance from not warm starting the model hyperparameters
        # using the hyperparameters from the previous iteration
        mll_qparego, model_qparego = initialize_model(train_x_qparego, train_obj_qparego)
        mll_qehvi, model_qehvi = initialize_model(train_x_qehvi, train_obj_qehvi)

        t1 = time.time()

Error messages

Traceback (most recent call last):
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 27, in psd_safe_cholesky
    L = torch.cholesky(A, upper=upper, out=out)
RuntimeError: cholesky_cpu: For batch 0: U(8,8) is zero, singular U.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Multi_output_BO.py", line 376, in <module>
    model_qehvi, train_obj_qehvi, qehvi_sampler, search_bounds, D1_d, D2_d, dwell_vol, dead_vol, standard_bounds, N1, N2
  File "Multi_output_BO.py", line 183, in optimize_qehvi_and_get_observation
    sequential=True,
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/botorch/optim/optimize.py", line 157, in optimize_acqf
    options=options,
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/botorch/optim/initializers.py", line 114, in gen_batch_initial_conditions
    X_rnd[start_idx:end_idx].to(device=device)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/botorch/utils/transforms.py", line 198, in decorated
    return method(cls, X, **kwargs)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/botorch/utils/transforms.py", line 169, in decorated
    return method(cls, X, **kwargs)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/botorch/acquisition/multi_objective/monte_carlo.py", line 270, in forward
    posterior = self.model.posterior(X)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/botorch/models/gpytorch.py", line 301, in posterior
    mvn = self(X)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/models/exact_gp.py", line 319, in __call__
    predictive_mean, predictive_covar = self.prediction_strategy.exact_prediction(full_mean, full_covar)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 317, in exact_prediction
    self.exact_predictive_mean(test_mean, test_train_covar),
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 335, in exact_predictive_mean
    res = (test_train_covar @ self.mean_cache.unsqueeze(-1)).squeeze(-1)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 59, in g
    return _add_to_cache(self, cache_name, method(self, *args, **kwargs), *args, kwargs_pkl=kwargs_pkl)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 284, in mean_cache
    mean_cache = train_train_covar.inv_matmul(train_labels_offset).squeeze(-1)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 963, in inv_matmul
    return func.apply(self.representation_tree(), False, right_tensor, *self.representation())
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 51, in forward
    solves = _solve(lazy_tsr, right_tensor)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 15, in _solve
    return lazy_tsr.cholesky()._cholesky_solve(rhs)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 750, in cholesky
    chol = self._cholesky(upper=False)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 59, in g
    return _add_to_cache(self, cache_name, method(self, *args, **kwargs), *args, kwargs_pkl=kwargs_pkl)
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 419, in _cholesky
    cholesky = psd_safe_cholesky(evaluated_mat, jitter=settings.cholesky_jitter.value(), upper=upper).contiguous()
  File "/home/jimboelrijk/anaconda3/envs/LCLC/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 51, in psd_safe_cholesky
    f"Matrix not positive definite after repeatedly adding jitter up to {jitter_new:.1e}. "
gpytorch.utils.errors.NotPSDError: Matrix not positive definite after repeatedly adding jitter up to 1.0e-06. Original error on first attempt: cholesky_cpu: For batch 0: U(8,8) is zero, singular U.


saitcakmak commented 3 years ago

I ran into this several times myself. IIRC, it is a numerical issue that is caused by your training samples. I think in my case this was happening with a test function that would take values ~70k in a tiny region, and was relatively flat around 2-5k for the rest of it. If you're in a similar case, you may want to truncate the extreme values from your function. I'd first try using dtype=torch.double though, that typically helps with numerical issues.
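For instance, a two-line sketch of both suggestions (the clamp value is arbitrary and problem-specific):

train_x = train_x.to(torch.double)
train_obj = train_obj.to(torch.double).clamp(max=5000.0)  # truncate extreme objective values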

Balandat commented 3 years ago

@Jimbo994 what's the value of noise_qehvi that you see when you see these failures? My suspicion is that it's going to be very small. One way to mitigate the numerical issues is by putting a lower bound on the inferred noise level (the default is specified here: https://github.com/pytorch/botorch/blob/master/botorch/models/gp_regression.py#L111-L115). I would also be interested in the lengthscale parameters when things fail, to see if there is any degenerate behavior there.
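A rough sketch of raising that floor in the initialize_model helper above (1e-3 is just an illustration; the batch_shape assumes the batched multi-output SingleTaskGP setup with no additional input batch dimensions):

import torch
from botorch.models import SingleTaskGP
from botorch.models.transforms.outcome import Standardize
from gpytorch.constraints import GreaterThan
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.mlls import ExactMarginalLogLikelihood

def initialize_model(train_x, train_obj):
    # custom likelihood with a higher noise floor than the packaged default
    likelihood = GaussianLikelihood(
        noise_constraint=GreaterThan(1e-3),
        batch_shape=torch.Size([train_obj.shape[-1]]),
    )
    model = SingleTaskGP(
        train_x, train_obj,
        likelihood=likelihood,
        outcome_transform=Standardize(m=train_obj.shape[-1]),
    )
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    return mll, model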

Jimbo994 commented 3 years ago

Thanks for the swift reply (as always). I looked into both your comments; I will address @saitcakmak's first. For the problem I am studying at the moment, I still make use of a simulator (but will shift to real experiments in the near future), which means I can also run grid searches to see what my optimization space looks like. Below you can see the objective scores for the grid search (a grid over the 8 input parameters).

[image]

I am currently optimising on "time score" and on "lowest resolution". I had constructed the time score in such a way that negative values are set to 0. As you can see, most samples in the grid search actually have a time score of 0, so I guess this could have created very flat areas, which might be hard to optimize over. I did this because the value can become really negative with certain parameters and I wanted to keep that contained. I decided to remove this, and it then looks as follows (just for a particular sample).

[image]

I also found out I had a bug in my minimum resolution objective, which looks friendlier to optimize on. I am now running optimizations to see if this solves my issue.

@Balandat, this is what the noise and lengthscales look like for a failure.

[image]

Printed noise:

       noise1    noise2
0    0.168521  0.170675
1    0.000100  0.000100
2    0.000100  0.000100
3    0.000100  0.000100
4    0.000100  0.000100
5    0.000100  0.000100
6    0.000100  0.000100
7    0.000100  0.000100
8    0.000100  0.000100
9    0.000100  0.000100
10   0.000100  0.000100
11   0.018811  0.000100
12   0.017884  0.000100
13   0.016808  0.000100
14   0.019829  0.000100
15   0.018676  0.000100
16   0.017911  0.479345
17   0.013851  0.380675
18   0.014870  0.381921
19   0.015542  0.385278
20   0.017296  0.359735
21   0.018725  0.368275
22   0.021096  0.367826
23   0.021235  0.363038
24   0.020285  0.344205
25   0.018881  0.327917
26   0.017005  0.311616
27   0.000100  0.044879
28   0.015189  0.312539
29   0.000100  0.016099
30   0.009768  0.016696
31   0.009992  0.016386
32   0.000100  0.066663
33   0.000100  0.066909
34   0.000100  0.063174
35   0.009980  0.081829
36   0.000100  0.078536
37   0.009565  0.093238

So indeed the GP noise is very low. But since I am using a simulator, the data is pretty much noiseless. For this reason I was initially using horseshoe priors for the noise, but moved back to the default hoping to resolve the issue.

For a non-failing run, the noise and lengthscales look like this:

[image]

Anyhow, I will rerun the optimisations over the weekend with the changes to my objectives to see whether the problem is solved.