I see you are using the float32 data type - this is often problematic when working with GP models because the involved covariances (as you painfully discovered) are often close to singular. My first suggestion would be to switch over to using the torch.double data type for all your data; that should hopefully resolve a lot of these numerical issues (it's just a lot easier to get into numerically troublesome territory with single precision). Let me know if that doesn't help!
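For concreteness, a minimal sketch of what that switch looks like (toy tensors here; substitute your actual data):

```python
import torch

# Toy data just to illustrate the dtype switch; replace with your actual tensors.
train_X = torch.rand(20, 4)  # float32 by default
train_Y = torch.rand(20, 1)

# Cast everything to double *before* building the model so that all downstream
# covariance computations happen in float64.
train_X = train_X.to(torch.double)
train_Y = train_Y.to(torch.double)
```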
Thanks for your reply!
I switched to float64 now, but it still doesn't work.
I changed the following tensors:
They stay in this format from the beginning of the experiment to the end, with no casting in between.
I also checked that the covariance that failed is indeed float64, and it is. The determinant of that covariance matrix was -5.019655123536972e-17, which is very close to zero; I don't know if that plays a role as well, besides the (once again) single negative eigenvalue that prevents the Cholesky decomposition (here -1.43235979e-04, not too close to 0).
Could you post a full repro of this behavior? I can see that this is happening when sampling the max values during the computation of the acquisition function, but it's hard to tell what kind of settings are being used without the actual code.
Basically what happens in qMultiFidelityLowerBoundMaxValueEntropy is that we draw posterior samples of the latent function at a relatively dense grid to produce approximate samples of the max value. This does not involve the noise level of the likelihood, so if the lengthscales of the model are large and the discrete sampling locations are close together (possibly because there are many points), then it's not too surprising that this may run into some numerical issues.
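Schematically the idea is something like the following (illustration only, using a plain SingleTaskGP on toy data rather than the actual library internals):

```python
import torch
from botorch.models import SingleTaskGP

# Illustration: fit any GP, draw joint posterior samples of the latent
# (noise-free) function on a dense grid, and take the max of each sample path.
# The joint covariance over many nearby grid points is exactly the matrix
# whose Cholesky factorization can fail.
train_X = torch.rand(20, 3, dtype=torch.double)
train_Y = torch.rand(20, 1, dtype=torch.double)
model = SingleTaskGP(train_X, train_Y)

candidate_set = torch.rand(1000, 3, dtype=torch.double)  # dense grid of points
posterior = model.posterior(candidate_set)               # latent posterior (no observation noise)
samples = posterior.rsample(torch.Size([16]))            # shape: 16 x 1000 x 1
max_value_samples = samples.max(dim=-2).values           # one approximate max value per sample path
```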
One quick thing you could try is increasing the maximum amount of jitter that gpytorch attempts in psd_safe_cholesky:

```python
with linear_operator.settings.cholesky_jitter(double=1e-6):
    ...
```

or

```python
with linear_operator.settings.cholesky_max_tries(5):
    ...
```
The first of the two sets the starting jitter value for double tensors to 1e-6; the second increases the number of tries to 5 (on each try the jitter value is increased by an order of magnitude).
(for more context, these are being invoked here: https://github.com/cornellius-gp/linear_operator/blob/main/linear_operator/utils/cholesky.py#L28-L31)
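Putting the two together around whatever call is failing for you would look roughly like this (sketch only; acq_func and bounds are placeholders for your acquisition function and search-space bounds, and the optimize_acqf arguments are just examples):

```python
from linear_operator import settings
from botorch.optim import optimize_acqf

# Allow a larger starting jitter and more retries for the Cholesky calls that
# happen while evaluating/optimizing the acquisition function.
with settings.cholesky_jitter(double=1e-6), settings.cholesky_max_tries(5):
    candidate, acq_value = optimize_acqf(
        acq_function=acq_func,  # e.g. your qMultiFidelityLowerBoundMaxValueEntropy instance
        bounds=bounds,          # placeholder bounds tensor
        q=1,
        num_restarts=10,
        raw_samples=512,
    )
```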
Doing so will cost some precision, but it may be ok if this is happening only occasionally. If this is still an issue then it may be worth considering some other approaches (e.g. using RFF/Decoupled sampling for generating max value samples) or potentially using a different acquisition function.
I looked more into it and was able to fix the issue in the end. The mistake was mine: the kernel I used was correctly implemented, but it was later proven wrong in a new publication (it is not PSD if the lengthscale is too big) that showed the proper way of computing it. Using double precision is necessary, however. Thanks for your help!
🐛 Bug
I am running a multi-fidelity Bayesian optimization to find a specific quaternion that maximizes a score function using qMultiFidelityLowerBoundMaxValueEntropy. In order to do that I had to warp a Matérn kernel to account for the space of quaternions. I double-checked the equations and even tried with the regular Matérn kernel, but I always get the NotPSDError exception. I can get the exception during the fitting of the parameters or during the acquisition function search. I know that my score is deterministic, so I used a FixedNoiseMultiFidelityGP with a noise of 1e-6 (maybe that's too small? I tried 1e-4 and 1e-5 too).
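A minimal sketch of the kind of setup I mean (placeholder data; the data_fidelity keyword may be named differently depending on the BoTorch version):

```python
import torch
from botorch.models import FixedNoiseMultiFidelityGP

# Placeholder data: 3 design dimensions + 1 fidelity column, double precision.
train_X = torch.rand(20, 4, dtype=torch.double)
train_Y = torch.rand(20, 1, dtype=torch.double)
# Deterministic objective, so the observation noise is fixed to a tiny value.
train_Yvar = torch.full_like(train_Y, 1e-6)

model = FixedNoiseMultiFidelityGP(
    train_X, train_Y, train_Yvar,
    data_fidelity=3,  # index of the fidelity column (keyword name varies by BoTorch version)
)
```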
I noticed that sometimes the learned parameters are too small (e.g. the outputscale of the ScaleKernel or the lengthscale of the MaternKernel ends up around 1e-10), and I believe that can be a source of NotPSDError, so I constrained the parameters to be >1e-2/1e-3.
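The constraints were added along these lines (a sketch using gpytorch's GreaterThan constraint, not my exact warped kernel):

```python
from gpytorch.constraints import GreaterThan
from gpytorch.kernels import MaternKernel, ScaleKernel

# Keep the lengthscale and outputscale away from pathologically small values.
covar_module = ScaleKernel(
    MaternKernel(
        nu=2.5,
        ard_num_dims=3,
        lengthscale_constraint=GreaterThan(1e-2),
    ),
    outputscale_constraint=GreaterThan(1e-3),
)
```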
I tried removing the LinearTruncatedFidelityKernel, so that's not causing the problem either. I don't think the problem is coming from the individual kernels per se, since removing/replacing them doesn't solve the issue.
I changed the source code of cholesky.py so that it saves the covariance matrix when an exception is thrown. When the fitting fails, for example, the covariance is a 28x28 float32 matrix that unfortunately has negative eigenvalues. Here (cov_fail_fitting.npz) are some properties of the covariance matrix (failed during fitting):
And a plot of the covariance matrix:
Another example is a failure during the acquisition function search (cov_fail_acq.npz). The matrix is 3x3:
The matrices are available here: covariance_matrices.zip
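For completeness, the eigenvalue check on the saved matrices looks roughly like this (the array names inside the .npz files are whatever np.savez stored, so the code simply iterates over them):

```python
import numpy as np

for fname in ["cov_fail_fitting.npz", "cov_fail_acq.npz"]:
    data = np.load(fname)
    for key in data.files:
        cov = data[key]
        eigvals = np.linalg.eigvalsh(cov)  # covariance is symmetric, so eigvalsh is appropriate
        print(fname, key, cov.shape,
              "min eigenvalue:", eigvals.min(),
              "det:", np.linalg.det(cov))
```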
I am not sure where to look now in order to debug and find what causes the eigenvalues to be negative. I saw in different issues that it can be due to high-dimensional searches, but quaternions are only 4-dimensional, and the cube representation is only 3D (since unit quaternions can be represented with 3 scalars). Any suggestions or ideas are welcome. Thanks!
To reproduce
Code snippet to reproduce
The code base is pretty big, feel free to ask if you need any other piece of code.
Stack trace/error message
System information
Please complete the following information: