Issue in score function for specific dataset

@jiayilei discovered that under specific circumstances the variance in the score function initVar (defined here) can become negative.

@JCatwood, believes that chol[i, :, :] in Line 758 should match that torch.linalg.cholesky(self.transport_map_kernel.kernel_fun(X, theta, sigma_fun(i, theta, scal), smooth, nugMean[i]).squeeze()) and has observed that these values are not matching.

Already the original function uses the pre-computed chol value. This did not change in the new implementation. https://github.com/katzfuss-group/batram/blob/a607e75247fa04170dfa653005354c1c5fba9bc8/src/batram/legacy/fit_map.py#L242-L280

If this is wrong, it has been since the beginning. I do not know how it is in the R implementation.

What are the reasons for the differences observed?

krn = torch.linalg.cholesky(self.transport_map_kernel.kernel_fun(X, theta, sigma_fun(i, theta, scal), smooth, nugMean[i]).squeeze())
@JCatwood: use chol(krn)
chol[i, :, :] equals chol(krn + I)

Edit: The code @JCatwood inspected is not the current version and outdated at least since fa29406ad3ba150d7b32a57ead29a35e42dc4db2. However, the change should not affect how chol[i, :, :] is defined in mathematical terms but might have a different implementation now.

I believe, the current implementation is correct as $G_i$ as defined in Proposition 1 includes the added diagonal entires and is used to calculate the variance $d$ as a function of $\tilde \beta = (G^{-1/2}y)^T (G^{-1/2}y) = y^T G^{-1} y$. In the implementation we use $G^{1/2} = chol(G) =$ chol[I, :, :].

Closed issue by mistake when writing a reply. Moving those contents into a new reply behind the reopened issue.

Mathematically, Equation (11) in the inserted image is a nonnegative function if K is a positive definite covariance function. However, this need not be the case numerically—even if one uses Cholesky factors to stabilize the computations. And this remains the case whether we use single or double float precision.

A demonstration of this claim can be seen in the following code. I've purposefully made this fail, but we can arbitrarily choose a more dense set and retrieve the same results for any kernel at any parameterization.

import torch
from gpytorch import kernels

@torch.no_grad()
def kriging_variance(kernel, train_x, x_new):
    """Compute the variance in Kriging."""

    # Get the Cholesky factor over the training data
    k0 = kernel(train_x, train_x)
    chol = torch.linalg.cholesky(k0, upper=False)

    # Compute the other variance components
    k1 = kernel(train_x, x_new)
    k2 = kernel(x_new, x_new)

    # Compute the inversion
    v = torch.linalg.solve_triangular(chol, k1.evaluate(), upper=False)

    return k2 - v.square().sum(0, keepdim=True)

def run_experiment(dtype):
    # Set dtype
    torch.set_default_dtype(dtype)

    # Create fake input data
    x_train = torch.linspace(0, 1, 100).unsqueeze(-1)
    x_new = torch.linspace(0, 1, 5).unsqueeze(-1)

    # Define a kernel
    kernel = kernels.RBFKernel()
    kernel.initialize(lengthscale=0.6)

    # Compute variance
    var = kriging_variance(kernel, x_train, x_new)
    negatives = var.evaluate() < 0
    return torch.mean(negatives.float())

float32_case = run_experiment(torch.float32)
float64_case = run_experiment(torch.float64)

print(float32_case.item(), float64_case.item())

All of the variances are negative in the float32 case, and 80% of them are negative in the float64 case. In those two cases, the condition numbers of k0 and k2 are

(float32) 10.5422 and 4.1454
(float64) 20.0355 and 4.1454

The problem for us is to decide how we are going to assure that properties of the conditional variance are numerically consistent with its mathematical properties. For this we will need to clamp the min values of the variance. Examples are available in scikit-learn and gpytorch (through their linear_operator library). I tried finding similar implementations in GPflow (a tensorflow-based library for GP models), but could not quickly find how they handle this case in their source code.

There are additional considerations in our code. First, we assume in the prior that the conditional variance is a decreasing function in the spatial index. That implies we should not simply clamp conditional variances to be exactly zero (also exact zeros may cause other issues in our code). Second, we use gpytorch kernels throughout our models. This means that there are problems we don't have to address for ourselves; namely by clamping distance matrices to be nonnegative (see in gpytorch, and similar examples in other packages).

There remain things we must check for equivalence between our new implementation and the legacy code. Namely, we need to be sure we use the same orderings, neighbors, and masks on nearest neighbors whenever we do Kriging in the score function. I've done some of this elsewhere in a private fork. We can move those tests into this code base now or write de novo tests for this issue (I prefer this, as my other code uses other refactors not implemented here yet).

@feji3769 and I looked yesterday for a moment at the paper and had the thought that areas without variation among replications might cause the numerical issue since $|y{c(i)} - y'{c(i)}| \approx 0$ among some of fields would lead to a numerically singular $G_i$. In the most extreme case when all distances are 0: $$\Rightarrow G_i \stackrel{\text{numerically}}{\approx} \theta^2_a 11^T / \ell_i^{\theta_k>0} + I \stackrel{\ell_i \to 0}{\approx} 11^T / \ell_i^{\theta_k>0}$$

Do we ever encounter those cases in the data?

computation-of-G_i

I've worked on reproducing the reported errors with inconsistent results. I'm using the latest branch (commit e2180b7) with the SimpleTM.score method and the legacy.fit_map.cond_samp function. The SimpleTM.score method in the current branch yields errors different from the ones reported to me originally by @jiayilei. The cond_samp function implemented by @JCatwood does not produce any errors when I score test datasets with it.

I must also point out that I've implemented my own kriging functions below which yield errors for a reason different than the ones reported already. These functions do scoring with tensors of shapes

the observed response variables train_response.shape = (N, n, 1)
the augmented response with neighbors augmented_train_response.shape = (N, n, m)
the new response fields of shape test_response.shape = (N, n*, 1)
the augmented new response with neighbors augmented_test_response.shape = (N, n*, m)

The augmentations are the nearest neighbors described as $Y_{c_m(i)}$ in Katzfuss and Schafer (JASA, 2023). These functions work once we precompute the kernels to make the kriging computations. A key difference between these functions and the ones implemented in batram currently are that these do parallelized kriging for $n^{\star}$ new spatial fields all at once.

Notice the _conditional_variance function below diagonalizes the predicted kernel function. This corresponds to making independent predictions. With these kriging functions I don't encounter any numerical issues in the datasets that raised this issue. If I don't diagonalize the conditional variance then the off-diagonal elements are sometimes negative valued with the reference data for this issue.

I believe we want to make independent predictions (using the diagonalized variances implemented below). I am not sure if we want to make the same choice for the mean function or not. @wiep, may we discuss this briefly the next time we meet?

import torch
from gpytorch import kernels

def _conditional_mean(
    chol: torch.Tensor, k1: torch.Tensor, y: torch.Tensor
) -> torch.Tensor:
    """Computes the conditional mean."""
    alpha = torch.linalg.solve_triangular(chol, y, upper=False)
    return k1.mT @ alpha

def _conditional_variance(
    chol: torch.Tensor, k1: torch.Tensor, k2: torch.Tensor
) -> torch.Tensor:
    """Computes the conditional variance."""
    v = torch.linalg.solve_triangular(chol, k1, upper=False)
    cond_var = k2 - v.square().sum(dim=-2, keepdim=True)
    if cond_var.ndim == 2:
        return cond_var.diag().diag()
    else:
        return cond_var.diagonal(dim1=-2, dim2=-1).diag_embed()

def kriging(
    kernel: kernels.Kernel, x: torch.Tensor, y: torch.Tensor, x_star: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    """Computes the conditional mean and variance for kriging.

    Args:
    -----
        kernel: kernel function
        x: training points
        y: training targets
        x_star: test points
    """
    chol = kernel(x, x).cholesky().evaluate()  # type: ignore
    k1 = kernel(x, x_star).evaluate()  # type: ignore
    k2 = kernel(x_star, x_star).evaluate()  # type: ignore

    mean = _conditional_mean(chol, k1, y)
    variance = _conditional_variance(chol, k1, k2)
    return mean, variance

katzfuss-group / batram

Issue in score function for specific dataset #20