
A Bug in the `inverse_transform` of `KernelPCA` #18902

Closed kstoneriv3 closed 3 years ago

kstoneriv3 commented 3 years ago

I have a question about the following line of KernelPCA. It appears in the reprojection of the sample points, but to me it seems unnecessary. Here, kernel ridge regression is used to (approximately) reproject the transformed samples (i.e., the coefficients of the principal components) using precomputed dual coefficients of the kernel ridge regression. Although we need to add alpha to the diagonal elements of the kernel matrix when computing the dual coefficients (as in `_fit_inverse_transform`), we do not need to add alpha to the kernel at prediction time, as far as I understand.

https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/decomposition/_kernel_pca.py#L362

Is this line really necessary? If so, I would appreciate it if someone could explain why this line is needed.
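
For reference, here is a minimal sketch of the kernel ridge pre-image map as I understand it (the function names are illustrative and the rbf kernel is just a placeholder, not something KernelPCA hardcodes): alpha regularizes the kernel only when solving for the dual coefficients, and the prediction uses the plain cross-kernel, just as `sklearn.kernel_ridge.KernelRidge` does.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def fit_inverse_map(X_transformed, X, alpha=1.0):
    # Regularize only the training kernel when solving for the dual coefficients.
    K = rbf_kernel(X_transformed)
    K[np.diag_indices_from(K)] += alpha
    return np.linalg.solve(K, X)  # dual coefficients, shape (n_samples, n_features)

def apply_inverse_map(X_new_transformed, X_transformed_fit, dual_coef):
    # Prediction: plain cross-kernel times the dual coefficients; no alpha term here.
    K = rbf_kernel(X_new_transformed, X_transformed_fit)  # rectangular in general
    return np.dot(K, dual_coef)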

kstoneriv3 commented 3 years ago

I found that this line was added in #16655, but I still do not see the point of it. @lrjball Could you explain why you added this?

Now I am concerned about the test code below as well. Since the inversion using kernel ridge regression is only an approximation, the reconstructed input is not necessarily equal to the original input. So I suggest reverting #16655.
https://github.com/scikit-learn/scikit-learn/blob/63d2bbf1ef187d026641514cf511648cedf94701/sklearn/decomposition/tests/test_kernel_pca.py#L291

kstoneriv3 commented 3 years ago

For clarification, let me mention some theoretical points. Let X be the input matrix of shape [n_samples, n_features] and X_tr be the transformed input matrix of shape [n_samples, n_components]; X_tr contains the coefficients of the principal components obtained by kernel PCA. Then the proper formula for predicting the input X from the transformed input X_tr using kernel ridge regression is

$$\hat{y}^{\text{new}} = K(X^{\text{new}}_{\text{tr}}, X_{\text{tr}}) \, \big(K(X_{\text{tr}}, X_{\text{tr}}) + \alpha I\big)^{-1} X.$$

However, #16655 changed this to

$$\hat{y}^{\text{new}} = \big(K(X^{\text{new}}_{\text{tr}}, X_{\text{tr}}) + \alpha I\big) \, \big(K(X_{\text{tr}}, X_{\text{tr}}) + \alpha I\big)^{-1} X$$

(which is only well-defined when X and X^{new} have the same number of samples). Although this makes the reconstruction of a training sample match the sample exactly, it is not the proper formula. Additionally, if the number of samples differs between the training data of the kernel ridge regression and the test data, `K.flat[::n_samples + 1] += self.alpha` no longer adds alpha to the diagonal elements.
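
To see the last point concretely, here is a small numpy illustration (the shapes are arbitrary): striding a rectangular matrix with `n_samples + 1` does not walk down the diagonal.

import numpy as np

# When the test and training sample counts differ, K is rectangular and the
# stride n_samples + 1 no longer hits the diagonal entries.
n_samples, n_fit_samples = 3, 5
K = np.zeros((n_samples, n_fit_samples))
K.flat[::n_samples + 1] += 1.0
print(K)
# [[1. 0. 0. 0. 1.]
#  [0. 0. 0. 1. 0.]
#  [0. 0. 1. 0. 0.]]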

lrjball commented 3 years ago

Hey @kstoneriv3, good spot! You are right - the kernel being rectangular is a giveaway that something is not right here. I do think the original bug needs a fix though, although the fix may be to add the mean back on somewhere (I seem to remember going down this route first), since previously the inverse transform always returned data with zero mean. For example, on scikit-learn==0.22:

from sklearn.datasets import make_blobs
from sklearn.decomposition import KernelPCA
import matplotlib.pyplot as plt

# Fit on 90 points, then transform and inverse-transform 10 held-out points.
X, _ = make_blobs(n_samples=100, centers=[[1, 1]], random_state=0)
X_train, X_test = X[:90], X[90:]
kp = KernelPCA(n_components=2, fit_inverse_transform=True)
kp.fit(X_train)
X_trans = kp.transform(X_test)
X_inv = kp.inverse_transform(X_trans)

# Plot the test points, their reconstructions, and the offset between each pair.
plt.scatter(*zip(*X_test))
plt.scatter(*zip(*X_inv))
for x_inv, x_test in zip(X_inv, X_test):
    plt.plot([x_inv[0], x_test[0]], [x_inv[1], x_test[1]], 'k--', alpha=0.5)
plt.show()

[figure kernelpca_bug: the inverse-transformed points are visibly offset from the original test points]

I can take a look at this unless you want to have a go?

kstoneriv3 commented 3 years ago

@lrjball Thank you so much for taking the trouble to provide a clear example! I now see that your fix does reduce the reconstruction error. However, at the moment I do not see why it works so well. I will think about it for a while and get back to you soon.

kstoneriv3 commented 3 years ago

I haven't figured out the cause of the zero-mean issue yet, but I don't think #16655 is a theoretically proper way to fix it.

Theoretically, #16655 amounts to appending a scaled one-hot vector for each training sample to the feature space. If we write the original kernel feature of data x_i as φ_i, it is changed to φ^{new}_i = (φ_i, √α e_i), so that (φ^{new}_i)^T (φ^{new}_j) = k(x_i, x_j) + α δ_{ij}, where δ_{ij} is the Kronecker delta.

Using this idea, you could also fix the zero-mean issue by assigning a randomly picked one-hot vector e_i to each transformed input in KernelPCA.inverse_transform, as follows. But I believe this is not a preferable way of fixing it.

def inverse_transform(self, X):
    K = self._get_kernel(X, self.X_transformed_fit_)
    n_X_fit = self.X_transformed_fit_.shape[0]
    n_X = X.shape[0]
    # Randomly assign a one-hot vector to each row (np is numpy).
    K[np.arange(n_X), np.random.randint(0, n_X_fit, n_X)] += self.alpha
    return np.dot(K, self.dual_coef_)

Anyway, I will get back to you again once I see where the zero-mean issue comes from.

kstoneriv3 commented 3 years ago

Finally, I figured out the cause of this issue. In fact, it is not a bug that KernelPCA.inverse_transform returns a zero-mean reconstruction when the kernel is set to "linear"; it is expected behavior.

This problem happens only when the linear kernel is used. Indeed, if you use KernelPCA(kernel="rbf"), the zero-mean problem does not appear.

When the linear kernel is used, the model is fitted to shifted data whose mean is zero. (This shifting happens in K = self._centerer.fit_transform(K) in KernelPCA._fit_transform.) In the reconstruction stage, the ridge regression has to find the map from X_transformed to X, but X_transformed only contains information about the shifted, zero-mean data, so there is no way to reconstruct the mean of X.

So a good way to address this issue would probably be to add a mean shift when the linear kernel is used, for example:

def inverse_transform(self, X, mean_shift=True):
    K = self._get_kernel(X, self.X_transformed_fit_)
    X_reconstructed = np.dot(K, self.dual_coef_)

    # Apply a mean shift to keep compatibility with `sklearn.decomposition.PCA`.
    if self.kernel == "linear" and mean_shift:
        X_reconstructed += self.X_fit_.mean(axis=0, keepdims=True)

    return X_reconstructed

@lrjball Do you agree with this explanation? If you do, I will make a PR.

lrjball commented 3 years ago

Very interesting. So what is it about the linear kernel that is so special? Is it just that the method is inherently an approximation but can be made more precise in this special case? Because all of the kernels are zero-meaned in the same way, so without digging into it I am missing why the linear kernel is the only one that breaks.

Also, it is worth bearing in mind that kernel='poly', degree=1, coef0=0 is equivalent to the linear kernel, so it should probably be handled in the same way whatever happens.
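
(A quick check of that equivalence with scikit-learn's pairwise kernel helpers; gamma is set to 1 explicitly here because its default depends on the number of features:)

import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel

X = np.random.default_rng(0).normal(size=(10, 4))
# polynomial_kernel computes (gamma * X @ X.T + coef0) ** degree, which reduces
# to the plain Gram matrix X @ X.T for degree=1, gamma=1, coef0=0.
print(np.allclose(linear_kernel(X), polynomial_kernel(X, degree=1, gamma=1.0, coef0=0)))  # True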

kstoneriv3 commented 3 years ago

@lrjball

So what is it about the linear kernel that is so special? ... Because all of the kernels are zero-meaned in the same way, so without digging into it I am missing why the linear kernel is the only one that breaks.

If we used a linear kernel without centering, and if n_components >= n_features, the reconstruction would be exact.

The centering operation for the linear kernel changes the original kernel matrix X X^T to (X - 1μ)(X - 1μ)^T, where X has shape [n_samples, n_features], μ is the mean row vector of X, and 1 is a column vector whose elements are all one. Thus, the information about the feature mean μ is lost by centering, because one cannot recover μ from the value of (X - 1μ)(X - 1μ)^T alone.

This is also the case for non-linear kernels, but there we lose the information about the mean feature (1/N) Σ_n φ(x_n). In special cases like the linear kernel, this mean feature coincides with the sample mean of X.
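
As a small numerical check of this (illustrative only; KernelPCA performs the centering via KernelCenterer, which is what the self._centerer line mentioned above uses): centering the linear kernel matrix yields exactly the Gram matrix of the mean-centered data, from which the mean cannot be recovered.

import numpy as np
from sklearn.preprocessing import KernelCenterer

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)) + 5.0           # data with a clearly nonzero mean
K = X @ X.T                                  # linear kernel matrix
K_centered = KernelCenterer().fit_transform(K)

# The centered kernel equals the Gram matrix of the mean-centered data.
Xc = X - X.mean(axis=0)
print(np.allclose(K_centered, Xc @ Xc.T))    # True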

Is it just that the method is inherently an approximation but can be made more precise in the special case?

I would say the method is always inexact in the sense that it always loses the information about the feature mean. However, if we did away with centering, the method could be exact in the case n_components == n_samples.

Also, it is worth bearing in mind that kernel='poly', degree=1, coef0=0 is equivalent to the linear kernel, so it should probably be handled in the same way whatever happens.

Good point! I totally agree.

lrjball commented 3 years ago

@kstoneriv3 Makes sense. Well, if you think you have identified the fix in the code, then I think a PR would be a good idea. I'm sure others will be able to give you good feedback on it after that.

Quarlos commented 11 months ago

@kstoneriv3 @lrjball

Both in the examples above and in the _kernel_pca.py code, I see expressions of the form: K = self._get_kernel(X_transformed)

where X_transformed is the projection of the fitted data onto the kernel principal components. These are the coordinates of the projected vector in the higher-dimensional feature space. However, the kernel is not defined for vectors in feature space but for vectors in input space. Even though we can write the kernel as k(x_i, x_j) = <phi(x_i), phi(x_j)>

for some feature map phi, the feature map is generally not available and, indeed, the only thing the code knows about is k as a function of the input vectors x_i, x_j.

I get that _get_kernel might still work even if we pass it vectors from the wrong space, particularly if it only depends on the Euclidean norm of the vectors or their inner product, but if we use a kernel that explicitly depends on the dimension of the input space (like one based on the Mahalanobis distance), then it becomes clear this won't work.

Could someone please clarify why it makes sense here to compute the kernel on transformed/projected vectors in feature space?

kstoneriv3 commented 11 months ago

@Quarlos Hi, I did not have time to check the code and the references, so I may be wrong, but as far as I remember it is just a rough approximation. Kernel PCA does not have a natural analogue of the projection step in linear PCA, and this reconstruction technique does not have an equally solid theoretical background. Here, we use kernel ridge regression with multi-dimensional outputs, but in principle we could use any supervised learning method to learn the mapping from the low-dimensional representation obtained by kernel PCA back to the original representation (i.e., the pre-image). The idea was introduced in the following paper, which may give more intuition on how it works.

Bakır, G. H., Weston, J., & Schölkopf, B. (2004). Learning to find pre-images. Advances in neural information processing systems, 16, 449-456.
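
To make the "any supervised learning method" point concrete, here is a rough sketch that learns the pre-image map with scikit-learn's off-the-shelf KernelRidge. This is not what KernelPCA does internally, and the kernel choice and hyperparameters here are arbitrary.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import KernelPCA
from sklearn.kernel_ridge import KernelRidge

X, _ = make_blobs(n_samples=200, random_state=0)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)  # low-dimensional representation

# Any multi-output regressor could be plugged in here; kernel ridge follows the paper.
preimage_model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1)
preimage_model.fit(Z, X)
X_reconstructed = preimage_model.predict(Z)
print(np.mean((X - X_reconstructed) ** 2))  # the reconstruction is only approximate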

Quarlos commented 11 months ago

Thank you @kstoneriv3. Yes, the solution detailed in the paper (and referenced in the code) is indeed an approximation. It's actually very neat: all it uses is linear algebra, which bypasses many of the problems we would face doing this by gradient descent.

As far as I understand the paper, it seems to me that the sklearn implementation is doing something different. This is the relevant part of the implementation in sklearn:

def _fit_inverse_transform(self, X_transformed, X):
    n_samples = X_transformed.shape[0]
    K = self._get_kernel(X_transformed)
    K.flat[:: n_samples + 1] += self.alpha
    self.dual_coef_ = linalg.solve(K, X, assume_a="pos", overwrite_a=True)
    self.X_transformed_fit_ = X_transformed

def inverse_transform(self, X):
    K = self._get_kernel(X, self.X_transformed_fit_)
    return np.dot(K, self.dual_coef_)

where X_transformed is the projected point in the higher-dimensional space (used for learning the map). What this seems to want to do is:

  1. learn coefficients as w = (K + aI)^{-1} X
  2. … I can't actually make sense of the second step

What is strange to me is how K is computed in both 1 and 2:

1. K = self._get_kernel(X_transformed)
2. K = self._get_kernel(X, self.X_transformed_fit_)

X_transformed is a point in the higher-dimensional space (likewise for X in 2.). But the kernel K is not defined there in this way. Let me write an example, using the rbf kernel on a d-dimensional input space:

  (x, y) in R^d x R^d -> k(x, y) = exp(-gamma * ||x - y||^2)

where ||x - y||^2 is the squared Euclidean norm in R^d. We can write a feature map phi for this kernel, but the whole point of the kernel is that we don't need to know it. X_transformed is essentially the decomposition of phi(X) in the eigenbasis of the covariance matrix in feature space (for X a point in input space). So when you call self._get_kernel(X_transformed) with rbf you are essentially computing:

   exp(-gamma * ||phi(x) - phi(y)||^2)

Of course, this still works because phi(x) also has a Euclidean norm, but I don't think it is correct. When the kernel is given in this way, it is only defined on input vectors. Yes, underneath the kernel we have:

   k(x, y) = <phi(x), phi(y)>_H

but that inner product is an inner product on the RKHS H, and that is not trivial to compute.

Based on the paper, I feel the right implementation is:

def _fit_inverse_transform(self, X_transformed, X):
    n_samples = X.shape[0]
    K = self._get_kernel(X)
    K2 = np.matmul(K.T, K)  # K^T K
    K2.flat[:: n_samples + 1] += self.alpha
    b = linalg.solve(K2, np.matmul(K, X), assume_a="pos", overwrite_a=True)
    self.dual_coef_ = np.matmul(linalg.pinv(X_transformed), np.matmul(K, b))
    self.X_transformed_fit_ = X_transformed

def inverse_transform(self, X):
    return np.matmul(X, self.dual_coef_)

  1. learn coefficients as: beta = (K^T K + a I)^{-1} K X -> w = pinv(phi(X)) K beta (with phi(X) = X_transformed)
  2. get the approximate inverse image as X w (X here is any point in feature space)

Let me know if it makes sense.

kstoneriv3 commented 11 months ago

I am not sure I fully understood your point, but I think it is useful to point out the following:

Quarlos commented 11 months ago

The choice of the kernel for learning the inverse map could be different from the kernel you use in the KPCA! This is what I was missing, thanks! The sklearn implementation now makes sense to me. What was confusing me is that the same kernel is used for two different things: (1) the KPCA itself and (2) learning the pre-image.

For the predefined kernels you will never encounter any issues using the same kernel. But I was trying a callable kernel of the form

    exp(-1/2 (x - y)^T V^{-1} (x - y))

where V is the covariance matrix of the input data. V gets hardcoded, and V^{-1} (x - y) only works if V^{-1} and (x - y) have matching dimensions. When this kernel gets called for the pre-image learning, as per the implementation, V is still the covariance matrix of the input data, but the x's are now objects living in a space of non-matching dimension. The code breaks.

One way to make this work would be for KernelPCA to accept two kernels for these two different purposes. I've extended my local code to do that and everything works smoothly.
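
For illustration, a sketch of the kind of callable kernel that breaks (the kernel function and data here are made up; the point is only the dimension mismatch described above):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import KernelPCA

X, _ = make_blobs(n_samples=50, n_features=5, random_state=0)
V_inv = np.linalg.inv(np.cov(X, rowvar=False))  # 5x5, tied to the input dimension

def mahalanobis_kernel(x, y):
    d = x - y
    return np.exp(-0.5 * d @ V_inv @ d)

kp = KernelPCA(n_components=2, kernel=mahalanobis_kernel, fit_inverse_transform=True)
# Raises a dimension-mismatch error: fitting the inverse map evaluates the same
# kernel on the 2-dimensional transformed points, while V_inv is 5x5.
kp.fit(X)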

kstoneriv3 commented 11 months ago

Ah, yes. Use of a custom kernel with inverse_transform is indeed a corner case the current code fails to cover!