scikit-learn-contrib / scikit-matter

A collection of scikit-learn compatible utilities that implement methods born out of the materials science and chemistry communities
https://scikit-matter.readthedocs.io/en/v0.2.0/
BSD 3-Clause "New" or "Revised" License

PCovR is not centering like PCA #155

Closed agoscinski closed 1 year ago

agoscinski commented 1 year ago
import numpy as np
from skcosmo.decomposition import PCovR
from sklearn.decomposition import PCA

np.random.seed(0)

# shift X away from the origin so it is deliberately not centered
X = np.random.rand(10, 3) + np.array([1, 1, 1])
y = np.random.rand(10, 1)

# transformation on non-centered X
pcovr = PCovR(mixing=1).fit(X, y)
pca = PCA().fit(X)
X_pcovr_t = pcovr.transform(X)
X_pca_t = pca.transform(X)

# transformation on centered X
X -= np.mean(X, axis=0)[None, :]
pcovr = PCovR(mixing=1).fit(X, y)
pca = PCA().fit(X)
X_pcovr_t_c = pcovr.transform(X)
X_pca_t_c = pca.transform(X)

print("Difference in transformation between non-centered and centered data")
print("sklearn PCA", np.linalg.norm(X_pca_t - X_pca_t_c))
print("skcosmo PCovR (mixing=1) = PCA", np.linalg.norm(X_pcovr_t - X_pcovr_t_c))
Out:
Difference in transformation between non-centered and centered data
sklearn PCA 1.3922169130898628e-15
skcosmo PCovR (mixing=1) = PCA 2.3242628802840968

The mean is computed here https://github.com/lab-cosmo/scikit-cosmo/blob/7ef05ebc73e5ef016d53f3aee1069333c07a9933/skcosmo/decomposition/_pcovr.py#L271 but the features are never actually centered with it.

EDIT: this does not affect any of our results in practice, because we always apply the Standardizer beforehand.
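For context, sklearn's PCA is shift-invariant precisely because it stores the training mean and subtracts it again in transform. A minimal sketch checking this (ignoring the whiten option, which is off by default):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((10, 3)) + 1.0  # non-centered data

pca = PCA().fit(X)

# PCA stores the training mean in mean_ ...
assert np.allclose(pca.mean_, X.mean(axis=0))

# ... and transform subtracts it before projecting, which is why the
# projections do not depend on a constant shift of the input.
assert np.allclose(pca.transform(X), (X - pca.mean_) @ pca.components_.T)
```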

rosecers commented 1 year ago

This is intentional behavior: standardizing for PCovR is different from standardizing for PCA, and we do not want automatic standardization.

agoscinski commented 1 year ago

You also need centering for PCovR, otherwise the covariance matrix is wrong. I can see that you don't want to center inside the class, but you should at least warn the user that non-centered data will give wrong results.
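To illustrate why the covariance is wrong without centering: the uncentered second-moment matrix X.T @ X / n and the true covariance differ by the outer product of the mean, so any shift of the data leaks into the "covariance". A small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 3)) + 1.0  # shifted, non-centered features

# second-moment matrix computed without centering
gram = X.T @ X / len(X)

# true (biased) covariance; np.cov centers internally
cov = np.cov(X, rowvar=False, bias=True)

# the two differ exactly by the outer product of the mean
mean = X.mean(axis=0)
assert np.allclose(gram - np.outer(mean, mean), cov)

# for shifted data the difference is large, not a rounding artifact
assert not np.allclose(gram, cov)
```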

Also, the mean is added back in the transform function, so there is an inconsistency here: either remove the mean handling completely, or apply it in both fit and transform.
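The consistent pattern looks like the following sketch (a hypothetical minimal class, not the skcosmo API): the mean learned in fit is stored and the very same mean is subtracted in transform.

```python
import numpy as np

class CenteredTransformer:
    """Hypothetical sketch: center in fit, subtract the same mean in transform."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)  # store the training mean
        # ... fit the actual model on the centered data X - self.mean_ ...
        return self

    def transform(self, X):
        # apply the *same* centering learned in fit
        return X - self.mean_

X = np.random.rand(10, 3) + np.array([1, 1, 1])  # shifted input
Xt = CenteredTransformer().fit(X).transform(X)
# Xt is mean-free regardless of the original shift
```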

agoscinski commented 1 year ago

Solved in #159