GRCCA gives quite different results with original implementation - Githubissues

jameschapman19 / cca_zoo

Canonical Correlation Analysis Zoo: A collection of Regularized, Deep Learning based, Kernel, and Probabilistic methods in a scikit-learn style framework

https://cca-zoo.readthedocs.io/en/latest/

MIT License

193 stars 41 forks source link

GRCCA gives quite different results with original implementation #163

Open JohannesWiesner opened 1 year ago

JohannesWiesner commented 1 year ago

Hi James, similar to #107, I implemented a test to check if cca_zoo.models.GRCCA gives similar results as the original implementation. But the feature weights are quite dissimilar especially for the much smaller y input array:

Here's the testing code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Sanity check if cca-zoo's implementation of GRCCA gives the same result
as the original implementation'

Requires an environment with:
- cca-zoo 
- r-base (conda install -c conda-forge r-base)
- r-devtools (conda install -c conda-forge r-devtools)
- RCCA (installed via: devtools::install_github('https://github.com/ElenaTuzhilina/RCCA')
- rpy2 (conda install -c conda-forge rpy2)

Useful links:
- https://github.com/murraylab/gemmr/blob/master/gemmr/estimators/r_estimators.py
- https://github.com/Teekuningas/sparsecca/blob/master/tests/test_compare_pmd_to_r.py

@author: johannes.wiesner

"""

import numpy as np
rng = np.random.RandomState(42)
from cca_zoo.models import GRCCA
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
import matplotlib.pyplot as plt

###############################################################################
# Get original data  and parameters as in original example
###############################################################################

robjects.r(
'''
library(RCCA) 
data(X)
data(Y)
n.groups = 5
group1 = rep(1:n.groups,rep(ncol(X)/n.groups,n.groups))
group2 = rep(1,ncol(Y))
lambda1 = 1000
lambda2 = 0
mu1 = 100
mu2 = 0
''')

X = np.array(robjects.globalenv['X'])
y = np.array(robjects.globalenv['Y'])
group1 = np.array(robjects.globalenv['group1'])
group2 = np.array(robjects.globalenv['group2'])
lambda1 = robjects.globalenv['lambda1'][0]
lambda2 = robjects.globalenv['lambda2'][0]
mu1 = robjects.globalenv['mu1'][0]
mu2 = robjects.globalenv['mu2'][0]

###############################################################################
# Run original and cca-zoo implementation and return coefficients for X
###############################################################################

def test(X,y,group1,group2,lambda1,lambda2,mu1,mu2,rng):

    # run original implementation
    RCCA = importr('RCCA')
    results_R = RCCA.GRCCA(X=X,
                           Y=y,
                           group1=group1,
                           group2=group2,
                           lambda1=lambda1,
                           lambda2=lambda2,
                           mu1=mu1,
                           mu2=mu2)

    # save original coefficients
    xcoefs = results_R.rx2('x.coefs')
    ycoefs = results_R.rx2('y.coefs')

    # save number of components because in the orignal implementation number
    # of components is always set to maximum using n.comp = min(ncol(X), ncol(Y), nrow(X))
    n_comp = results_R.rx2('n.comp')[0]

    # re-run with cca-zoo implementation
    # TODO: I think original implementation does not standardize?
    estimator = GRCCA(latent_dims=n_comp,
                      c=[lambda1,lambda2],
                      mu=[mu1,mu2],
                      random_state=rng,
                      scale=False,
                      centre=False)

    # in python, things start with 0, not with 1
    feature_groups = [np.int64(group1)-1,np.int64(group2)-1]
    estimator.fit([X,y],feature_groups=feature_groups)

    # save replicated coefficients
    X_weights,y_weights = estimator.weights

    return xcoefs,X_weights,ycoefs,y_weights

xcoefs,X_weights,ycoefs,y_weights = test(X,y,group1,group2,lambda1,lambda2,mu1,mu2,rng)

###############################################################################
# Plot deviations
###############################################################################

for org,rep in [[xcoefs,X_weights],[ycoefs,y_weights]]:

    # plot distribution of deviations (ignore zeros for plotting purposes)
    plt.figure()
    deviations = np.abs(org)-np.abs(rep)
    deviations = deviations.flatten()
    deviations = deviations[deviations != 0]
    plt.hist(deviations)

###############################################################################
# Throw error if not close
###############################################################################

assert(np.allclose(np.abs(xcoefs),np.abs(X_weights)))
assert(np.allclose(np.abs(ycoefs),np.abs(y_weights)))

jameschapman19 commented 1 year ago

Looks like a scaling thing a bit like #162 - the correlations of the weights are >0.90 or higher. I'm pushing an update that will ensure that for cca_zoo models wXXw=n where n is the number of samples.

jameschapman19 commented 1 year ago

putting weights on the same scale

jameschapman19 commented 1 year ago

Would be good to get a bit closer though I'll take another look at the implementation.

jameschapman19 commented 1 year ago

ok think I've found the source of the error. Will let you know the results soon.

jameschapman19 commented 1 year ago

OK the source of this is basically a misalignment between RCCA lambda and cca_zoo c.

lambda gets added to the covariance matrices but c slides between 0 regularisation and 1 maximal regularisation (all ones on the diagonal).

JohannesWiesner commented 1 year ago

Would it make sense to include a testing folder somewhere in cca_zoo that checks original R-implementations against cca_zoos doppelgängers? I noticed that it's actually quite handy to just be able to import R-packages with RCCA = importr('RCCA'). Let me know if you're interested. I don't know anything about CI, but at least some folders for semi-automatic tests maybe?

jameschapman19 commented 1 year ago

Yeah definitely! Will almost certainly add this one in somewhere. Obviously there's some sharp edges (e.g. here where the parameterizations don't quite match) but no harm in having them

jameschapman19 commented 1 year ago

I've used variants of your script to check that the source of the difference is lambda/c and convinced myself.

PRCCA with no regularisation (and fewer features than samples) everything matches

GRCCA with very high regularisation also matches

GRCCA with no regularisation doesn't match but it is because of line 91 in https://github.com/ElenaTuzhilina/RCCA/blob/master/R/GRCCA.R where PRCCA is called with lambda=1 regardless of what the user inputs. This means it's not really possible to match cca_zoo and RCCA up in this instance. It seems a bit odd that they do this I think it's possible its a bug in RCCA.

JohannesWiesner commented 1 year ago

Yeah definitely! Will almost certainly add this one in somewhere. Obviously, there's some sharp edges (e.g. here where the parameterizations don't quite match) but no harm in having them

Let me know if I can help with a PR!

GRCCA with no regularisation doesn't match but it is because of line 91 in https://github.com/ElenaTuzhilina/RCCA/blob/master/R/GRCCA.R where PRCCA is called with lambda=1 regardless of what the user inputs. This means it's not really possible to match cca_zoo and RCCA up in this instance. It seems a bit odd that they do this I think it's possible its a bug in RCCA.

Ah okay. Do you still have to update cca_zoo.models.GRCCA then or is the deviation purely stemming from this bug?