jameschapman19 / cca_zoo

Canonical Correlation Analysis Zoo: A collection of Regularized, Deep Learning based, Kernel, and Probabilistic methods in a scikit-learn style framework
https://cca-zoo.readthedocs.io/en/latest/
MIT License

Implement reordering function to assess feature significance? #130

Open JohannesWiesner opened 2 years ago

JohannesWiesner commented 2 years ago

In their 2018 paper, Xia et al. implemented a method to match the canonical variates obtained from resampled data sets to those of the original data set, in order to compute p-values for the canonical weights.

Page 12 (Methods Section):

As permutation could induce arbitrary axis rotation, which changes the order of canonical variates, or axis reflection, which causes a sign change for the weights, we matched the canonical variates resulting from permuted data matrices to the ones derived from the original data matrix by comparing the clinical loadings (v) (75. Mišić, B. et al. Network-level structure-function relationships in human neocortex. Cereb. Cortex 26, 3285–96 (2016).)

Here's the code they implemented to achieve this: https://github.com/cedricx/sCCA/blob/d5a2f4cb071bddd3f7d805e02ff27828b8494c66/sCCA/code/final/cca_functions.R#L191

Would it make sense to implement this method in cca-zoo? I am not even sure whether this is a 'good' method, having this issue in mind. But if I understand it correctly, assessing the overall significance of the canonical variates themselves is one thing, and assessing the significance of the feature weights on those variates is another?

JohannesWiesner commented 2 years ago

Implementation in Python:

import numpy as np

# correlation matrix between the columns of X and Y (used to compare the
# original and resampled v weights; the absolute values are taken below)
def cor_matrix(X, Y):
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return np.dot(X.T, Y) / X.shape[0]

def reorderCCA(v_org, v_res, u_org, u_res, cors_org, cors_res):

    # for each original variate, find the resampled variate whose v weights
    # have the highest absolute correlation with it
    cor_matrix_abs = np.abs(cor_matrix(v_org, v_res))
    res_match = np.argmax(cor_matrix_abs, axis=1)
    res_cor = np.amax(cor_matrix_abs, axis=1)
    res_match_count = len(np.unique(res_match))

    # reorder the u and v weights (and the canonical correlations) from the
    # resampled CCA according to the matches
    u_res_ordered = u_res[:, res_match]
    v_res_ordered = v_res[:, res_match]
    cors_final = cors_res[res_match]

    # for each v variate get the 'mean' sign for both original and resampled v
    signs_org = np.sign(np.mean(np.sign(v_org), axis=0))
    signs_res = np.sign(np.mean(np.sign(v_res_ordered), axis=0))

    # element-wise product: -1 for variates whose sign was flipped by resampling
    signs_prod = (signs_org * signs_res).reshape(-1, 1)

    # flip the signs of u and v of the resampled CCA accordingly
    u_final = (u_res_ordered.T * signs_prod).T
    v_final = (v_res_ordered.T * signs_prod).T

    # if the resampled variates cannot be assigned one-to-one to the original
    # variates, return NaNs
    if res_match_count < u_org.shape[1]:
        u_na = np.full(u_org.shape, np.nan)
        v_na = np.full(v_org.shape, np.nan)
        cors_na = np.repeat(np.nan, u_org.shape[1])
        res_match_na = np.repeat(np.nan, u_org.shape[1])
        res_cor_na = np.repeat(np.nan, u_org.shape[1])
        res_one_reorder = [u_na, v_na, cors_na, res_match_na, res_cor_na]
    else:
        res_one_reorder = [u_final, v_final, cors_final, res_match, res_cor]

    return res_one_reorder
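
For context, here is a minimal sketch of how this could be used inside a permutation loop. It uses scikit-learn's CCA purely as a stand-in estimator; the attribute names (x_weights_, y_weights_) and the helper variables are assumptions for illustration, not part of cca-zoo:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
Y = rng.standard_normal((100, 5))
n_components = 3

# original fit
cca = CCA(n_components=n_components).fit(X, Y)
u_org, v_org = cca.x_weights_, cca.y_weights_
Xs, Ys = cca.transform(X, Y)
cors_org = np.array([np.corrcoef(Xs[:, k], Ys[:, k])[0, 1]
                     for k in range(n_components)])

# a single permutation iteration (rows of Y shuffled)
Y_perm = rng.permutation(Y, axis=0)
cca_perm = CCA(n_components=n_components).fit(X, Y_perm)
u_res, v_res = cca_perm.x_weights_, cca_perm.y_weights_
Xp, Yp = cca_perm.transform(X, Y_perm)
cors_res = np.array([np.corrcoef(Xp[:, k], Yp[:, k])[0, 1]
                     for k in range(n_components)])

# match the permuted solution to the original one before collecting
# null-distribution statistics
u_f, v_f, cors_f, match, match_cor = reorderCCA(v_org, v_res, u_org, u_res,
                                                cors_org, cors_res)
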
JohannesWiesner commented 2 years ago

I am not sure whether this function makes sense. Xia et al. refer to this paper (Mišić et al.):

Matching Decompositions From Original and Resampled Matrices The decompositions of both permuted and bootstrap-resampled data matrices are not guaranteed to produce a set of LVs comparable with the ones derived from the original data matrix, because both types of resampling could induce arbitrary axis rotation (a change in the order of LVs) or axis reflection (a sign change for the weights). As a result, Procrustes rotations are used to match the resampled LVs to the original LVs (McIntosh and Lobaugh 2004).

But here, Procrustes rotations are used to match the resampled variates to the original ones. I am not sure whether the function above from Xia et al. actually is a Procrustes rotation?
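
For what it's worth, here is a hedged sketch of the Procrustes-based alternative (my own illustration using SciPy, not the pyls or Xia et al. code): find the orthogonal matrix that best maps the resampled v weights onto the original ones and apply it to both weight matrices. Unlike the matching function above, this rotates the variates rather than merely permuting and sign-flipping them.

import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_match(v_org, v_res, u_res):
    # orthogonal matrix R (rotation and/or reflection) minimizing
    # || v_res @ R - v_org ||_F
    R, _ = orthogonal_procrustes(v_res, v_org)
    # apply the same transformation to both resampled weight matrices
    return u_res @ R, v_res @ R
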

JohannesWiesner commented 2 years ago

I guess the whole issue could also be renamed to: "Implement Procrustes Rotation for resampling purposes (bootstrapping, permutation)".

For completeness:

Here's the implementation from pyls: https://github.com/netneurolab/pypyls/blob/d8a19d564cc5804249527b68c937df3a5fd8c7cc/pyls/compute.py#L240

This package might help: https://procrustes.readthedocs.io/en/latest/

Maybe also this: https://github.com/MaxHalford/prince#generalized-procrustes-analysis-gpa

JohannesWiesner commented 1 year ago

Any update on this? I might start working on this with my colleagues :)

I also posted a question about this on CrossValidated.

jameschapman19 commented 1 year ago

Apologies for never having responded! It must have caught me at a busy time. I haven't done anything on this myself; as ever, I welcome any contributions that are of practical use to people :)

jameschapman19 commented 1 year ago

I've spent some time making sure everything works with scikit-learn's permutation testing, so that might be relevant/helpful.
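
For example, something along these lines should work (a sketch using scikit-learn's own CCA as a stand-in estimator; note the score being permuted here is sklearn's default R² from predict, not the canonical correlation itself):

import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
Y = rng.standard_normal((100, 5))

# permutes the rows of Y and refits under each permutation to build a
# null distribution for the cross-validated score
score, perm_scores, pvalue = permutation_test_score(
    CCA(n_components=1), X, Y, n_permutations=100, cv=5)
print(score, pvalue)
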

JohannesWiesner commented 1 year ago

In a nutshell: if we are only considering one variate, we can simply take the absolute values of the loadings/weights. But once you have more variates, you may end up in situations where the order of the variates differs but the weights are still similar (a similar relationship between the variables when each variate is considered in isolation). In this case, Procrustes rotation may help. I am not sure, however, what statisticians would say about this, because a different order of variates might inherently also be an indicator of poor reliability.
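
A minimal sketch of the single-variate case described above (my own illustration; align_sign is not an existing cca-zoo function):

import numpy as np

def align_sign(w_org, w_res):
    # flip the resampled weight vector if it points in the opposite
    # direction to the original one
    sign = 1.0 if float(w_org @ w_res) >= 0 else -1.0
    return w_res * sign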