jameschapman19 / cca_zoo

Canonical Correlation Analysis Zoo: A collection of Regularized, Deep Learning based, Kernel, and Probabilistic methods in a scikit-learn style framework
https://cca-zoo.readthedocs.io/en/latest/
MIT License

Is averaged pairwise correlation the first principle to construct MCCA indices? #190

Open WantongLi123 opened 10 months ago

WantongLi123 commented 10 months ago

I have 3 data views, each with more than 10 variables, and I use the MCCA function.

I checked the averaged pairwise correlation and explained covariance of the constructed indices. The 2nd pair of indices has the largest explained covariance, and the 3rd pair of indices has the largest averaged pairwise correlation.

I'm wondering: in principle, shouldn't CCA construct the pair with the largest averaged pairwise correlation and explained covariance first? Do you maybe know why this is not the case? Thanks in advance!

Best, Wantong


I also posted this question here: https://stackoverflow.com/posts/77723778/edit but I don't have enough reputation points to create a cca-zoo tag. Sorry for cross-posting, but I was afraid my previous post would not notify you.

jameschapman19 commented 10 months ago

Trying to understand this question but struggling a bit (perhaps a language barrier or perhaps terminology).

In principle, the first dimensions of the learnt representations of X, Y, Z (call them Zx, Zy, Zz) will have the highest average pairwise correlation (i.e. the average correlation of Zx[:,0] with Zy[:,0], Zx[:,0] with Zz[:,0], and Zy[:,0] with Zz[:,0]), the second dimensions will have the second highest (i.e. the average correlation of Zx[:,1] with Zy[:,1], Zx[:,1] with Zz[:,1], and Zy[:,1] with Zz[:,1]), etc.
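For concreteness, here is a small helper (plain NumPy, not a cca_zoo function) that computes that average pairwise correlation for one dimension of a list of view representations such as Zx, Zy, Zz:

```python
import numpy as np

def average_pairwise_correlation(reps, dim):
    """Average Pearson correlation of column `dim` across all pairs of views.

    `reps` is a list of (n_samples, n_dims) arrays, e.g. [Zx, Zy, Zz]
    as returned by transforming each view.
    """
    cols = [Z[:, dim] for Z in reps]
    k = len(cols)
    corrs = [np.corrcoef(cols[i], cols[j])[0, 1]
             for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(corrs))
```

Applied to in-sample MCCA representations, this value should be (weakly) decreasing as `dim` increases.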

The package tests for that principle. If that isn't the case for your data let me know and if possible share the data and I'd be happy/curious to see what's going on.

Explained covariance is something I introduced here to understand the nature of correlated signals. High correlation + high covariance will generally be more robust than high correlation + low covariance. But (M)CCA optimises for correlation, not covariance, so the first dimension may have a higher correlation and lower covariance than the second dimension. In test data (out of sample), it might even be the case that the first dimension has lower correlation.
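A toy illustration of that distinction (plain NumPy, made-up variable names): a pair of small signals can be almost perfectly correlated while having tiny covariance, while a pair of large signals can have large covariance but lower correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
base = rng.standard_normal(n)

# small but almost perfectly correlated pair (tiny covariance)
a1 = 0.01 * base
b1 = 0.01 * base + 1e-4 * rng.standard_normal(n)

# large but noisier pair (big covariance, lower correlation)
a2 = base + 0.5 * rng.standard_normal(n)
b2 = base + 0.5 * rng.standard_normal(n)

corr_small = np.corrcoef(a1, b1)[0, 1]   # ~0.9999
corr_big = np.corrcoef(a2, b2)[0, 1]     # ~0.8
cov_small = np.cov(a1, b1)[0, 1]         # ~1e-4
cov_big = np.cov(a2, b2)[0, 1]           # ~1
```

CCA would rank the first pair above the second, even though the second carries far more (co)variance.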

jameschapman19 commented 10 months ago

I'd be most immediately surprised if the first dimension in your training data did not have the highest average correlation. The other observations with respect to explained covariance are entirely possible because a signal can be small but highly correlated versus big and less perfectly correlated.

WantongLi123 commented 10 months ago

Hi James, thanks for your reply and for confirming that the highest average pairwise correlation is the first principle used to construct CCA indices.

I just tested removing a variable from the X group, because that variable is generated using the same model framework as another variable in the Y group. Now the pairwise correlations of the CCA indices make sense again (the first pair has the highest correlation).

I guess the high dependence between two variables in two groups could somehow bias the CCA algorithm?

jameschapman19 commented 10 months ago

Yeah I don't think it's biasing the algorithm so much as it perhaps makes some part of the solving process unstable.

The default behaviour is to apply PCA to the data, run MCCA by solving a generalized eigenvalue problem on the principal components, and then "undo" the PCA (this process has some nice properties and is mathematically equivalent).
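Not cca_zoo's actual implementation, but a minimal NumPy/SciPy sketch of the generalized eigenvalue formulation on three toy views (skipping the PCA pre/post steps): solving C w = λ D w, where C is the covariance of the stacked views and D its block diagonal, and taking eigenvectors in decreasing eigenvalue order, should put the highest average pairwise correlation in the first dimension:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, p, m = 500, 4, 3  # samples, variables per view, number of views

# three views sharing one latent signal, plus independent noise
s = rng.standard_normal((n, 1))
views = [s @ rng.standard_normal((1, p)) + 0.5 * rng.standard_normal((n, p))
         for _ in range(m)]
views = [v - v.mean(axis=0) for v in views]

# C: full covariance of the stacked views; D: its block diagonal
X = np.hstack(views)
C = X.T @ X / n
D = np.zeros_like(C)
for i in range(m):
    blk = slice(i * p, (i + 1) * p)
    D[blk, blk] = C[blk, blk]
D += 1e-6 * np.eye(D.shape[0])  # small ridge so D is positive definite

# generalized symmetric eigenproblem; sort eigenvectors by decreasing eigenvalue
vals, vecs = eigh(C, D)
W = vecs[:, np.argsort(vals)[::-1]]  # stacked per-view weight vectors

def avg_pairwise_corr(dim):
    """Average correlation of component `dim` over all view pairs."""
    z = [views[i] @ W[i * p:(i + 1) * p, dim] for i in range(m)]
    cs = [np.corrcoef(z[i], z[j])[0, 1]
          for i in range(m) for j in range(i + 1, m)]
    return float(np.mean(cs))

print(avg_pairwise_corr(0), avg_pairwise_corr(1))
```

On this well-behaved toy data the first dimension's average correlation comes out clearly above the second's; near-duplicate variables across views make C and D ill-conditioned, which is the kind of instability that can scramble that ordering.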

It is possible that running MCCA(pca=False) on your original variables might also make the first dimension have the highest correlation - not for any particular reason, just because the solver might like it better.

jameschapman19 commented 10 months ago

Also, just checking that we are referring to training/in-sample correlations as opposed to testing/out-of-sample correlations.

anything is possible out of sample!

WantongLi123 commented 10 months ago

:) thanks for the reminder!