stephens999 / ashr

An R package for adaptive shrinkage
GNU General Public License v3.0

ash shrinkage applied to correlation matrix #130

Open dariober opened 2 years ago

dariober commented 2 years ago

Hi- I'd like to ask your advice on the following situation.

I computed a 32x47 matrix of Pearson correlation coefficients and now I would like to apply shrinkage to these coefficients. If it matters, the histogram of coefficients is here:

[image: histogram of the 1504 correlation coefficients]

For each coefficient I calculated the standard error using the formula se = (1-cor^2)/sqrt(N-2).

Is it sensible to apply ash to the vector of 1504 correlations? Is there anything obviously wrong with this? Any thoughts much appreciated!
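In code, what I have in mind is roughly this (a minimal sketch; cormat would be my 32x47 matrix of correlations and N the number of samples behind each coefficient):

library(ashr)

# Minimal sketch: `cormat` is the 32 x 47 matrix of Pearson correlations and
# N the number of samples used for each coefficient.
cor_vec <- as.vector(cormat)                # flatten to the 1504 coefficients
se_vec  <- (1 - cor_vec^2) / sqrt(N - 2)    # the SE formula above
fit <- ash(betahat = cor_vec, sebetahat = se_vec)
cor_shrunk <- get_pm(fit)                   # posterior mean (shrunk) correlations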

willwerscheid commented 2 years ago

Hi @dariober ! This is more or less the idea implemented in Kushal Dey's CorShrink package, which is described here: https://www.biorxiv.org/content/10.1101/368316v2. Please have a look and let us know your thoughts!

stephens999 commented 2 years ago

can i ask why it is a "matrix" of correlation coefficients? It suggests some additional structure that maybe could be exploited by mashr (Urbut et al, 2019)?

dariober commented 2 years ago

@willwerscheid, @stephens999 thanks a lot guys! It looks like I was going to reinvent the wheel, and I feel a bit overwhelmed by options now!

can i ask why it is a "matrix" of correlation coefficients?

Here's more detail: For each of N=27 patients I have gene expression for 32 biomarkers and 47 genes. For each combination of biomarker and gene I calculate the correlation coefficient across patients, so the matrix has dimensions 32 x 47 and each cell is a correlation. All of this is measured in tissue "BM".

In addition, for 15 of these 27 patients I have the same biomarkers and genes measured in tissue "PBMC", so I have another 32 x 47 matrix of correlation coefficients. I was thinking of keeping the data for the two tissues separate, but I'm sure there are better ways of handling it.
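In R terms, each per-tissue matrix is built roughly like this (sketch; biomarker_expr and gene_expr are hypothetical N x 32 and N x 47 expression matrices for one tissue, one row per patient):

# cor(x, y) correlates columns of x with columns of y, giving a 32 x 47 matrix.
cormat <- cor(biomarker_expr, gene_expr, method = "pearson")
N      <- nrow(biomarker_expr)              # 27 for BM, 15 for PBMC
se_mat <- (1 - cormat^2) / sqrt(N - 2)      # per-cell standard errors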


As a side note: It appears CorShrink package has been removed from cran and archived.

stephens999 commented 2 years ago

ok, yes, i would start by applying the corshrink approach to each vector (formed from the matrix for each tissue) in turn.
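the corshrink idea is essentially to Fisher-transform the correlations, shrink the transformed values with ash, and transform back. A rough sketch for one tissue, using ashr directly rather than the CorShrink package (cormat and n are that tissue's correlation matrix and sample size; 1/sqrt(n-3) is the usual asymptotic standard error for Fisher's z):

library(ashr)

# Sketch of the corshrink approach applied to one tissue's correlation matrix.
z      <- atanh(as.vector(cormat))          # Fisher z transform
se_z   <- rep(1 / sqrt(n - 3), length(z))   # asymptotic se of Fisher's z
fit    <- ash(betahat = z, sebetahat = se_z)
shrunk <- matrix(tanh(get_pm(fit)),         # back-transform the posterior means
                 nrow = nrow(cormat), dimnames = dimnames(cormat))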

it could be interesting to look for patterns of "sharing" of effects across the biomarkers, but with data on only 47 genes it could be difficult to get reliable estimates....

If the biomarkers and/or genes are highly correlated with one another you might need to be careful in interpreting results from ashr as it assumes independence of the different measurements you give it.

dariober commented 2 years ago

Hello again - I experimented with the various options and here's what I got. This considers only tissue "BM"; roughly the same applies to PBMC.

Applying CorShrink::CorShrinkVector to the single vector of 1504 correlations results in all of the shrunk correlations being 0. On the one hand, this is not terribly surprising given that all raw correlations fall within the narrow range -0.6 to +0.6. However, I find it strange that they collapse to exactly 0.

My method of using the raw correlations with standard error se = (1-cor^2)/sqrt(N-2) passed to ashr::ash(cor, se) shrinks less aggressively. Panel cor_ash below shows the result (the x-axis is the raw correlation).

[figure: raw vs shrunk correlations; panels cor_ash, cor_mash_ed, cor_mash_canonical]

I also tried mashr in two modes:

  1. Using covariances estimated from the data as follows:

library(data.table)
library(mashr)

# cordat has one row per (biomarker, gene) pair with the Fisher-transformed
# correlation (cor_trans) and its standard error (cor_trans_sd); ts is the tissue.
mashdat <- mash_set_data(
    Bhat= as.matrix(dcast(data= cordat[tissue == ts], biomarker ~ gene_name, value.var= 'cor_trans'), rownames='biomarker'),
    Shat= as.matrix(dcast(data= cordat[tissue == ts], biomarker ~ gene_name, value.var= 'cor_trans_sd'), rownames='biomarker')
    )
U.pca <- cov_pca(mashdat, 5)    # data-driven covariances from the top 5 PCs
U <- cov_ed(mashdat, U.pca)     # refined by extreme deconvolution
m.c <- mash(mashdat, U)         # fit the mash model
pm <- get_pm(m.c)               # posterior means (shrunk effects)

  2. Using canonical covariances:

mashdat <- mash_set_data(
    Bhat= as.matrix(dcast(data= cordat[tissue == ts], biomarker ~ gene_name, value.var= 'cor_trans'), rownames='biomarker'),
    Shat= as.matrix(dcast(data= cordat[tissue == ts], biomarker ~ gene_name, value.var= 'cor_trans_sd'), rownames='biomarker')
    )
U <- cov_canonical(mashdat)     # canonical covariance matrices (identity, singletons, etc.)
m.c <- mash(mashdat, U)
pm <- get_pm(m.c)

In both cases, the input is the Fisher-transformed correlations with standard errors se = sqrt(1/(N - 1) + 2/(N - 1)^2). The results are then back-converted to the -1 to 1 range. Results are in panels cor_mash_ed and cor_mash_canonical.
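For concreteness, the transformation and back-conversion look roughly like this (sketch; m.c as in the code above, r a raw correlation and N the number of patients for the tissue):

cor_trans    <- atanh(r)                        # Fisher transform
cor_trans_sd <- sqrt(1/(N - 1) + 2/(N - 1)^2)   # the SE formula above
# ... fit mashr as above, then map the posterior means back to correlations:
cor_pm <- tanh(get_pm(m.c))                     # back to the -1..1 range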

In the figure, each biomarker is coloured differently (so 32 colours!).

Going by gut feeling, cor_mash_ed seems the most sensible to me since it accounts for the fact that there are 32 groups (biomarkers) and 47 conditions (genes), in contrast to CorShrink and "my" solution, which treat each of the 1504 correlations as a distinct entry. Am I right in this?

cor_mash_canonical looks odd in that correlations for many biomarkers collapse to basically the same value (the horizontal stripes of the same colour).

An awkward feature of using mashr here is that results depend on whether you use the 32x47 matrix or its transpose (47x32). Intuitively the results shouldn't change, right? In practice, though, I didn't see a lot of difference when using the transpose.

Finally, there is the question of whether to use the raw correlations and standard errors or the Fisher-transformed correlations.

Any thoughts are very welcome and thanks again!

stephens999 commented 2 years ago

interesting.

I looked a bit more at the issue of Fisher-transformed vs raw correlations. The issue is that the standard deviation of the correlation coefficient, (1-cor^2)/sqrt(N-2), depends on the true correlation, which of course is unknown. And if the true correlation is null, then plugging the empirical correlation into this formula is going to lead to a systematic underestimate, and hence potentially overstate significance.
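As a back-of-the-envelope check with N = 27 (the r = 0.4 here is just an illustrative observed value):

N <- 27
r <- 0.4                               # a hypothetical observed correlation
se_null   <- 1 / sqrt(N - 2)           # sd of r under the null (r = 0): 0.2
se_plugin <- (1 - r^2) / sqrt(N - 2)   # plug-in estimate: 0.168
c(z_null = r / se_null, z_plugin = r / se_plugin)   # 2.0 vs ~2.4

so the plug-in standard error makes the observation look more significant than it really is under the null.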

For this reason I suspect the transformed approach is preferable. In any case it may explain why the results from the transformed approach give stronger shrinkage. (The reason they get shrunk completely to 0 is probably as follows: if the data are consistent with the global null then the empirical bayes approach may estimate the prior to be a point mass at 0, and therefore shrink everything to 0.)

Applying mashr to the matrix X or its transpose X' does not make the same assumptions, so it won't give the same answer. The mashr assumption is that rows are independent of one another but columns can be correlated, and it estimates those correlations.

In your case I honestly worry about mashr because it really is best suited to "tall thin" matrices (i.e. many independent observations on a few correlated conditions). In typical applications the number of rows is in the many thousands...

dariober commented 2 years ago

Thanks @stephens999 your explanation makes sense.

Now, my next step would be to apply some clustering and heatmap plotting to the correlation matrix, to see if there are patterns of co-variation between genes and biomarkers. It's an approach that I don't particularly endorse, since it makes it very easy to overinterpret streaks originating from random noise. Anyway, I was hoping to get more meaningful results after shrinkage, but if all correlations collapse to zero there is nothing you can do. I'll have a think about how to proceed; if you have any suggestions, please post!
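For what it's worth, the kind of thing I have in mind is roughly this (sketch; pm would be the 32 x 47 matrix of shrunk correlations, e.g. the mashr posterior means above):

# Cluster rows and columns and plot the matrix of shrunk correlations.
heatmap(pm,
        scale = "none",                      # keep values on the correlation scale
        col   = hcl.colors(51, "Blue-Red"))  # diverging palette for +/- correlations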