stephenslab / mashr

An R package for multivariate adaptive shrinkage.
https://stephenslab.github.io/mashr

Issue with the selection of random subsample #123

Closed AlexSimis closed 4 months ago

AlexSimis commented 5 months ago

Dear authors,

Thank you for developing and maintaining this tool; it is a game-changer for my analysis.

So, the context of my study is two groups of individuals, control and exposed, with 1,000 proteins quantified at two time points (similar to pre-post). Following a context-by-context approach, I have mapped cis protein-QTLs for each of these four contexts (2 time points × 2 groups). My goal is to identify response pQTLs between time points for each group, so I have used mashr to jointly analyze the mapping results, following the eQTL analysis outline that you provide (very helpful).

However, when I experimented with the size of the random subset, I found that the results in terms of response pQTLs showed limited concordance. Specifically, given a selection of strong signals (top SNP per protein), I tried different sizes of random subsets, e.g., 20,000, 200,000, 400,000, to fit the model. Is this something to be expected? And given that there is a discrepancy among the results, do you have any suggestions on how to choose the size of the random subset?
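For concreteness, the workflow described above (per the mashr eQTL analysis outline) can be sketched roughly as follows. This is only a sketch: `simple_sims()` simulated data stands in for the real pQTL summary statistics, and `n_random` is the illustrative subset size being varied in this issue.

```r
# Sketch of the mashr eQTL outline; simple_sims() stands in for
# the real pQTL summary statistics (4 contexts = 2 time points x 2 groups).
library(mashr)

set.seed(1)
sim  <- simple_sims(nsamp = 1000, ncond = 4, err_sd = 0.5)
data <- mash_set_data(sim$Bhat, sim$Shat)

# "Random" subset used to fit the mixture model; n_random is the
# size being varied here (e.g. 20,000 vs 200,000 in the real data).
n_random    <- 1000
random      <- sample(nrow(sim$Bhat), n_random)
data.random <- mash_set_data(sim$Bhat[random, ], sim$Shat[random, ])

# "Strong" subset (condition-by-condition analysis) used to derive
# data-driven covariance matrices.
m.1by1      <- mash_1by1(data)
strong      <- get_significant_results(m.1by1)
data.strong <- mash_set_data(sim$Bhat[strong, ], sim$Shat[strong, ])
U.pca <- cov_pca(data.strong, npc = 2)
U.ed  <- cov_ed(data.strong, U.pca)
U.c   <- cov_canonical(data.random)

# Fit the model on the random subset, then compute posterior summaries
# for the strong subset using the fitted g.
m        <- mash(data.random, Ulist = c(U.ed, U.c), outputlevel = 1)
m.strong <- mash(data.strong, g = get_fitted_g(m), fixg = TRUE)
```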

Thank you, Alex

stephens999 commented 5 months ago

This is not expected, no.

The larger the random subset the better. But results shouldn't really be so sensitive to the size as long as it is large. (The only reason not to use all tests is to reduce computation.)

If you can share your data and code we might be able to take a look to see why there is sensitivity....

pcarbo commented 5 months ago

@AlexSimis When you are saying this,

However, when I experimented with the size of the random subset, I found that the results in terms of response pQTLs showed limited concordance.

Are you referring to the "cov_ed" step? I would expect that at the very least the covariances with the largest weights are similar across cov_ed runs with different subsets — perhaps not the covariances with smaller weights.
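One way to check this is to compare the estimated mixture weights across fits on different random subsets via `get_estimated_pi`. A minimal sketch, again with simulated data standing in for the real pQTL summary statistics:

```r
# Compare the dominant mixture weights across two random subsets
# (simulated data stands in for the real pQTL summary statistics).
library(mashr)

set.seed(1)
sim  <- simple_sims(nsamp = 1000, ncond = 4, err_sd = 0.5)
data <- mash_set_data(sim$Bhat, sim$Shat)
U.c  <- cov_canonical(data)

# Fit the mash model on a given random subset of tests.
fit_on_subset <- function(idx) {
  d <- mash_set_data(sim$Bhat[idx, ], sim$Shat[idx, ])
  mash(d, Ulist = U.c, outputlevel = 1)
}

m1 <- fit_on_subset(sample(nrow(sim$Bhat), 1000))
m2 <- fit_on_subset(sample(nrow(sim$Bhat), 1000))

# The covariances with the largest weights should agree between runs;
# covariances with small weights may differ.
head(sort(get_estimated_pi(m1), decreasing = TRUE))
head(sort(get_estimated_pi(m2), decreasing = TRUE))
```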

AlexSimis commented 5 months ago

Thank you both for your prompt response.

@stephens999 Sure! Should I share a GitHub repository?

@pcarbo I was not specifically referring to the "cov_ed" estimation but to the final pairwise sharing of signals from "get_pairwise_sharing", based on the posterior summaries of the strong data. Then I manually try to isolate the signals that are not shared to pinpoint which are likely to be response pQTLs based on lfsr and the magnitude of change. I should note here that the sharing of signals across contexts is indeed very high (~98% on average), indicating that I am after small effects.

However, I just checked the "cov_ed" output between runs, and the covariances change relatively little. My understanding is that "cov_ed" will be influenced by the different estimated correlations (Vhat) from "estimate_null_correlation_simple", because that estimate depends on the random subset, right?
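The dependence of Vhat on the random subset can be checked directly by estimating it on two different subsets and comparing the matrices. A small sketch (simulated data stands in for the real pQTL summary statistics; subset sizes are illustrative):

```r
# Estimate Vhat on two different random subsets and compare
# (simulated data stands in for the real pQTL summary statistics).
library(mashr)

set.seed(1)
sim <- simple_sims(nsamp = 1000, ncond = 4, err_sd = 0.5)

# Vhat (null correlation estimate) computed from a subset of tests.
vhat_on <- function(idx) {
  d <- mash_set_data(sim$Bhat[idx, ], sim$Shat[idx, ])
  estimate_null_correlation_simple(d)
}

V1 <- vhat_on(sample(nrow(sim$Bhat), 500))
V2 <- vhat_on(sample(nrow(sim$Bhat), 500))

# A rough measure of subset sensitivity: the largest entrywise difference.
max(abs(V1 - V2))
```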

Best, Alex

pcarbo commented 5 months ago

My understanding is that "cov_ed" will be influenced by the different estimated correlations (Vhat) from "estimate_null_correlation_simple", because that estimate depends on the random subset, right?

@AlexSimis Yes, that is true, although the hope is that the estimates of Vhat will be mostly stable to using different random subsets so long as the subset is "large enough".

AlexSimis commented 5 months ago

@pcarbo I'm not sure what a "large enough" random subset would be in my case. I have two runs below comparing the Vhat matrices.

  1. Vhat derived from a random subset of 21 SNPs per protein (~25k SNPs).

     [Screenshot: Vhat matrix, run 1]

  2. Vhat derived from a subset of 200k randomly selected tests.

     [Screenshot: Vhat matrix, run 2]

Both approaches give similar proportions of pairwise sharing of signals (98 to 99%, from 'get_pairwise_sharing'). However, when exploring the non-shared signals, specifically the response QTLs between time points, I found that these fluctuate considerably. For example, in the first approach I discover a response QTL X that is shared between the exposed and control groups, while in the second approach the same QTL is no longer shared but unique to one of the groups.
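The comparison described above can be sketched as below: fit the model on two different random subsets, compute posteriors on the same strong subset, and compare both the overall sharing and the per-signal significant sets. Simulated data again stands in for the real pQTLs, and the subset sizes are illustrative.

```r
# Check how stable get_pairwise_sharing and per-signal calls are to the
# random subset (simulated data stands in for the real pQTLs).
library(mashr)

set.seed(1)
sim  <- simple_sims(nsamp = 1000, ncond = 4, err_sd = 0.5)
data <- mash_set_data(sim$Bhat, sim$Shat)
U.c  <- cov_canonical(data)

# Strong subset shared by both runs.
strong      <- get_significant_results(mash_1by1(data))
data.strong <- mash_set_data(sim$Bhat[strong, ], sim$Shat[strong, ])

# Fit on a random subset, then compute posteriors on the strong subset.
fit_strong <- function(idx) {
  d <- mash_set_data(sim$Bhat[idx, ], sim$Shat[idx, ])
  m <- mash(d, Ulist = U.c, outputlevel = 1)
  mash(data.strong, g = get_fitted_g(m), fixg = TRUE)
}

m1 <- fit_strong(sample(nrow(sim$Bhat), 1000))
m2 <- fit_strong(sample(nrow(sim$Bhat), 1000))

# Overall sharing is typically stable between runs ...
get_pairwise_sharing(m1, factor = 0.5)
get_pairwise_sharing(m2, factor = 0.5)

# ... but individual non-shared calls can flip; compare the significant sets.
setdiff(get_significant_results(m1), get_significant_results(m2))
```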

Here is the link for the repository I have created, containing my code and the mash data object; it would be great if you could take a look.

Thanks again!