stephenslab / mashr

An R package for multivariate adaptive shrinkage.
https://stephenslab.github.io/mashr

eqtl analysis problem - decrease running time and memory burden - running mashr by chromosome #127

Open jke20 opened 3 months ago

jke20 commented 3 months ago

Hi dear authors, thank you so much for developing mashr. Recently I have been applying mashr to my eQTL pipeline outputs to discover tissue-specific and tissue-shared effects (the conditions here are different tissues in the human brain). As you know, there are many gene-variant pairs, and in our study there are over 100 brain tissues. To decrease the computational burden, I wonder if I can run mashr by chromosome using the same covariances (a strong matrix that takes the most significant eQTL from each gene across all chromosomes)? I don't know how the final results would be affected if I did that. Thank you in advance for your help!

pcarbo commented 3 months ago

@jke20 Thanks for your feedback. Could you tell us a little bit more about the inputs you are providing to mash? If I understand correctly, your Bhat is roughly 10,000 x 100 (one row for each gene, one column for each brain tissue)?

jke20 commented 3 months ago

@pcarbo Thank you for the reply. The matrix is 200,000,000 x 100: rows are gene-variant pairs and columns are tissues.

pcarbo commented 3 months ago

@jke20 Potentially you could fit the mash model on a random subset of the gene-variant pairs, then rerun mash a second time with fixg = TRUE on each chromosome, for example; see help(mash) for details.
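The suggested workflow could be sketched as follows. This is only a sketch: Bhat, Shat, chr_index, and the subset size are hypothetical placeholders, not values from this thread; the mashr functions used (mash_set_data, mash, get_fitted_g) are real.

```r
library(mashr)

## Hypothetical inputs: Bhat and Shat are 200,000,000 x 100 matrices of
## effect estimates and standard errors; chr_index gives the chromosome
## of each row.

## Step 1: fit the mash prior once on a random subset of rows.
random_rows <- sample(nrow(Bhat), 200000)  # subset size is an assumption
data.random <- mash_set_data(Bhat[random_rows, ], Shat[random_rows, ])
m <- mash(data.random, Ulist = c(U.ed, U.c), outputlevel = 1)

## Step 2: apply the fitted prior to each chromosome with fixg = TRUE,
## so the (expensive) model fitting is not repeated.
g <- get_fitted_g(m)
results <- lapply(1:22, function(chr) {
  rows <- which(chr_index == chr)
  data.chr <- mash_set_data(Bhat[rows, ], Shat[rows, ])
  mash(data.chr, g = g, fixg = TRUE)
})
```

Because the prior g is fixed, each chromosome can be processed independently (e.g., as separate jobs on a cluster), which addresses both the running time and the memory burden.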

jke20 commented 2 months ago

Hi, thank you very much for the help; I think mashr is running nicely now. Here is a follow-up question. Below I run mashr with two types of covariances:

# data driven covariances
U.pca = cov_pca(data.strong, 5)
U.ed = cov_ed(data.strong, U.pca)
# canonical covariances
U.c = cov_canonical(data.random)
# run mashr for null hypothesis
m = mash(data.random, Ulist = c(U.ed,U.c), outputlevel=1)
# rerun mashr on strong matrix
m2 = mash(data.strong, g=get_fitted_g(m), fixg=TRUE)

I wonder how the results from the above differ from the results if I run mash with only one type of covariance (e.g., m = mash(data.random, Ulist = U.ed, outputlevel=1)). Thanks!

surbut commented 2 months ago

Hi Jianfeng,

Thanks for your questions. If you run with the list of deconvolved covariances (U.ed) as you wrote, after it has been initialized with U.pca and data.strong, cov_ed returns a list of several covariance matrices (depending on how many PCs you initialized with, here 5). That will allow mash to put weight on flexible, data-driven patterns. Whether that is enough, or whether you also need the canonical matrices, depends on your data; you can try both and see which improves the likelihood in a training/testing workflow.
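To make the counts concrete, here is a sketch of what each step contributes; the exact element names are what the mashr vignettes report, but treat them as illustrative rather than guaranteed.

```r
library(mashr)

## Data-driven covariances: cov_pca(data, 5) returns the top-5 rank-1
## PCA matrices plus a combined one (e.g., "tPCA", "PCA_1", ..., "PCA_5"),
## and cov_ed refines that same list by extreme deconvolution.
U.pca <- cov_pca(data.strong, 5)
names(U.pca)
U.ed <- cov_ed(data.strong, U.pca)
length(U.ed)   # same number of data-driven matrices, now refined

## Canonical covariances: fixed patterns such as "no sharing",
## "shared in one tissue only", and "equal effects in all tissues".
U.c <- cov_canonical(data.random)
length(U.c)
```

With c(U.ed, U.c), mash estimates a mixture weight for every matrix (and every scaling of it), so dropping U.c simply removes those canonical patterns from the mixture.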

Thanks for your question. -Sarah


pcarbo commented 2 months ago

Thanks, Sarah.

Just to add to what Sarah said: in general, mash will be faster with fewer matrices, but more matrices give you more flexibility to model different sharing patterns, so there is a tradeoff. In practice, as Sarah said, the data-driven matrices (U.ed) in your code are more adaptable, so Ulist = U.ed could be a convenient (i.e., slightly faster) option.
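One way to check the tradeoff empirically, assuming the objects from the earlier comment (data.random, U.ed, U.c) are available, is to compare the fitted log-likelihoods of the two models; get_loglik is a real mashr accessor.

```r
library(mashr)

## Fit the mash model with data-driven matrices only, and with
## data-driven plus canonical matrices, on the same random subset.
m.ed   <- mash(data.random, Ulist = U.ed, outputlevel = 1)
m.both <- mash(data.random, Ulist = c(U.ed, U.c), outputlevel = 1)

## A (much) higher log-likelihood for m.both suggests the canonical
## matrices are capturing sharing patterns that U.ed alone misses.
get_loglik(m.ed)
get_loglik(m.both)
```

Since the canonical matrices that fit poorly simply receive near-zero mixture weights, including them mostly costs computation time rather than accuracy.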