stephenslab / mashr

An R package for multivariate adaptive shrinkage.
https://stephenslab.github.io/mashr

Sampling and inspecting posterior #128

Closed: BaptisteSadoughi closed this 2 months ago

BaptisteSadoughi commented 2 months ago

Hello dear MASH team, thank you for developing this pipeline!

I am trying to apply MASH to Bhat and Shat matrices extracted from a differential methylation analysis (MACAU). The matrices are 236,000 features × 8 conditions. I followed the vignettes without apparent issues until https://stephenslab.github.io/mashr/articles/mash_sampling.html. However, when I try to sample from the posterior using mash_compute_posterior_matrices(), the call never seems to finish, even with posterior_samples = 2.

Is there a way to "manually" sample from the posterior, or a way to inspect the m.all object to check what might be going wrong?

At the moment I only get results from get_pairwise_sharing(), but would ideally like to use get_pairwise_sharing_from_samples().

Thank you in advance for your help! Best,

pcarbo commented 2 months ago

Hi @BaptisteSadoughi, I'm not sure mash_compute_posterior_matrices() was designed to handle data sets of your size. One possible approach is to randomly subsample the 236,000 features, call mash_compute_posterior_matrices() on that subsample (this first involves running mash on the subsample with a fixed prior; see for example Issue 127 for details), and repeat this for many (say, hundreds of) subsamples. This gives you a Monte Carlo estimate of the sharing, which is essentially what we are doing already, since we are using random draws from the posterior distribution.
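A rough sketch of that loop (a sketch only; Bhat, Shat, and m are placeholders for your full matrices and an existing mash fit providing the prior, and handling of the error correlation V is omitted for brevity):

```r
library(mashr)
set.seed(1)
n.sub   = 1000                  # features per subsample
n.rep   = 100                   # number of subsamples
sharing = vector("list", n.rep)
for (i in 1:n.rep) {
  rows         = sample(nrow(Bhat), n.sub)
  data.sub     = mash_set_data(Bhat[rows, ], Shat[rows, ])
  m.sub        = mash(data.sub, g = get_fitted_g(m), fixg = TRUE)
  m.sub$result = mash_compute_posterior_matrices(m.sub, data.sub,
                     algorithm.version = "R", posterior_samples = 100)
  sharing[[i]] = get_pairwise_sharing_from_samples(m.sub)
}
## Average the pairwise-sharing estimates over the subsamples:
sharing.avg = Reduce(`+`, sharing) / n.rep
```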

Hope this helps.

BaptisteSadoughi commented 2 months ago

Hi @pcarbo, thanks for your reply. Indeed, subsampling to 1000 × 8 with posterior_samples = 100 finishes. Do you have a guess at the largest number of rows that would be workable?

However, I am not sure at which point it is best to subsample. Say I start from a 100,000 × 8 dataset. I could subsample my original data, which would affect the definition of the strong and random subsets; or I could subsample the data just before passing it to mash(data, g = get_fitted_g(m), fixg = TRUE), but I am not sure whether that would be valid. Do you have a recommendation?

If helpful, here is a code snippet of the workflow, where dataset holds the 100,000 × 8 effect-size and standard-error matrices (matrix_beta and matrix_SE):

```r
data = mash_set_data(dataset$matrix_beta, dataset$matrix_SE) ## could subsample dataset here?
m.1by1 = mash_1by1(data)
strong.subset = get_significant_results(m.1by1, thresh = 0.01)
random.subset = sample(1:nrow(dataset$matrix_beta), 30000)
data.random = mash_set_data(dataset$matrix_beta[random.subset, ],
                            dataset$matrix_SE[random.subset, ])
Vhat = estimate_null_correlation_simple(data.random)
data.random = mash_update_data(data.random, V = Vhat)
data.strong = mash_set_data(dataset$matrix_beta[strong.subset, ],
                            dataset$matrix_SE[strong.subset, ], V = Vhat)
U.pca = cov_pca(data.strong, 5)
U.ed = cov_ed(data.strong, U.pca)
U.c = cov_canonical(data.random)
m = mash(data.random, Ulist = c(U.ed, U.c), outputlevel = 1)
m.all = mash(data, g = get_fitted_g(m), fixg = TRUE) ## could subsample data here?
m.all$result = mash_compute_posterior_matrices(m.all, data,
                   algorithm.version = "R", posterior_samples = 100)
```

Two more questions: help(mash) does not detail the output of mash; is NegativeProb the posterior probability that the effect is negative, i.e., the proportion of posterior mass below zero (which I would expect to be high for a negative estimate and low for a positive one)? And would it be possible to speed up mash_compute_posterior_matrices() by tuning pi_thresh, skipping the computation of lfdr, or using output_posterior_cov = FALSE?

Your help is much appreciated! Best

pcarbo commented 2 months ago

@BaptisteSadoughi I'm not sure I will be able to answer all your questions (I don't have all the answers).

See help(compute_posterior_matrices) for a description of "NegativeProb".
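For the sign interpretation, a quick way to convince yourself is to compare NegativeProb against the true effects in a small simulation (a sketch using simple_sims() from the mashr vignettes):

```r
library(mashr)
set.seed(1)
sim  = simple_sims(500, 5, 1)
data = mash_set_data(sim$Bhat, sim$Shat)
m    = mash(data, cov_canonical(data))
## NegativeProb[i,j] should be near 1 where the true effect sim$B[i,j]
## is strongly negative, and near 0 where it is strongly positive:
m$result$NegativeProb[1:5, ]
sim$B[1:5, ]
```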

Ideally, you would use all of your data in the first call to mash(), the one that fits the prior ("g"). I believe it should be able to handle a data set of your size (although it may not finish quickly). My advice was more about the downstream steps, e.g., mash_compute_posterior_matrices(), which as you observed is very slow.
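Concretely, that ordering might look like this (a sketch reusing the variable names from your snippet; the subsample size here is arbitrary):

```r
## Fit g on the full data first (m, via the snippet above), then
## subsample only for the expensive posterior-sampling step:
sub          = sample(nrow(dataset$matrix_beta), 10000)  # arbitrary size
data.sub     = mash_set_data(dataset$matrix_beta[sub, ],
                             dataset$matrix_SE[sub, ], V = Vhat)
m.sub        = mash(data.sub, g = get_fitted_g(m), fixg = TRUE)
m.sub$result = mash_compute_posterior_matrices(m.sub, data.sub,
                   algorithm.version = "R", posterior_samples = 100)
```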

You could experiment with those options to try to reduce the running time. It may help a bit; I'm sorry, I don't remember off-hand which computations take the most time to run.
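For example (the pi_thresh value here is purely illustrative):

```r
## Illustrative only: raising pi_thresh prunes more low-weight mixture
## components before the posterior computations, which may speed them up:
m.sub$result = mash_compute_posterior_matrices(m.sub, data.sub,
                   algorithm.version = "R", posterior_samples = 100,
                   pi_thresh = 1e-8)
```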

Hope that helps a bit.

BaptisteSadoughi commented 2 months ago

Thanks @pcarbo, that answers the most important part! mash() runs on the full dataset, so I can subsample just before running mash_compute_posterior_matrices().

Best,