stephenslab / mashr

An R package for multivariate adaptive shrinkage.
https://stephenslab.github.io/mashr

application to very sparse conditions, need to intersect? #126

Open BradleyH017 opened 4 months ago

BradleyH017 commented 4 months ago

Dear authors,

Firstly, thank you for such a powerful and useful package. I am hoping to use it to identify sharing/specificity of top eQTLs detected across cell types (conditions) from scRNA-seq data. I have a couple of questions about how best to utilise this package in this context.

Following the tutorial closely (https://stephenslab.github.io/mashr/articles/eQTL_outline.html), I initially made use of the FastQTL to mashr input prep workflow (https://github.com/stephenslab/gtexresults/blob/master/workflows/fastqtl_to_mash.ipynb). I realised that there is only a single result (gene x variant) from each gene in the 'strong' set, and so I presume the top effect across conditions has been selected. As this is the case, would you recommend keeping to a single gene x variant combination per gene over including the top association in each condition?

Also, there are no missing values in either the 'strong' or 'random' set, so I presume these sets are limited to tests present in every condition. The intersection of genes tested in every condition is very small (<2k of >20k genes), so I am wondering whether you think taking an intersection is sufficient to accurately capture the strong/random effects in the data?

Alternatively, I had imagined I would have to 'fill' missing tests across conditions (related to another issue, #17), using a beta of 0 (so no magnitude or direction) and an SE of 1 (far larger than the median of 0.16) for missing tests. However, again due to the sparsity, for some tests this means filling the majority of conditions with these values. Is the filling of missing values in this way something you would recommend to this end?

Thanks again!
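To put a number on the sparsity described above, the following is a minimal sketch that compares the intersection and the union of genes tested across conditions. It is written in Python purely for illustration (mashr itself is an R package); the gene and condition names and the `overlap_summary` helper are hypothetical toy inputs, not taken from the issue.

```python
# Hypothetical toy input: the set of genes tested in each condition (cell type).
tested = {
    "B_cell": {"GENE1", "GENE2", "GENE3"},
    "T_cell": {"GENE2", "GENE3", "GENE4"},
    "Monocyte": {"GENE3", "GENE5"},
}


def overlap_summary(tested_by_condition):
    """Return (genes tested in every condition, genes tested in any condition)."""
    gene_sets = list(tested_by_condition.values())
    shared = set.intersection(*gene_sets)
    any_tested = set.union(*gene_sets)
    return shared, any_tested


shared, any_tested = overlap_summary(tested)
print(f"{len(shared)} of {len(any_tested)} genes are tested in every condition")
```

A small ratio here corresponds to the situation in the question, where restricting to the intersection discards most genes.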

pcarbo commented 4 months ago

@BradleyH017 Thank you for the positive feedback.

As this is the case, would you recommend keeping to a single gene x variant combination per gene over including the top association in each condition?

@gaow can correct me if I'm wrong, but I believe you want to use only a single SNP for all conditions. So you would have to come up with some criterion for choosing this SNP; e.g., the SNP with the smallest p-value among all the conditions.

Is the filling of missing values in this way something you would recommend to this end?

This isn't well documented, but mashr can handle limited amounts of missing data by setting Bhat to zero and Shat to be a large value (e.g., 10). However, if there is too much missing data, mashr may fail, and then you may need to remove the conditions that have many missing entries. mashr may also be able to handle a small amount of missing data when estimating the data-driven covariance matrices.
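As a rough illustration of the two steps just described (drop conditions with many missing entries, then fill what remains with Bhat = 0 and a large Shat), here is a minimal Python sketch. It is not mashr code: the matrices are toy data, the helper names are hypothetical, and the 0.5 cutoff for "too much missing data" is an assumption chosen only for the example. The filled matrices would then be passed on to mashr in R.

```python
import math

MISSING_BHAT = 0.0          # no direction or magnitude
MISSING_SHAT = 10.0         # "large" standard error (example value from the comment)
MAX_MISSING_FRACTION = 0.5  # hypothetical cutoff, not a mashr recommendation


def is_missing(x):
    """Treat None and NaN as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def drop_and_fill(bhat, shat, conditions, max_missing=MAX_MISSING_FRACTION):
    """Drop sparse conditions (columns), then fill remaining missing entries."""
    n_rows = len(bhat)
    keep = []
    for j in range(len(conditions)):
        # An entry counts as missing if either the effect or its SE is missing.
        n_missing = sum(
            is_missing(bhat[i][j]) or is_missing(shat[i][j]) for i in range(n_rows)
        )
        if n_missing / n_rows <= max_missing:
            keep.append(j)

    new_bhat, new_shat = [], []
    for b_row, s_row in zip(bhat, shat):
        b_out, s_out = [], []
        for j in keep:
            if is_missing(b_row[j]) or is_missing(s_row[j]):
                b_out.append(MISSING_BHAT)
                s_out.append(MISSING_SHAT)
            else:
                b_out.append(b_row[j])
                s_out.append(s_row[j])
        new_bhat.append(b_out)
        new_shat.append(s_out)
    return new_bhat, new_shat, [conditions[j] for j in keep]


# Toy data: rows are gene-SNP pairs, columns are conditions.
conditions = ["B_cell", "T_cell", "Monocyte"]
bhat = [[0.5, None, None], [-0.2, 0.1, None], [0.3, 0.4, 0.2]]
shat = [[0.1, None, None], [0.2, 0.15, None], [0.1, 0.2, 0.3]]

bhat_filled, shat_filled, kept = drop_and_fill(bhat, shat, conditions)
print(kept)
print(bhat_filled)
print(shat_filled)
```

In this toy example the Monocyte column is missing in two of three rows, so it is dropped, and the one remaining gap in T_cell is filled.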

Also please consider using the improved methods in udr to estimate the data-driven covariance matrices for mashr. Again, udr should be able to handle small amounts of missing data for estimating the data-driven covariance matrices.

Hope this helps.

gaow commented 4 months ago

@gaow can correct me if I'm wrong, but I believe you want to use only a single SNP for all conditions.

Indeed, because we need to ensure the strong SNPs are independent of each other.

We have been trying alternative approaches: first fine-map each condition, take the top SNP from each fine-mapped credible set (CS), put those variants together, and finally do some multivariate LD clumping to keep independent variants. That will give us more strong SNPs to work with. In our snRNA-seq-based studies, this changes the pattern of sharing a bit, but not by much.

BradleyH017 commented 4 months ago

Hi @gaow @pcarbo, thank you very much for your rapid responses.

Indeed, because we need to ensure the strong SNPs are independent of each other.

Okay, this makes sense. In that case, could you confirm how the 'strong' set of variants are selected in the workflow (https://github.com/stephenslab/gtexresults/blob/master/workflows/fastqtl_to_mash.ipynb)? E.g., do these already have LFSR < 0.05?

mashr can handle limited amounts of missing data by setting Bhat to zero and Shat to be a large value (e.g., 10)

Would you also recommend this to be similarly tolerated in the 'random' subset?

Also please consider using the improved methods in udr

I'll dig through, thanks for the recommendation!

Thanks again.

gaow commented 4 months ago

how the 'strong' set of variants are selected in the workflow

we simply picked the row containing the smallest p-value per gene, without assessing its significance.
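A minimal sketch of that selection rule, in Python for illustration only (the actual workflow is the fastqtl_to_mash notebook linked above, and this is not its code). The row layout and the `pick_strong` helper are hypothetical. The sketch ranks by the largest absolute Z = Bhat/Shat across conditions, which matches the smallest p-value under the assumption of a two-sided normal test; no significance threshold is applied.

```python
# Hypothetical rows: (gene, snp, {condition: (bhat, shat)}).
rows = [
    ("GENE1", "rs1", {"B_cell": (0.5, 0.1), "T_cell": (0.1, 0.1)}),
    ("GENE1", "rs2", {"B_cell": (0.2, 0.1), "T_cell": (0.9, 0.1)}),
    ("GENE2", "rs3", {"B_cell": (-0.6, 0.2)}),
    ("GENE2", "rs4", {"B_cell": (0.1, 0.2)}),
]


def max_abs_z(effects):
    """Largest |Z| = |bhat / shat| over the conditions in which the SNP was tested."""
    return max(abs(b / s) for b, s in effects.values())


def pick_strong(rows):
    """Keep one SNP per gene: the one with the largest |Z| in any condition."""
    best = {}
    for gene, snp, effects in rows:
        score = max_abs_z(effects)
        if gene not in best or score > best[gene][1]:
            best[gene] = (snp, score)
    return {gene: snp for gene, (snp, _) in best.items()}


strong = pick_strong(rows)
print(strong)
```

Each gene contributes exactly one row to the 'strong' set, and that same SNP is then used for every condition.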

surbut commented 4 months ago

Hi Bradley,

Thanks for your interest! Yes, in the eQTL case we choose, for each gene, the SNP with the lowest p-value (highest absolute Z statistic) across subgroups.

Hope this helps,
Sarah Urbut

On Jul 31, 2024, at 7:56 AM, gaow wrote:

how the 'strong' set of variants are selected in the workflow

we simply picked the row containing the smallest p-value per gene, without assessing its significance.


pcarbo commented 4 months ago

Would you also recommend this to be similarly tolerated in the 'random' subset?

Yes.

BradleyH017 commented 4 months ago

Great, thank you all for the help and clarification!