martinjzhang / scDRS

Single-cell disease relevance score (scDRS)
https://martinjzhang.github.io/scDRS/
MIT License
Correcting cell-level p-values for multiple comparisons? #94

Open schroeme opened 3 months ago

schroeme commented 3 months ago

Hi, thanks for a great package! I am working with a brain snRNAseq dataset and have run scDRS to test for the enrichment of MDD, ADHD, ALZ, MS, SCZ, and height GWAS hits (using the MAGMA scores from your original publication). For the cell-level MC p-values, is it appropriate to use a cutoff of 0.05 to say something like, X number of cells were significantly associated with X disease? Or should I be doing a B-H p-value correction based on the number of cells (i.e. total number of p-values computed)?

I also ran the group-level downstream analysis and found that very few cell types were significantly associated with these traits (FDR < 0.1; as plotted here: https://martinjzhang.github.io/scDRS/notebooks/quickstart.html), despite prior studies (including your original paper) showing that many more should be. Any thoughts on this? Is this because of what you noted in the discussion section of the paper: "Second, the fact that scDRS assesses the statistical significance of an individual cell’s association to disease by implicitly comparing it to other cells via matched control genes may reduce power if most cells in the data are truly causal."?

Many thanks, Margaret

HelloWorldLTY commented 3 months ago

That is a really good question. I think whether to apply a correction really depends on the relative cost of false positives and false negatives. B-H correction reduces the false positive rate at the cost of missing some true signals. In this case, I think the cost of missing a truly important cell type for a disease is larger than the cost of accepting a risky one, so I think it is OK to use the current p-value setting. https://stats.libretexts.org/Bookshelves/Applied_Statistics/Biological_Statistics_(McDonald)/06%3A_Multiple_Tests/6.01%3A_Multiple_Comparisons

For the second point, I am considering improving it with more atlas-level datasets 🤔️.

martinjzhang commented 3 months ago

> For the cell-level MC p-values, is it appropriate to use a cutoff of 0.05 to say something like, X number of cells were significantly associated with X disease? Or should I be doing a B-H p-value correction based on the number of cells (i.e. total number of p-values computed)?

I recommend always using FDR control. Detecting cells based on p<0.05 will give you a lot of false positives and is against the statistical principles of hypothesis testing. If it is very underpowered, consider increasing the FDR threshold, e.g., to 0.2.
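To make the difference between a raw p<0.05 cutoff and FDR control concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment applied to one trait's cell-level p-values. The list `cell_pvals` and the helper names `bh_adjust` and `count_significant` are hypothetical and the values are illustrative only; in practice the input would be the per-cell p-value column of the scDRS score output.

```python
def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def count_significant(pvals, fdr=0.1):
    """Count tests whose BH-adjusted p-value falls below the FDR threshold."""
    return sum(q < fdr for q in bh_adjust(pvals))


# Hypothetical cell-level p-values for one trait (illustrative values only).
cell_pvals = [0.0005, 0.004, 0.02, 0.045, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95]

n_raw = sum(p < 0.05 for p in cell_pvals)            # unadjusted cutoff
n_fdr10 = count_significant(cell_pvals, fdr=0.1)     # FDR < 0.1
n_fdr20 = count_significant(cell_pvals, fdr=0.2)     # relaxed FDR < 0.2

print(f"raw p<0.05: {n_raw}, FDR<0.1: {n_fdr10}, FDR<0.2: {n_fdr20}")
```

The adjustment is done separately for each trait over all cells tested. If statsmodels is available, `statsmodels.stats.multitest.multipletests(pvals, method="fdr_bh")` computes the same adjusted values.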

martinjzhang commented 3 months ago

> I also ran the group-level downstream analysis and found that very few cell types were significantly associated (FDR < 0.1; as plotted here: https://martinjzhang.github.io/scDRS/notebooks/quickstart.html) with these traits, despite prior studies (including your original paper), showing that many more should be. Any thoughts on this? Is this because of what you noted in the discussion section of the paper?: "Second, the fact that scDRS assesses the statistical significance of an individual cell’s association to disease by implicitly comparing it to other cells via matched control genes may reduce power if most cells in the data are truly causal."

Yes, this may indeed be the reason scDRS is underpowered here. Again, consider increasing the FDR threshold.

Also, consider imputing the data with MAGIC before applying scDRS, a procedure discussed here: https://github.com/martinjzhang/scDRS/issues/32 This procedure seems to be a good workaround for the low-power issue, as documented in a recent paper: https://www.biorxiv.org/content/10.1101/2024.02.05.579042v1.abstract

Moreover, we are developing a much more powerful version of scDRS, which I hope to share in a few months.