diff_mean_test() and group sizes

jhb1980 commented 3 years ago

Hi Christoph,

Absolutely phenomenal tool set, I'm really excited by the possibility of running DE testing on the output of sctransform! Could you comment on the minimum number of cells per group you would consider sufficient for "robust" DE calling between two groups with diff_mean_test()?

ChristophH commented 3 years ago

That's a good question and I've always wanted to formally test this. More cells are always better, but a more nuanced answer takes two more aspects into account:

The expression level of the gene, i.e. mean UMI counts in group 1
The fold change, i.e. log2(mean_in_group1/mean_in_group2)

I've run simulations to get a better idea of how exactly these two factors affect the number of cells needed. This figure sums up the results:

For example, to detect a log2FC of 2 for a gene with mean 0.1 (so going from 0.1 to 0.4, bottom row, third panel), you would need about 100 cells per group. A decrease (negative log2FC, panel above) would be much harder to detect: about 80% recovery with 200 cells per group when going from 0.1 to 0.025.
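To make the arithmetic behind those examples explicit, here is a minimal sketch. It reads the examples as log2FC = log2(changed mean / baseline mean), which is the sign convention the two worked numbers imply. The function names `changed_mean` and `log2_fold_change` are hypothetical and not part of sctransform.

```python
import math


def changed_mean(baseline_mean, log2fc):
    """Mean after applying a log2 fold change to a baseline mean."""
    return baseline_mean * 2.0 ** log2fc


def log2_fold_change(baseline_mean, new_mean):
    """log2 fold change when going from baseline_mean to new_mean.

    Assumed convention: positive when expression goes up.
    """
    return math.log2(new_mean / baseline_mean)


# The two cases from the text:
up = changed_mean(0.1, 2)      # increase, about 0.4
down = changed_mean(0.1, -2)   # decrease, about 0.025
```

Note that the increase changes the mean by 0.3 UMI while the decrease of the same magnitude changes it by only 0.075, which is one way to see why the decrease is harder to detect at a given number of cells.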

On the other hand, if a gene is essentially absent from one group and goes up to medium-high in the other (say from 0.001 to 1), even 20 cells will be sufficient.
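For readers who want to try this kind of experiment themselves, the following is a rough sketch of a recovery simulation. It is not the notebook behind the figure and not the diff_mean_test() implementation. It assumes Poisson-distributed counts, balanced groups, and a plain label-permutation test on the difference in arithmetic means with an uncorrected threshold of 0.05. All function names and default parameters are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson variate (Knuth's method, adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def permutation_pvalue(group_a, group_b, n_perm, rng):
    """Two-sided empirical p-value for the difference in group means."""
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs((total - sum_a) / n_b - sum_a / n_a)

    observed = abs_mean_diff(sum(group_a))
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign n_a of the pooled cells to group A.
        perm_sum_a = sum(rng.sample(pooled, n_a))
        if abs_mean_diff(perm_sum_a) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)


def recovery_rate(baseline_mean, log2fc, n_cells,
                  n_sim=100, n_perm=199, alpha=0.05, seed=1):
    """Fraction of simulated genes whose change is called at level alpha."""
    rng = random.Random(seed)
    new_mean = baseline_mean * 2.0 ** log2fc
    detected = 0
    for _ in range(n_sim):
        group_a = [poisson(rng, baseline_mean) for _ in range(n_cells)]
        group_b = [poisson(rng, new_mean) for _ in range(n_cells)]
        if permutation_pvalue(group_a, group_b, n_perm, rng) < alpha:
            detected += 1
    return detected / n_sim
```

Calling, for instance, `recovery_rate(0.1, 2, 100)` would estimate recovery for the 0.1 to 0.4 case at 100 cells per group. The values it returns depend on the Poisson assumption and the stand-in test, so they need not match the figure.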

Notes regarding these results:

No p-value correction for multiple testing
Not looking at false positives - this is not telling us anything about specificity overall
Assuming balanced group sizes
Assuming corrected counts are used, i.e. no additional variability in the counts due to sequencing depth

R notebook

jhb1980 commented 3 years ago

Brilliant, thank you Christoph!

satijalab / sctransform

diff_mean_test() and group sizes #98