satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

adjusted p-values in FindAllMarkers are incorrect #3384

Closed jan-glx closed 3 years ago

jan-glx commented 4 years ago

FindAllMarkers returns the adjusted p-values from its calls to FindMarkers unmodified, but it should also adjust for testing multiple groups. One way to restore FDR (for test.use="DESeq2") / FWER (other methods) control is p_val_adj <- pmin(p_val_adj * length(x = idents.all), 1).
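
The proposed adjustment can be sketched on a toy result table (the data frame below is illustrative, not Seurat output; idents.all stands for the vector of tested identity classes inside FindAllMarkers):

```r
# Hypothetical sketch of the proposed fix: after FindAllMarkers() has
# concatenated per-cluster results, scale the adjusted p-values by the
# number of tested identity classes (a Bonferroni step across groups).
markers <- data.frame(
  cluster   = rep(c("A", "B", "C"), each = 2),
  p_val_adj = c(0.01, 0.20, 0.03, 0.50, 0.04, 1.00)
)
n_groups <- length(unique(markers$cluster))  # stands in for length(x = idents.all)
markers$p_val_adj <- pmin(markers$p_val_adj * n_groups, 1)
markers$p_val_adj
#> 0.03 0.60 0.09 1.00 0.12 1.00
```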

jaisonj708 commented 4 years ago

FindAllMarkers essentially amounts to running FindMarkers for each of your pre-defined clusters, so we do not feel that a multiple testing correction is necessary. (If, for example, the method were to compare gene expression between a given cluster and every other cluster independently, then a testing correction would be necessary, but this is not what FindAllMarkers does.)

We will eventually switch to Presto for differential expression testing once it is on CRAN, so feel free to refer to that package's default behavior.

jan-glx commented 4 years ago

Running FindMarkers multiple times (once for each Ident level) and concatenating the results without further adjustment of the p/q-values is precisely the problem. The problem is discussed in e.g. https://arxiv.org/abs/1106.3670 , but you can also quickly check it yourself by simulating expected_FDR <- mean(replicate(1000, any(unlist(lapply(list(runif(10), runif(10)), p.adjust, method = "BH")) < 0.1))) and finding that it comes out larger than 0.1 (here runif simulates the p-values of tests where H0 is true).
The example solution I proposed is not optimal, but after reading a bit more about the problem it seems that it is not clear what the best solution is in general. Thus perhaps a warning / pointer to possible solutions in the documentation of FindAllMarkers is a good alternative.
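
One possible mitigation worth noting (an illustrative sketch, not what Seurat implements): pool the raw p-values across all groups and apply a single BH adjustment, rather than adjusting within each group separately:

```r
# Pool null p-values from all groups and adjust once, globally.
set.seed(1)
p_by_group <- list(g1 = runif(10), g2 = runif(10))  # simulated null p-values, 2 groups
pooled     <- unlist(p_by_group)
q_global   <- p.adjust(pooled, method = "BH")       # one BH correction across all tests
```

Adjusting over the pooled set controls the FDR over the whole concatenated table, which is what users reading the combined FindAllMarkers output implicitly assume.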

For one Perturb-seq data set I analyzed, this really made the difference between alleged "hits" and no hits. This is a bit of an extreme case due to the large number of "clusters" (= targeted genes), though.

One might think that this is not too bad in general, since most users probably use FindMarkers after finding clusters from the same data, in which case all test options are clearly invalid anyway. But I think we should care nevertheless: while choosing a less powerful parametric test as default might counterbalance the problem a bit, the use of the more conservative Bonferroni correction (or FWER control in general) probably helps less than one might expect. Biologists often only care about the "top hits" anyway, and for these the difference to e.g. Benjamini-Hochberg is small.
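
The point about top hits can be checked directly: for the smallest p-value, the Bonferroni and BH adjustments coincide (the p-values below are made up for illustration):

```r
# For the best-ranked p-value, Bonferroni (p * m) and BH (p * m / 1) give the
# same adjusted value, so FWER vs FDR control barely matters for the top hit.
p   <- c(1e-8, 1e-6, 0.002, 0.03, 0.2)  # illustrative p-values
adj <- cbind(bonferroni = p.adjust(p, "bonferroni"),
             BH         = p.adjust(p, "BH"))
adj  # first row is identical in both columns
```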

I have not yet checked out Presto but I am intrigued to learn how they tackle the problem, thanks for the pointer!

jaisonj708 commented 4 years ago

Thanks for your response. We will certainly include a warning about these issues in the documentation.

denvercal1234GitHub commented 3 years ago

Humm. This is a very interesting and important discussion. I wish I knew more stats, for sure. But is there any update on the Presto package for DE analyses? And @jan-glx, would you mind helping me understand why finding cluster markers after defining the clusters is a "failure" by design? Is it because we already decided what genes go into grouping the clusters?

Thank you very much!

jan-glx commented 3 years ago

@denvercal1234GitHub The issue with "finding cluster markers after defining clusters" (that I referred to in "FindMarkers after finding clusters from the same data, in which case all test options are clearly invalid anyway" and which is unrelated to the issue raised here) comes from using the same data twice: once for clustering and once for testing for differences between the clusters.

To see how this can be a problem, consider the output of:

```
hist(replicate(100, {
  x <- matrix(rnorm(100), 10)                   # 10 cells x 10 "genes", pure noise
  t.test(x[, 1] ~ kmeans(x, 2)$cluster)$p.value
}))
```

which is far from a uniform distribution. (The code simulates from a 10-dimensional (multivariate) standard normal, clusters the points with k-means, and compares the first dimension ("a gene") between the two clusters with a t-test, which comes out significant in most cases.) The problem is mentioned in several publications; for example, ["Valid Post-clustering Differential Analysis for Single-Cell RNA-Seq" (Zhang et al. 2019)](https://doi.org/10.1016/j.cels.2019.07.012) tries to find a solution (but in my opinion does not discuss well how ill-posed the question really is). Challenge II of ["Eleven grand challenges in single-cell data science" (Lähnemann et al., 2020)](https://doi.org/10.1186/s13059-020-1926-6) discusses the problem and gives further references.

In my opinion, (in the absence of external cluster labels) it boils down to two separate questions: I) are there clusters at all (= is the distribution of cells in gene expression space not unimodal)? and II) which genes are correlated with the marker genes?

Note that the problem with using the same data twice disappears if the uncertainty in cluster assignment disappears, such that the assignment no longer depends on the expression of any single gene (this is often almost the case in practice). So perhaps the problem is not so much about using the same data twice as about ignoring the uncertainty in cluster assignment. Not sure if this is what you are hinting at, but you can also avoid using the same data twice by using separate genes for clustering and DE testing (consider a leave-one-out approach); but then you are basically only asking: "is this gene correlated with other genes?"
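
The leave-one-out idea can be sketched like this (hypothetical code, assuming clustering on all genes except the one being tested):

```r
set.seed(42)
x  <- matrix(rnorm(1000), nrow = 100)        # 100 cells x 10 genes, pure noise
g  <- 1                                      # gene under test
cl <- kmeans(x[, -g], centers = 2)$cluster   # clustering ignores gene g
p  <- t.test(x[, g] ~ cl)$p.value            # test gene g between the clusters
```

Because gene g played no role in defining the clusters, the test is no longer biased by double use of the data; but a small p-value now only says that gene g correlates with the genes that drove the clustering.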
denvercal1234GitHub commented 3 years ago

Thank you very much @jan-glx!

rsatija commented 3 years ago

Apologies that I missed this discussion before now. It's an interesting discussion about multiple testing, and it depends on the intent of the user when running the command. We originally wrote the function as a way to parallelize multiple DE analyses that would be analyzed independently, which is why we did not include additional correction.

However, the more important issue is that p-values associated with all single-cell DE analyses are not well controlled due to 'clustering bias' (discussed, for example, in https://www.sciencedirect.com/science/article/pii/S2405471219302698?via%3Dihub). The manuscript describes an interesting way to mitigate this effect, but (outside of extensive random permutation) it's not really possible to correct for it entirely. This is always important to keep in mind when interpreting p-values from these types of analyses.

jan-glx commented 3 years ago

Thanks for getting back to this and explaining the reasoning; though I highly doubt that many users who run FindAllMarkers on 10 groups of cells realize that, on average, more than one "significant at 0.1 FDR" difference is to be expected just by chance, and that they would therefore be able to analyze and interpret the parallelized DE analyses correctly.
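
This expectation can be checked with a quick simulation (illustrative, not from the thread): 10 groups of 100 null p-values each, BH-adjusted within each group at FDR 0.1, counting how often at least one "hit" appears anywhere:

```r
set.seed(7)
# Fraction of experiments in which at least one null feature is called
# "significant" when 10 groups are each BH-adjusted separately at 0.1.
frac_with_hit <- mean(replicate(2000, {
  qs <- lapply(replicate(10, runif(100), simplify = FALSE),
               p.adjust, method = "BH")
  any(unlist(qs) < 0.1)
}))
frac_with_hit  # far above 0.1; roughly 1 - 0.9^10 under independence
```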

I agree that pseudoreplication and clustering bias are often the bigger problems, but clustering bias is not always an issue (e.g. when identities are inferred from external information, such as captured sgRNAs in Perturb-seq-like experiments, as in the use case mentioned above).

In any case, all three issues should probably be mentioned in the documentation?