Wild Variability in doublet scores

markddesimone commented 3 months ago

Hi, thank you for the excellent tool. I was looking at the variability in scores between different runs on the same data and was very surprised to see barcodes called as doublets with very high probability in one run and almost zero probability in another. Here's a scatter plot of one run's scores vs. another:

[scatter plot: doublet scores from one run vs. another]

I was expecting variability, of course, but nothing so dramatic. I'm not sure what to do with these results. For example, I could run scDblFinder, say, 10 times and threshold on a high level of agreement on called doublets; or I could take the union of all called doublets across the 10 runs; or something in between. I'm interested in your opinion as to the better approach and the reasoning behind it.

Thank you

plger commented 3 months ago

Hi,

First, just to be clear: you most likely have thousands of cells in that experiment, the overwhelming majority of which are in the lower-left corner, with another mode in the upper-right corner, so the large variability you're talking about normally concerns a very small proportion of cells. That being said, your question is perfectly valid, and I haven't really investigated it in depth, so I can only share an educated opinion.

The non-deterministic component of the method is the generation of artificial doublets. This sounds trivial because the number of possible combinations of cell types is typically relatively small, but it's made considerably harder by the variations in the mixing proportions.

If the samples are complex, it could be that one run didn't create the 'right kind of artificial doublet', while the other did, which would argue for taking the union. If this were the prevailing case, however, then increasing the number of artificial doublets should fix this. In my experience, doubling or tripling it does not fix it. Alternatively, it can also be that one run created homotypic artificial doublets that are so similar to some real cell that the cell got wrongly tagged, which would argue against taking the union. Since doublets are rare, the latter is more likely.

The issue is further complicated by the iterative training, which can to some extent lead to more 'polarized' probabilities: if in the initial training round a cell is thought to be a doublet, it will not be used as a 'putative singlet' in the next training rounds, and as a result it (and its close neighbors) will be assigned a higher doublet score. (Similarly, random artificial doublets get excluded if they are deemed too indistinguishable from singlets.) This means that relatively modest differences early on, due to the random artificial doublet generation, can get amplified, which could explain why, for those cells, the differences are dramatic. However, disabling the iterative training typically doesn't increase agreement across runs, rather the opposite.

Ultimately, the question boils down to what kind of error is most problematic to you. If you really don't want doublets and don't mind losing a few real cells on the way, use the union; if it's the opposite and you want to lose as few real cells as possible, use only the doublets on which the different runs agree. And if you're somewhere in the middle, use the mean score or a majority vote on the call, which is probably what I would do. But if I find some time next week, I'll see if I can test the alternatives on some benchmark data.
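To make the options concrete, here is a minimal Python sketch of the four ways of combining runs mentioned above (union, majority vote, unanimous agreement, mean score). It does not call scDblFinder itself: it assumes the per-barcode scores of each run have already been exported to plain dictionaries. The function name `combine_runs`, the barcodes, the scores, and the 0.5 cut-off are all made up for illustration, and the cut-off is not necessarily the threshold scDblFinder would choose.

```python
from statistics import mean


def combine_runs(scores_by_run, threshold=0.5):
    """Combine doublet scores from several runs on the same barcodes.

    scores_by_run: list of dicts, one per run, mapping barcode -> score in [0, 1].
    threshold: illustrative cut-off for calling a doublet (an assumption).
    Returns a dict mapping barcode -> dict of calls under each strategy.
    """
    n_runs = len(scores_by_run)
    combined = {}
    for barcode in sorted(scores_by_run[0]):
        scores = [run[barcode] for run in scores_by_run]
        votes = sum(score >= threshold for score in scores)
        combined[barcode] = {
            "union": votes >= 1,             # strictest on doublets, loses more real cells
            "majority": votes > n_runs / 2,  # middle ground on the calls
            "unanimous": votes == n_runs,    # keeps the most real cells
            "mean_score": mean(scores) >= threshold,  # middle ground on the scores
        }
    return combined


# Made-up scores for four barcodes over three runs.
example_runs = [
    {"AAAC": 0.95, "CCGT": 0.02, "GGTA": 0.90, "TTAG": 0.60},
    {"AAAC": 0.97, "CCGT": 0.01, "GGTA": 0.05, "TTAG": 0.70},
    {"AAAC": 0.99, "CCGT": 0.03, "GGTA": 0.10, "TTAG": 0.30},
]
example_calls = combine_runs(example_runs)
for barcode, calls in example_calls.items():
    print(barcode, calls)
```

In this toy example, "GGTA" is flagged in only one run, so it is removed under the union but kept under every other strategy, while "TTAG" is flagged in two of three runs and is removed by everything except the unanimous rule.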

Hope this helps,

Pierre-Luc

markddesimone commented 2 months ago

Pierre-Luc, thank you for your detailed response and the concluding paragraph. Very helpful.

I mapped the agreement over 10 runs to a red color scale on a UMAP. It appears that most of the variability is in the homotypic doublets, which makes a lot of sense. The heterotypic doublets appear to be consistently called by scDblFinder. I can set a threshold and call accordingly. Thanks again, Mark
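For anyone wanting to reproduce the agreement measure, a rough Python sketch follows. It assumes each run's doublet calls have been exported as a set of barcodes; the function names, the barcodes, and the 0.8 agreement threshold are hypothetical and only serve as an example.

```python
def agreement_fraction(calls_by_run):
    """Fraction of runs in which each barcode was called a doublet.

    calls_by_run: list of sets, one per run, holding the barcodes called doublets.
    Barcodes never called in any run are omitted (their fraction would be 0).
    """
    n_runs = len(calls_by_run)
    ever_called = set().union(*calls_by_run)
    return {
        barcode: sum(barcode in run for run in calls_by_run) / n_runs
        for barcode in ever_called
    }


def split_by_agreement(fractions, min_agreement=0.8):
    """Split barcodes into consistently and inconsistently called doublets."""
    consistent = sorted(b for b, f in fractions.items() if f >= min_agreement)
    inconsistent = sorted(b for b, f in fractions.items() if f < min_agreement)
    return consistent, inconsistent


# Made-up calls from five runs.
example_run_calls = [
    {"AAAC", "TTAG"},
    {"AAAC", "GGTA"},
    {"AAAC", "TTAG"},
    {"AAAC"},
    {"AAAC", "TTAG"},
]
fractions = agreement_fraction(example_run_calls)
consistent, inconsistent = split_by_agreement(fractions)
print(fractions)
print("consistent:", consistent)
print("inconsistent:", inconsistent)
```

The per-barcode fractions are the values that could be mapped to a color scale on an embedding, and `min_agreement` plays the role of the threshold on agreement.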

plger / scDblFinder

Wild Variability in doublet scores #106