plger / scDblFinder

Methods for detecting doublets in single-cell sequencing data
https://plger.github.io/scDblFinder/
GNU General Public License v3.0

Running scDblFinder before and after removing low-QC cells gives different results #79

Closed yeroslaviz closed 1 year ago

yeroslaviz commented 1 year ago

Hi, thanks for a really great and easy to use tool for identifying doublets in my data set.

I was wondering how well the tool works on a SMART-Seq 2 data set with "only" < 100 cells.

I'm getting the warning that this might cause a problem, but my question is different.

I ran the scDblFinder command on my sce object after removing low-QC cells identified via addPerCellQCMetrics, and only two cells were identified as doublets.

For some reason I needed to repeat the analysis, and this time I ran the doublet filtering first, before removing the low-QC cells. This time it identified 9 cells as doublets. I know that isn't many, but it's still >10% of my data set.

I'm mainly interested in understanding whether I can trust these results for such a small data set, and if so, why there is such a big difference depending on when one runs the search.

thanks Assa

plger commented 1 year ago

scDblFinder is not deterministic, so running it twice with different seeds will give slightly different results. The difference you're describing, however, seems bigger than that. A first question is whether the additional putative doublets were among the cells that were removed by scater's QC.

Whether to run before or after QC has already been discussed elsewhere: it's preferable to first get rid of droplets with very little coverage (e.g. <500 reads), but otherwise run scDblFinder before any further filtering.
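As a sketch of that order (assuming `sce` is a `SingleCellExperiment` and that QC is done with scuttle; the 500-read cutoff is the one mentioned above, and the outlier-based filter is just one common choice):

```r
library(scuttle)
library(scDblFinder)

# 1. remove only near-empty droplets first (cutoff is illustrative)
sce <- sce[, colSums(counts(sce)) >= 500]

# 2. run doublet detection on the still largely unfiltered data
sce <- scDblFinder(sce)

# 3. only then apply the usual QC filtering
sce <- addPerCellQCMetrics(sce)
lowQC <- isOutlier(sce$sum, log=TRUE, type="lower")

# cross-tabulate to see whether putative doublets overlap the low-QC cells
table(doublet=sce$scDblFinder.class, lowQC=lowQC)

sce <- sce[, !lowQC]
```

This also answers the question above: the `table()` call shows directly how many of the cells called as doublets would have been discarded by QC.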

scDblFinder on very small datasets

The deeper issue is more complex -- you might consider renaming the issue to ask whether one can use scDblFinder on very small datasets. I've never done so. What I can do, though, is take an existing dataset with a ground truth, keep only cells above a certain library size (2000 reads) to be more similar to Smart-seq data, downsample it (keeping 10% doublets), run scDblFinder and evaluate (the code can be found here). The area under the precision-recall curve doesn't really change, meaning that doublets are consistently ranked higher than singlets even in small datasets. So in principle you should be able to use scDblFinder on such small datasets.
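A hypothetical sketch of such an evaluation (assuming a `SingleCellExperiment` `sce` with a ground-truth logical column `sce$is_doublet`, which is not part of the package; PRROC is one way to compute the AUPRC):

```r
library(scDblFinder)
library(PRROC)

set.seed(123)
sce <- sce[, colSums(counts(sce)) >= 2000]   # keep well-covered cells
n <- 100                                     # downsampled dataset size
idx <- c(sample(which(sce$is_doublet),  round(0.1 * n)),   # 10% doublets
         sample(which(!sce$is_doublet), round(0.9 * n)))
small <- scDblFinder(sce[, idx])

# rank-based evaluation: AUPRC of the score against the ground truth
pr <- pr.curve(scores.class0 = small$scDblFinder.score[small$is_doublet],
               scores.class1 = small$scDblFinder.score[!small$is_doublet])
pr$auc.integral
```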

[figure: downsampling]

This evaluates the ranking of the doublet scores, but not the thresholding (the point at which to make the call). With small datasets, an overabundance of artificial doublets is generated to increase power, which skews the doublet scores to the higher side, and the scores also tend to be less polarized. You can see this in the figure below: with many cells (bottom row), most cells get a very low score, a few (the doublets) instead get a very high score, and there's a huge gap between the two. In such cases, setting a threshold is easy. However, as we decrease the number of cells, the gap gradually disappears and it becomes very difficult to place the threshold.

[figure: downsampling_scoreHist]

In such circumstances, it's very hard to establish from the data how many doublets you really have. Since this is Smart-seq, you probably don't even have a clear prior expectation. So while you can trust the doublet score (i.e. scDblFinder.score) to be a good ranking of doublets, you can't really trust the call, i.e. scDblFinder.class.

This problem is in good part due to the large number of artificial doublets created. Normally, roughly as many artificial doublets are created as there are real cells. For small datasets, however, this would mean few doublets, which can be insufficient to capture the possible mixings, so a hard minimum was set (originally to 5000). This makes the thresholding harder in very small datasets, as can be seen by varying the artificialDoublets parameter (rows here):

[figure: downsampling_hist_nAd]

I've now changed this hard minimum to 1500 (currently only on GitHub), which should still sample the mixing space while improving the separation in very small datasets such as yours. In your case, I'd recommend simply setting the artificialDoublets parameter yourself (to, say, 500 or 1000).
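For example (artificialDoublets is the scDblFinder argument discussed above; the value 500 is just one of the suggested settings):

```r
library(scDblFinder)

# cap the number of artificial doublets generated for a tiny dataset
sce <- scDblFinder(sce, artificialDoublets=500)

# the score remains usable as a ranking even where the call is uncertain
head(colnames(sce)[order(sce$scDblFinder.score, decreasing=TRUE)])
```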

Finally, I would suggest first looking at your score histogram: if you see a clear peak close to 1, your job is easier. If you don't, then you probably want to visualize the putative doublets. I recommend doing so in PCA space, because it's linear, i.e. a doublet should lie somewhere between the two cell types it's composed of.
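A sketch of that inspection, assuming scater is used for the PCA (logNormCounts, runPCA and plotPCA are scater/scuttle functions; the normalization step is included so PCA runs on logcounts):

```r
library(scater)

# 1. look for a clear peak of scores near 1
hist(sce$scDblFinder.score, breaks=50)

# 2. visualize the scores in PCA space; putative doublets should lie
#    between the clusters of the two cell types they are composed of
sce <- logNormCounts(sce)
sce <- runPCA(sce)
plotPCA(sce, colour_by="scDblFinder.score")
```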

Hope this helps, plger