plger / scDblFinder

Methods for detecting doublets in single-cell sequencing data
https://plger.github.io/scDblFinder/
GNU General Public License v3.0

Unreasonably high doublets rate #84

Closed by t-nol 1 year ago

t-nol commented 1 year ago

Hello,

Thank you for this very useful tool. I am currently trying it on some developmental datasets and, as in issue #69, I've been seeing unreasonably high doublet rates (10-15% per sample). In that same issue, the question "What kind of tissue is this? adult or developmental/trajectory-like?" was asked, and I was wondering whether you had any information on how developmental-like samples affect the doublet detection rate. Is there anything to watch out for during processing?

Each dataset corresponds to one 10x capture, with around 10,000 sequenced cells per sample. My datasets are stored as a list of Seurat objects that I iterate over. You'll find my code below:

  # Convert the i-th Seurat object in the list to a SingleCellExperiment.
  sce <- as.SingleCellExperiment(object.list[[i]])
  # For 10x data it is safe to leave dbr unset; it is then estimated automatically.
  sce <- scDblFinder(sce)

And here is an example of the distribution of split_D$scDblFinder.score: [image: scdblfinder_scores]
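A score distribution like the one mentioned above can also be summarised numerically. The following is a minimal base-R sketch, not part of scDblFinder: the helper name `summarise_scores` and the cut-offs 0.2 and 0.8 are hypothetical choices for illustration, and the scores are simulated rather than taken from this dataset.

```r
# Hypothetical helper: bin doublet scores into three groups and return the
# fraction of cells in each. The cut-offs are illustrative assumptions only.
summarise_scores <- function(scores, low = 0.2, high = 0.8) {
  bins <- cut(scores,
              breaks = c(-Inf, low, high, Inf),
              labels = c("likely singlet", "uncertain", "likely doublet"))
  prop.table(table(bins))
}

# Simulated scores for 100 droplets (made-up values, not real data).
scores <- c(rep(0.01, 85), rep(0.5, 5), rep(0.99, 10))
fractions <- summarise_scores(scores)
print(fractions)
```

On a real object the same helper could be applied to the `scDblFinder.score` column that `scDblFinder()` adds to the cell metadata, e.g. `summarise_scores(sce$scDblFinder.score)`.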

plger commented 1 year ago

I don't see anything unreasonably high here. With 10k cells you'd expect about 10% doublets, but this can vary a bit and (as illustrated in #69) will be higher if your recovery rate is lower than normal. Your score distribution is pretty clear, with >10% of droplets being unambiguous doublets and another 5% in the middle being uncertain.

Doublet detection in developmental trajectories is simply harder, because many doublets can be similar to intermediate stages. But this is unrelated to your first question since it shouldn't affect the number of doublets called.
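The 10% figure for 10k cells is consistent with a rule of thumb of roughly 1% doublets per 1,000 captured cells. A minimal sketch of that arithmetic follows; the function name `expected_doublet_rate` is hypothetical, and the linear 1%-per-1,000 rate is an assumption inferred from the numbers quoted in this thread, not a statement of how scDblFinder estimates the rate internally.

```r
# Assumed rule of thumb: about 1% doublets per 1000 captured cells.
# Both the function name and the default rate are illustrative assumptions.
expected_doublet_rate <- function(n_cells, rate_per_1000 = 0.01) {
  rate_per_1000 * n_cells / 1000
}

n_cells <- 10000
rate <- expected_doublet_rate(n_cells)   # fraction of droplets expected to be doublets
n_doublets <- round(rate * n_cells)      # expected number of doublets in the capture
print(rate)        # 0.1
print(n_doublets)  # 1000
```

Under this assumption, a called rate of 10-15% for a capture of about 10,000 cells is in the expected range rather than unreasonably high.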

t-nol commented 1 year ago

Oh alright, thank you for the clarification. Regarding doublet detection being harder in developmental trajectories: do you have any advice on how to tell whether the doublets found by scDblFinder are actual doublets or intermediate stages? I've processed my data while keeping these doublets for now, and they are evenly spread out across all defined clusters.

plger commented 1 year ago

Assuming that your intermediate stages are present and not very rare in your dataset, the uncertainty should be reflected in a lower doublet score (i.e. not 1 but e.g. 0.5). That's what I meant by 'harder'.

If they're not, then it's a very tough question. On non-linear embeddings such as UMAP and t-SNE, it's normal for most doublets to appear in (most often at the border of) one of the clusters, rather than in-between. In PCA space, however, they'd typically be in-between, whereas intermediate stages would depart from a simple linear combination.
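One rough way to probe the PCA argument above is to measure how far a cell lies from the straight segment joining two cluster centroids in PCA space: a doublet of those two clusters should sit close to the segment, while a genuine intermediate stage may deviate from it. The sketch below is base R only; the helper `segment_distance` and the toy coordinates are hypothetical, and it is a heuristic illustration rather than anything scDblFinder does. On real data the coordinates might come from a stored reduction, e.g. `reducedDim(sce, "PCA")`, assuming one is present.

```r
# Hypothetical helper: Euclidean distance from point x to the line segment
# between centroids a and b (all numeric vectors of the same length).
segment_distance <- function(x, a, b) {
  ab <- b - a
  t <- sum((x - a) * ab) / sum(ab * ab)  # position of the projection along a -> b
  t <- min(max(t, 0), 1)                 # clamp so the projection stays on the segment
  sqrt(sum((x - (a + t * ab))^2))
}

# Toy 2-D example (made-up coordinates, not real data).
centroid_a <- c(0, 0)
centroid_b <- c(10, 0)
doublet_like      <- c(5, 0.5)  # close to a linear mix of the two centroids
intermediate_like <- c(5, 4)    # departs from the straight line between them

d_doublet      <- segment_distance(doublet_like, centroid_a, centroid_b)
d_intermediate <- segment_distance(intermediate_like, centroid_a, centroid_b)
print(c(d_doublet, d_intermediate))  # 0.5 4.0
```

Because expression values are usually log-transformed before PCA, doublets are only approximately linear combinations there, so small distances are suggestive rather than conclusive.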

t-nol commented 1 year ago

I see. Thank you for explaining, really appreciate it. It makes sense.