t-nol closed 1 year ago
I don't see anything unreasonably high here. With 10k cells you'd expect about 10% doublets, but this can vary a bit, and (as illustrated in #69) it will be higher if your recovery rate is lower than normal. Your score distribution is pretty clear, with >10% of droplets being unambiguous doublets, and another 5% in the middle that are uncertain.
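As a rough illustration of that arithmetic, here is a minimal Python sketch of the rule of thumb implied above (about 1% doublets per 1,000 cells captured). The function name, the `relative_recovery` parameter, and the inverse scaling with recovery are assumptions made for illustration; this is not how scDblFinder computes anything internally.

```python
def expected_doublet_rate(n_cells, rate_per_thousand=0.01, relative_recovery=1.0):
    """Rough expected doublet fraction for a single capture.

    n_cells: number of cells recovered from the capture.
    rate_per_thousand: assumed doublet fraction per 1,000 cells (rule of thumb).
    relative_recovery: hypothetical ratio of actual to typical recovery; values
        below 1 mean more cells were loaded than n_cells alone suggests.
    """
    if n_cells < 0 or relative_recovery <= 0:
        raise ValueError("n_cells must be >= 0 and relative_recovery must be > 0")
    rate = rate_per_thousand * (n_cells / 1000) / relative_recovery
    # A fraction cannot exceed 1, so cap the estimate.
    return min(rate, 1.0)


# Roughly 10% for a 10k-cell capture with typical recovery.
print(expected_doublet_rate(10_000))
# A lower-than-normal recovery pushes the expected rate up.
print(expected_doublet_rate(10_000, relative_recovery=0.8))
```

Under these assumptions, a called rate of 10 to 15% on a 10k-cell capture sits close to the expected range.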
Doublet detection in developmental trajectories is simply harder, because many doublets can be similar to intermediate stages. But this is unrelated to your first question since it shouldn't affect the number of doublets called.
Oh alright, thank you for the clarification. Regarding doublet detection being harder in developmental trajectories: do you have any advice on how to tell whether the doublets found by scDblFinder are actual doublets or intermediate stages? I've processed my data while keeping these doublets for now, and they are evenly spread out across all defined clusters.
Assuming that your intermediate stages are present and not very rare in your dataset, the uncertainty should be reflected in a lower doublet score (i.e. not 1 but e.g. 0.5). That's what I meant by 'harder'.
If they're not (i.e. the intermediate stages are absent or very rare), then it's a very tough question. On non-linear embeddings such as UMAP and tSNE, it's normal for most doublets to appear in one of the clusters (most often at its border), rather than in-between. In PCA space, however, they'd typically be in-between, whereas intermediate stages would depart from a simple linear combination.
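To make the PCA argument concrete, here is a small Python sketch of one way to check whether a cell looks like a linear combination of two cluster centroids. It is only an illustration under stated assumptions: the function name and the toy coordinates are made up, and this is not part of scDblFinder. The idea is to project the cell onto the line joining the two centroids and look at how far it sits from that line.

```python
import math


def position_along_pair(point, centroid_a, centroid_b):
    """Return (t, residual) for a point relative to the line A -> B.

    t is the position of the projection along the line (0 at A, 1 at B).
    residual is the Euclidean distance from the point to that line.
    """
    ab = [b - a for a, b in zip(centroid_a, centroid_b)]
    ap = [p - a for a, p in zip(centroid_a, point)]
    ab_sq = sum(x * x for x in ab)
    if ab_sq == 0:
        raise ValueError("centroids must be distinct")
    t = sum(x * y for x, y in zip(ab, ap)) / ab_sq
    closest = [a + t * x for a, x in zip(centroid_a, ab)]
    return t, math.dist(point, closest)


# Toy coordinates in a 3-dimensional PCA space (hypothetical values).
cluster_a = (0.0, 0.0, 0.0)
cluster_b = (4.0, 0.0, 0.0)
doublet_like = (2.0, 0.1, 0.0)       # close to the A-B segment
intermediate_like = (2.0, 3.0, 0.0)  # same t, but far off the line

print(position_along_pair(doublet_like, cluster_a, cluster_b))
print(position_along_pair(intermediate_like, cluster_a, cluster_b))
```

A cell with `t` between 0 and 1 and a small residual is compatible with a simple mixture of the two clusters; a large residual suggests it departs from a linear combination, as an intermediate stage would. What counts as "small" depends on the spread of the clusters and is left open here.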
I see. Thank you for explaining, really appreciate it. It makes sense.
Hello,
Thank you for this very useful tool. I am currently trying it on some developmental datasets and, as in issue #69, I've been seeing unreasonably high doublet rates (10-15% per sample). In that same issue, the question "What kind of tissue is this? adult or developmental/trajectory-like?" was asked, and I was wondering if you had any information on the impact of developmental-like samples on the doublet detection rate. Is there anything to take note of during processing?
Each dataset corresponds to one 10x capture, with around 10,000 sequenced cells per sample. My datasets are stored as a list of Seurat objects that I iterate over. You'll find my code below:
And here is an example of the distribution of split_D$scDblFinder.score: