plger / scDblFinder

Methods for detecting doublets in single-cell sequencing data
https://plger.github.io/scDblFinder/
GNU General Public License v3.0
162 stars 17 forks source link

Clarify recommended nFeatures/artificialDoublets for scATACseq doublet removal in vignette #89

Closed jeremymsimon closed 12 months ago

jeremymsimon commented 12 months ago

Hi- I'm attempting to remove doublets from my scATAC-seq data largely following your vignette, but it would be helpful to have some clarification on how many nFeatures and artificialDoublets is recommended for real-world data. I have just under 1.1M features x ~85k cells in my full counts matrix, and I'm passing sample names (n=14 samples) and clusters from my integrated, clustered Seurat object so that the doublets can be modeled on each sample separately (like I do for scRNA).

In the vignette, you specify only nFeatures = 25 which seems quite small, is that only for the sake of speed for the vignette? And I assumed I could keep artificialDoublets=NULL, but is something required here?

My current command

sce <- scDblFinder(sce,
                 samples = "Sample",
                 clusters = "seurat_clusters",
                 aggregateFeatures=TRUE,
                 nFeatures = 1000,
                 processing="normFeatures"
)

exits with

Error in getArtificialDoublets(counts(sce), n = artificialDoublets, clusters = clusters, : unused argument (nFeatures = 1000)

Note this is running with scDblFinder v1.14.0 in R v4.3.1, so I can update to the newest version if needed

plger commented 12 months ago

Hi Jeremy,

Regarding your error, I think it's just a typo: the argument is nfeatures, not nFeatures.

For the rest, these are hard questions and I wish I could give you clear answers, but I'm afraid I can't: I didn't play around with enough scATAC datasets with a doublet 'ground truth' to be confident about the best way to proceed. However I can provide explanations that can hopefully guide you to some degree.

First, you probably already know that from the documentation, but just in case I want to mention that for scATAC, you should ideally rely on both the amulet method (Thibodeau et al.), of which there's an implementation in scDblFinder, and the scDblFinder method. The former typically works well for cells that have a large-enough library size (and will also detect homotypic doublets), while the latter will be better for cells that are more shallowly-sequenced.

The aggregation approach has two advantages: first, it allows one to make use of all peaks (frameworks like Signac and ArchR rely on a small selection of variable peaks), and second it makes the signal very easy to normalize (because the aggregated counts per meta-feature are much higher). Indeed, as we've shown in a recent benchmark, the TF-IDF transformation of LSI-based approaches creates a large library size effect in the several of the top components, and the very simple aggregation approach works unexpectedly well (which I frankly hadn't expected at all).

Now when running scDblFinder with aggregateFeatures=TRUE, processing="normFeatures", what happens is that the (normalized) aggregated features are used as the equivalent of what would be, normally (i.e. in scRNAseq), the PCA. That's why nfeatures in the vignette is so low: it was meant to be equivalent to the PCA dimensions.

Since the scDblFinder paper, and the observation that feature aggregation was a general strategy worth exploring for other tasks, we played a bit with the strategy, and what we for example tested for embedding/clustering tasks in the @RoseYuan 's above-mentioned benchmark is slightly different from the scDblFinder paper: we aggregated into many more meta-features (e.g. 1000), but then proceed with the PCA step. This works better for separating subpopulations, and presumably would also work better for doublet detection, but I haven't really tested that specifically and therefore can't say for sure. In scDblFinder, this could be achieved simply by setting nfeatures=1000 and omitting the processing="normFeatures". If you want, you can also play with the scDblFinder::aggregateFeatures function, which returns an SCE of the aggregated features -- you can for instance run a PCA+UMAP on it and see how your populations look like under different number of meta-features.

Finally, yes you're probably better off leaving artificialDoublets to the default. I see no reason why this should be different for ATACseq.

Hope this helps, Pierre-Luc

jeremymsimon commented 12 months ago

Amazing, thank you @plger for the helpful answer (and for catching my typo)! Yes my plan was to use the merge of both scDblFinder and amulet for now, but your other approaches described here could also be helpful depending on what the results look like.

I'll give this a try and reach out if I run into any issues- thanks again!