High doublet rate? - Githubissues

yesitsjess commented 3 months ago

I'm getting 33.4% of my cells (barcodes) predicted to be doublets (27.5% when clusters=F), and I read that somewhere in the region of 10% is more usual. Any suggestions on what might have caused this? Or comments on whether I'm doing something wrong, please?

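# load the packages used below
library(DropletUtils)  # read10xCounts
library(scater)        # logNormCounts, runPCA, runUMAP, plotReducedDim
library(scDblFinder)   # fastcluster, scDblFinder

# read 10x cellranger count output
sce <- read10xCounts(paste0(data_dir, samps_dir, "/outs/filtered_feature_bc_matrix"), samps_dir)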

# log normalise, perform PCA and generate UMAP
sce <- scater::logNormCounts(sce)
sce <- scater::runPCA(sce)
sce <- scater::runUMAP(sce)
plotReducedDim(sce, "UMAP")

# get clusters to run doublet finding function using cluster information
sce$cluster <- fastcluster(sce)

# identify suspected doublets
sce <- scDblFinder(sce, clusters="cluster")
# sce <- scDblFinder(sce, clusters=FALSE)  # alternatively

table(sce$scDblFinder.class)

I've also tried quickly clustering myself (rather than using fastcluster) and still get 23.2% doublets called.

g <- scran::buildSNNGraph(sce)
cl <- igraph::cluster_fast_greedy(g)$membership
sce$cluster <- cl

My dataset is basically all the same cell type, so I would expect a low number of clusters. Will this affect things? Also, I haven't done any additional QC here; I'm just using the output from cellranger count (empty droplets filtered out). I was planning to import the doublet predictions from scDblFinder as a QC step in my main pipeline, because I'm using cellbender remove-background and wasn't sure whether this would render my counts incompatible with doublet detection.

scDblFinder v1.16.0

plger commented 3 months ago

Hi, what is samps_dir, and ncol(sce) ?

yesitsjess commented 3 months ago

> Hi, what is samps_dir, and ncol(sce) ?

samps_dir is a vector containing the sample directory names (as output by the cellranger count run):

> samps_dir
[1] "SITTA8" "SITTB8" "SITTC8" "SITTD7" "SITTD8" "SITTE7" "SITTE8" "SITTF7" "SITTF8" "SITTG7" "SITTG8" "SITTH8"

> ncol(sce)
[1] 75861

plger commented 3 months ago

It's always a good idea to read the "Getting started" documentation: https://plger.github.io/scDblFinder/articles/scDblFinder.html#multiple-samples

plger commented 3 months ago

and https://plger.github.io/scDblFinder/articles/scDblFinder.html#im-getting-way-too-many-doublets-called---whats-going-on
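
A rough back-of-the-envelope sketch of why ncol(sce) matters here. It assumes the commonly quoted droplet rule of thumb of about 1% doublets per 1,000 cells captured (the package's actual default may differ by version) and, purely for illustration, that the 12 samples are equally sized; neither number comes from this dataset. The helper name expected_rate is hypothetical.

```r
# ASSUMPTION: ~1% expected doublets per 1,000 cells captured (rule of thumb,
# not necessarily the exact default of the installed scDblFinder version).
rate_per_1k <- 0.01

n_cells_total <- 75861  # ncol(sce) reported above
n_samples <- 12         # length(samps_dir) reported above

# Hypothetical helper: expected doublet fraction for one capture of n_cells
expected_rate <- function(n_cells, per_1k = rate_per_1k) {
  min(1, per_1k * n_cells / 1000)
}

# All 12 samples treated as a single capture
pooled <- expected_rate(n_cells_total)

# ASSUMPTION: equally sized samples, each treated as its own capture
per_sample <- expected_rate(n_cells_total / n_samples)

print(c(pooled = pooled, per_sample = per_sample))
```

Under these assumptions the pooled expectation comes out at roughly 76%, against roughly 6% per sample, which is why treating a multi-sample object as one capture can inflate the number of doublets called.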

yesitsjess commented 3 months ago

So run it sample by sample and not on the whole dataset. Thanks, I'll try it.
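
For reference, a minimal sketch of what the per-sample run could look like. It assumes the Sample column that read10xCounts adds to colData when sample names are supplied, and passes that column name to the samples argument of scDblFinder. run_per_sample is a hypothetical helper name, not part of scDblFinder, and data_dir and samps_dir are the variables from the original post.

```r
# Hypothetical helper (not part of scDblFinder): load all samples into one
# object, then let scDblFinder process each sample separately via `samples`.
run_per_sample <- function(data_dir, samps_dir) {
  # Same path construction as in the original post (data_dir ends with "/")
  paths <- paste0(data_dir, samps_dir, "/outs/filtered_feature_bc_matrix")
  sce <- DropletUtils::read10xCounts(paths, sample.names = samps_dir)
  # `Sample` is the colData column that read10xCounts fills with sample names
  sce <- scDblFinder::scDblFinder(sce, samples = "Sample")
  sce
}

# Usage, with data_dir and samps_dir as defined in the original post:
# sce <- run_per_sample(data_dir, samps_dir)
# table(sce$Sample, sce$scDblFinder.class)
```

Tabulating the calls by sample, as in the commented usage lines, makes it easier to spot a single sample with an unusually high doublet fraction.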

plger / scDblFinder

High doublet rate? #108