plger / scDblFinder

Methods for detecting doublets in single-cell sequencing data
https://plger.github.io/scDblFinder/
GNU General Public License v3.0

Unreasonably high doublets rate #69

Closed zqun1 closed 1 year ago

zqun1 commented 1 year ago

Dear developers,

Thank you very much for developing this useful tool. I tried it on my dataset, using the samples = 'sampleID' argument. However, I still get a doublet rate of roughly 10%, which seems unreasonably high. Could you help, please?

Here is my code:

bp <- SnowParam(8, RNGseed=1234) # fixed seed to make the results reproducible; on Unix, MulticoreParam() can be used instead
bpstart(bp)
split_D <- scDblFinder(split_D, samples = 'sampleID', BPPARAM = bp) # split_D is my SCE object
bpstop(bp)
split_D@colData$scDblFinder.class %>% table
singlet doublet 
  31037    3260

Here are the numbers of cells for each sampleID:

split_D@colData$sampleID
4210      5831      6486      2981      5037      5525      1424      2803

I double checked in the resulting SCE object and the scDblFinder.sample equals the sampleID.

According to 10X, each sample at this cell number should contain <5% doublets: https://kb.10xgenomics.com/hc/en-us/articles/360001378811-What-is-the-maximum-number-of-cells-that-can-be-profiled-
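For reference, the observed and expected rates being compared here can be checked with a few lines of base R. This is only a sketch: the 0.8% per 1,000 recovered cells figure is an assumed rule of thumb for 10x data, not a value stated in this thread, and the sample labels `s1` to `s8` are hypothetical because the real sampleIDs are not shown.

```r
# Cells recovered per sample, as listed above (sample names are hypothetical)
n_cells <- c(s1 = 4210, s2 = 5831, s3 = 6486, s4 = 2981,
             s5 = 5037, s6 = 5525, s7 = 1424, s8 = 2803)

# Calls reported by scDblFinder across all samples
calls <- c(singlet = 31037, doublet = 3260)
stopifnot(sum(calls) == sum(n_cells))  # both add up to the same total

# Observed doublet rate over all cells
observed_rate <- unname(calls["doublet"] / sum(calls))

# ASSUMPTION: ~0.8% doublets per 1000 recovered cells (rule of thumb)
rate_per_1k <- 0.008
expected_per_sample <- rate_per_1k * n_cells / 1000

# Overall expectation: per-sample rates weighted by sample size
expected_overall <- sum(expected_per_sample * n_cells) / sum(n_cells)

print(round(100 * c(observed = observed_rate, expected = expected_overall), 1))
```

With these inputs the observed rate is about 9.5% of all cells, against roughly 3.9% expected under the assumed rule of thumb.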

sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocParallel_1.32.5         scDblFinder_1.13.7          SingleCellExperiment_1.20.0 SummarizedExperiment_1.28.0
 [5] Biobase_2.58.0              GenomicRanges_1.50.2        GenomeInfoDb_1.34.6         IRanges_2.32.0             
 [9] S4Vectors_0.36.1            BiocGenerics_0.44.0         MatrixGenerics_1.10.0       matrixStats_0.63.0         
[13] future_1.31.0               dittoSeq_1.10.0             forcats_0.5.2               stringr_1.5.0              
[17] dplyr_1.0.10                purrr_1.0.1                 readr_2.1.3                 tidyr_1.2.1                
[21] tibble_3.1.8                ggplot2_3.4.0               tidyverse_1.3.2             plyr_1.8.8                 
[25] data.table_1.14.6           SeuratObject_4.1.3          Seurat_4.3.0          
plger commented 1 year ago

Hi,

* I assume the sampleIDs are individual 10x captures (i.e. no cell barcoding or such)?

* What kind of tissue is this? Adult or developmental/trajectory-like?

* Do you know how many cells were put into the machine originally?

* Could you plot the distribution of `split_D$scDblFinder.score`?
  (FYI you should avoid using `@`; the colData columns can be accessed directly with `split_D$whatever`)

**Thanks**

zqun1 commented 1 year ago

Thank you for the quick reply!

  1. Yes.
  2. They are sorted immune cells from adult mice.
  3. I aimed for 10k cells for sequencing. For GEM generation, I loaded 10-20k cells per sample (the very first step). In the end, I only captured 1.4-6.5k cells per sample, as mentioned above.
  4. See below:
p1 <- hist(split_D$scDblFinder.score, plot = FALSE)
p1$density <- p1$counts/sum(p1$counts) * 100
plot(p1, freq = FALSE)

[image: histogram of split_D$scDblFinder.score]
plger commented 1 year ago

Hi, ok, this is as I thought: I'm afraid you really do have ~10% or so doublets. The determining factor for the doublet rate is the number of cells loaded, as this influences the cell density and hence the probability that two cells are captured in the same droplet. The fact that many of these cells were, for instance, too damaged (or otherwise...) to pass cellranger's early QC (i.e. the calls of what's a cell and what's an empty droplet) doesn't influence the doublet rate. (Note that this isn't the only possible explanation for few cells / few reads in cells.) So sorry if it's a disappointment for you, but I think scDblFinder does a nice job of finding them despite having the wrong expected doublet rate :)
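To make this concrete, here is a rough sketch of how an expected rate based on the targeted number of cells, rather than on the recovered cells, could be computed and passed to scDblFinder through its `dbr` argument. The 0.8% per 1,000 cells figure is an assumed rule of thumb, the 10k target is the one mentioned in this thread, and the helper `expected_dbr()` is hypothetical.

```r
# Hypothetical helper: expected doublet rate for a given cell count,
# ASSUMING a rule of thumb of ~0.8% doublets per 1000 cells
expected_dbr <- function(n_cells, rate_per_1k = 0.008) {
  rate_per_1k * n_cells / 1000
}

# Rates implied by the smallest and largest recovered samples
dbr_recovered <- expected_dbr(c(1424, 6486))

# Rate implied by the 10k cells that were targeted
dbr_targeted <- expected_dbr(10000)

print(round(100 * c(recovered_min = dbr_recovered[1],
                    recovered_max = dbr_recovered[2],
                    targeted = dbr_targeted), 1))

# Optional: supply the target-based rate explicitly (only runs if scDblFinder is installed)
if (requireNamespace("scDblFinder", quietly = TRUE)) {
  sce <- scDblFinder::mockDoubletSCE()  # small simulated dataset shipped with the package
  sce <- scDblFinder::scDblFinder(sce, dbr = dbr_targeted)
  print(table(sce$scDblFinder.class))
}
```

On real data the same `dbr` value would be passed alongside `samples = 'sampleID'`; whether overriding the default is worthwhile depends on how well the loading numbers are known.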

zqun1 commented 1 year ago

Hi, I see. So I should not look at the number of cells recovered from sequencing to determine the doublet rate. But for some reason, unfortunately, my recovery rate is significantly lower than expected (as listed by 10X), right?

Computationally, scDblFinder only knows the number of cells I recovered from 10X. Therefore, the expected doublet rate (dbr) is probably determined by the recovered cell number, isn't it? How come the threshold for scDblFinder.score was decided so that the actual doublet rate is more than 2x of the expected rate? These questions may sound naive but I am curious 😅

plger commented 1 year ago

Hi,

Yes, you have a lower recovery rate than expected. I'm really not an expert there, but in my experience this has typically been attributable to low cell viability and/or expired/contaminated reagents (e.g. the buffer), but you'd have better luck trying to understand this with wet lab people.

Yes, scDblFinder estimates the dbr from the recovered cells. However, the thresholding is not only based on this: as described in the paper, it's also based on the ability to correctly classify artificial doublets. This often has a larger influence than the expected doublet rate, and in your case rescued the thresholding.
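The idea of a threshold that balances the expected rate against the classification of artificial doublets can be illustrated with a toy example. This is not scDblFinder's actual implementation: the simulated scores, the cost function and all variable names below are assumptions made purely for illustration.

```r
set.seed(1)

# Simulated scores for real cells: 900 singlets (low scores) and 100 true doublets (high scores)
real_scores <- c(rbeta(900, 1, 8), rbeta(100, 8, 1))

# Simulated scores for artificial doublets, which are known to be doublets
art_scores <- rbeta(500, 8, 1)

# Expected doublet rate, deliberately set too low (5% instead of the true 10%)
dbr_expected <- 0.05

# Toy cost: artificial doublets missed + deviation of the call rate from the expectation
cost <- function(th) {
  fnr <- mean(art_scores < th)            # artificial doublets scored below the threshold
  called <- mean(real_scores >= th)       # fraction of real cells called doublets
  fnr + abs(called - dbr_expected)
}

# Grid search for the threshold with the lowest cost
thresholds <- seq(0.01, 0.99, by = 0.01)
costs <- vapply(thresholds, cost, numeric(1))
best <- thresholds[which.min(costs)]

print(c(threshold = best, called_rate = mean(real_scores >= best)))
```

In this toy setting, a threshold strict enough to match the 5% expectation would miss a large share of the artificial doublets, so the lowest-cost threshold ends up calling more real cells than the expected rate suggests.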

zqun1 commented 1 year ago

Thank you very much, plger! You can close this issue now.