statgen / demuxlet

Genetic multiplexing of barcoded single cell RNA-seq
Apache License 2.0

Doublets overestimation #68

Open castaway1990 opened 4 years ago

castaway1990 commented 4 years ago

Hi everyone, I've used Demuxlet since its first release, and I consistently see doublets overestimated across multiple independent experiments. With default parameters, Demuxlet predicts doublet percentages ranging from 40% (~6,000 cells recovered from 10x) to 85% (~17,000 cells recovered from 10x). I doubt these are true doublets because i) according to the 10x v3 specifications, these numbers are far from the expected rates at those recovery levels, and ii) other tools give doublet estimates more in line with the kit specification.
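To put the gap in numbers, here is a rough sketch of the expected rate. It assumes the commonly quoted rule of thumb of about 0.8% multiplets per 1,000 recovered cells; that figure and the `expected_doublet_rate` helper are illustrative assumptions, and the exact table in the kit user guide should take precedence.

```python
def expected_doublet_rate(recovered_cells, rate_per_1000=0.008):
    """Linear rule-of-thumb estimate of the multiplet fraction.

    rate_per_1000 is an ASSUMED value (about 0.8% per 1,000 recovered cells),
    not a number taken from this thread.
    """
    return rate_per_1000 * recovered_cells / 1000.0


# Recovery levels and Demuxlet doublet fractions reported above.
reported = [(6000, 0.40), (17000, 0.85)]

for cells, observed in reported:
    expected = expected_doublet_rate(cells)
    print(f"{cells} cells: expected ~{expected:.1%}, Demuxlet reported {observed:.0%}")
```

Under that assumption the expected rates are several times lower than what Demuxlet reports at both recovery levels.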

I'm using the latest version available at the time of writing. I've tried:

I've started to think that my problem comes from VCF generation. I obtain the VCF from bulk RNA-seq data, which is the same approach used in another method's paper (https://doi.org/10.1186/s13059-019-1865-2) to benchmark Demuxlet, and in their case it works. The files contain 60k to 80k variants, obtained following GATK best practices: HaplotypeCaller in GVCF mode, applying filters, and merging the different identities into one file using gatk CombineGVCFs followed by gatk GenotypeGVCFs.

The only problem I can see so far is that I'm missing the 1000g common and MAF filters suggested in the README_vcf.md file. So should I retain only loci with AF > 0.01 in 1000g (common)?
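For concreteness, the kind of filter I have in mind could be sketched as below. This is only an illustration, not the procedure from README_vcf.md: it assumes the VCF has already been annotated with a 1000 Genomes allele frequency in an INFO key (the default name `AF` here is a placeholder, and the AF that GATK writes is a cohort frequency, not a 1000g one), it drops multi-allelic sites, and it applies the threshold to the minor allele frequency. The function names are hypothetical.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed VCF for reading as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def info_value(info, key):
    """Return the value of `key` in a VCF INFO string, or None if absent."""
    for field in info.split(";"):
        name, _, value = field.partition("=")
        if name == key:
            return value
    return None


def keep_record(line, af_key="AF", min_maf=0.01):
    """True if a VCF data line is bi-allelic and has MAF above min_maf.

    af_key is an ASSUMED INFO key holding the population allele frequency.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        return False
    if "," in cols[4]:  # ALT column lists several alleles: skip the site
        return False
    try:
        af = float(info_value(cols[7], af_key))
    except (TypeError, ValueError):  # key missing or not a single number
        return False
    return min(af, 1.0 - af) > min_maf


def filter_vcf(src, dst, af_key="AF", min_maf=0.01):
    """Copy header lines and passing records from src to dst; return count kept."""
    kept = 0
    with open_text(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)
            elif keep_record(line, af_key, min_maf):
                fout.write(line)
                kept += 1
    return kept
```

A call would look like `filter_vcf("donors.vcf.gz", "donors.common.vcf", af_key="AF")`, where both file names are placeholders.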

Thanks!!

mhulke commented 2 years ago

Hey Castaway1990,

Were you able to find a solution to this? I am dealing with the same challenge. For a single-nucleus RNA-seq experiment with ~14k cells loaded, I see 30-60% of cells flagged as doublets. This is consistent in both version 1 and version 2 of demuxlet with default parameters. We suspect that the high background level in certain samples might be exacerbating the problem, but even relatively clean samples show a high doublet rate. I am currently comparing demuxlet output with souporcell and scrublet output to try to determine true doublets, but would be curious to hear of any solutions you came across.
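The comparison I'm running is roughly a majority vote per barcode, sketched below. It assumes each tool's output has already been parsed into a dict mapping barcode to a doublet flag; the parsing step, the `consensus_doublets` name, and the toy barcodes are all placeholders rather than anything from the tools themselves.

```python
def consensus_doublets(calls, min_votes=2):
    """Majority-vote doublet calls across tools.

    calls: dict of tool name -> dict of barcode -> bool (True means doublet).
    Only barcodes called by every tool are considered.
    """
    shared = set.intersection(*(set(per_tool) for per_tool in calls.values()))
    consensus = {}
    for barcode in shared:
        votes = sum(bool(per_tool[barcode]) for per_tool in calls.values())
        consensus[barcode] = votes >= min_votes
    return consensus


# Toy input; real flags would come from each tool's output files.
calls = {
    "demuxlet":   {"AAAC": True,  "AAAG": True,  "AAAT": False},
    "souporcell": {"AAAC": True,  "AAAG": False, "AAAT": False},
    "scrublet":   {"AAAC": False, "AAAG": False, "AAAT": False},
}
result = consensus_doublets(calls)
print(sorted(bc for bc, is_doublet in result.items() if is_doublet))
```

One caveat with voting: genotype-based tools can only flag doublets formed by cells from different donors, while scrublet works from expression profiles, so the tools are not expected to agree on every barcode.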

hyunminkang commented 2 years ago

The apparent increase in doublets is likely due to ambient RNA, which is not yet addressed in this version of the demuxlet software. Until I push a fix to address ambient RNA contamination, Souporcell may work better in such settings.