statgen / demuxlet

Genetic multiplexing of barcoded single cell RNA-seq
Apache License 2.0

Understand demuxlet output #86

Closed Zepeng-Mu closed 3 years ago

Zepeng-Mu commented 3 years ago

Hi, I have some questions about applying demuxlet to 10X scATAC data. First, I'm not sure how to interpret the .single and .best files. For example, in the .single file:

AAACGAAAGAAAGGGT-1  PBMC001 12228   988 496 496 -720.18581  -471.98225  1
AAACGAAAGAAAGGGT-1  PBMC002 12228   988 496 496 -754.03251  -471.98225  2e-15
AAACGAAAGAAAGGGT-1  PBMC003 12228   988 496 496 -772.84101  -471.98225  1.36e-23
AAACGAAAGAAAGGGT-1  PBMC004 12228   988 496 496 -768.06147  -471.98225  1.61e-21

where the posterior suggests this barcode may come from PBMC001, but looking at the .best file:

AAACGAAAGAAAGGGT-1  12228   988 496 496 DBL-PBMC001-PBMC004-0.500   PBMC001 -519.0389   PBMC002 -552.8874   -270.8262   PBMC001 PBMC004 0.500   -326.8617   -519.0389   -566.9160   -519.0389   -326.8617   -270.8262   1   1

this barcode is called a doublet. Is this expected to happen often? Perhaps the best way to use demuxlet is to look only at the .best file? Still, the disagreement between .single and .best surprises me, and I'm concerned that it may indicate a problem with my data.
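To make the comparison concrete, here is a minimal sketch of how the two files might be read side by side. It assumes (this is my reading of the numbers, not something confirmed by the demuxlet documentation) that the last column of the .single rows is a posterior normalised over the four singlet hypotheses only, with equal priors, and that in the .best row the value after the first sample ID is the best singlet log-likelihood and the value after `0.500` is the best doublet log-likelihood. The function name `singlet_posterior` is hypothetical.

```python
import math

# Singlet log-likelihoods copied from the four .single rows above.
single_llk = {
    "PBMC001": -720.18581,
    "PBMC002": -754.03251,
    "PBMC003": -772.84101,
    "PBMC004": -768.06147,
}


def singlet_posterior(llk):
    """Normalise exp(log-likelihood) across samples (assumes equal priors)."""
    top = max(llk.values())  # subtract the maximum to avoid underflow
    weights = {sample: math.exp(value - top) for sample, value in llk.items()}
    total = sum(weights.values())
    return {sample: weight / total for sample, weight in weights.items()}


posterior = singlet_posterior(single_llk)

# Values copied from the .best row (column meaning is an assumption, see above).
best_singlet_llk = -519.0389
best_doublet_llk = -326.8617
doublet_minus_singlet = best_doublet_llk - best_singlet_llk

for sample, prob in posterior.items():
    print(f"{sample}\t{prob:.3g}")
print(f"doublet LLK - singlet LLK = {doublet_minus_singlet:.4f}")
```

If those assumptions hold, a posterior of 1 for PBMC001 in .single only says that PBMC001 is the best of the singlet hypotheses. It does not compare against doublets, so it would not contradict a doublet call in .best, where the doublet log-likelihood is much higher than the best singlet one.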

Another issue I encountered is that most of the barcodes in my data are called doublets. The majority of the singlets I found have very few reads, on the order of hundreds. These cells cannot really be used in analysis, but does the fact that they can be identified as singlets suggest that even a few hundred reads carry enough information to distinguish among 4 individuals? If so, what could explain why most cells with many fragments (e.g. > 3000) are called doublets? Could this mean that my genotype data are very poor?
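To quantify the depth pattern described above, the calls could be cross-tabulated against read depth. The following is a rough sketch only: it assumes the .best file is whitespace-delimited, that the second column is the total read count, and that the sixth column starts with the call type (e.g. `DBL-...`). The helper names `call_type` and `tabulate` are hypothetical, the 3000 cutoff is simply the threshold mentioned above, and the single input row is the one quoted earlier in this post.

```python
from collections import Counter

# One row copied from the .best output above; a real run would read every
# data line of the file (skipping any header line) instead.
best_rows = [
    "AAACGAAAGAAAGGGT-1  12228   988 496 496 DBL-PBMC001-PBMC004-0.500   "
    "PBMC001 -519.0389   PBMC002 -552.8874   -270.8262   PBMC001 PBMC004 "
    "0.500   -326.8617   -519.0389   -566.9160   -519.0389   -326.8617   "
    "-270.8262   1   1",
]


def call_type(best_field):
    """Return the prefix of the call string, e.g. 'DBL' from 'DBL-PBMC001-...'."""
    return best_field.split("-", 1)[0]


def tabulate(rows, depth_cutoff=3000):
    """Count barcodes per (call type, depth bin)."""
    counts = Counter()
    for row in rows:
        fields = row.split()
        total_reads = int(fields[1])  # assumed: column 2 is the total read count
        depth_bin = f">{depth_cutoff}" if total_reads > depth_cutoff else f"<={depth_cutoff}"
        counts[(call_type(fields[5]), depth_bin)] += 1
    return counts


counts = tabulate(best_rows)
for (call, depth_bin), n in sorted(counts.items()):
    print(f"{call}\t{depth_bin}\t{n}")
```

A table like this, computed over all barcodes, would show directly whether doublet calls are concentrated among the high-depth barcodes.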

Thanks!