statgen / demuxlet

Genetic multiplexing of barcoded single cell RNA-seq
Apache License 2.0
117 stars 25 forks source link

Demuxlet filtering out vast majority of reads that 10X deems good #22

Open LuckyMD opened 6 years ago

LuckyMD commented 6 years ago

Hi!

I've been trying to use demuxlet to demultiplex a mixture of cells from two donors as a test run. I have 10x data which was processed with CellRanger version 2.1.1, and the summary stats I'm getting appear quite good (Q30 Bases in Barcode: 97.5%, Q30 Bases in RNA Read: 79.1%, Reads mapped confidently to transcriptome: 72.8%), however demuxlet filters the vast majority of my data out.

I'm running demuxlet with the following (mostly default) input: Options for input SAM/BAM/CRAM : --sam [...], --tag-group [CB], --tag-UMI [UB] Options for input VCF/BCF : --vcf [...], --field [GT], --geno-error [0.01], --min-mac [1], --min-callrate [0.50], --sm, --sm-list Output Options : --out [test], --alpha, --write-pair, --doublet-prior [0.50], --sam-verbose [1000000], --vcf-verbose [10000] Read filtering Options : --cap-BQ [40], --min-BQ [13], --min-MQ [20], --min-TD, --excl-flag [3844] Cell/droplet filtering options : --group-list, --min-total [500], --min-uniq [500], --min-snp

During the run, the read QC output from demuxlet is giving me the following:

NOTICE [2018/05/01 08:46:14] - Total number input reads : 767538903 NOTICE [2018/05/01 08:46:14] - Total number of read-QC-passed reads : 321803293 NOTICE [2018/05/01 08:46:14] - Total number of skipped reads with ignored barcodes : 0 NOTICE [2018/05/01 08:46:14] - Total number of non-skipped reads with considered barcodes : 298996268 NOTICE [2018/05/01 08:46:14] - Total number of gapped/noninformative reads : 295557137 NOTICE [2018/05/01 08:46:14] - Total number of base-QC-failed reads : 250806 NOTICE [2018/05/01 08:46:14] - Total number of redundant reads : 707077 NOTICE [2018/05/01 08:46:14] - Total number of pass-filtered reads : 2481248 NOTICE [2018/05/01 08:46:14] - Total number of pass-filtered reads overlapping with multiple SNPs : 178168

While CellRanger tells me 72.8% of my reads are mapped confidently to the transcriptome, only 2.5 million of the 767.5 million reads demuxlet finds get through the read QC. Could anyone tell me why this may be the case? Should I be setting the read QC parameters to be more lenient?

It appears a lot of my reads are deemed 'noninformative'. Could someone tell me what this is likely to mean?

Another curious aspect of the whole thing is that demuxlet appears to be finding more reads than CellRanger (767 million total reads quoted in demuxlet vs 714 million total reads in CellRanger web summary output).

I would be very grateful for some advice or a point in the right direction.

LuckyMD commented 6 years ago

Hi, I'd be very grateful for a reply. Has anyone else encountered similar problems?

Best,

Malte Lücken, DPhil Postdoc, Machine Learning Helmholtz Center Munich

jcgrenier commented 6 years ago

My guess would be that demuxlet is only keeping the reads overlapping the genotypes you are giving as a reference. You could manually verify this by running any tool (like BEDtools) able to subset your alignment file based on any match with the positions within your genotyping file.

Hope this help.

JC

LuckyMD commented 6 years ago

Thanks for the pointer. I'll give BEDtools a go.

LuckyMD commented 6 years ago

So for anyone who is facing a similar problem. The read filtering is not the issue. The issue was me setting --min-uniq to 500, as I assumed this was the minimum number of UMIs per cell. Instead, this is the minimum number of UMIs that overlap with SNPs. 500 is clearly far too high for this and you only really need 20 or 30.

The number of reads being different seems to be an issue with the number of reads quoted by CellRanger being some filtered number and not the total in the bam file.

x811zou commented 6 years ago

@LuckyMD May I ask how did you make the .vcf file for a pooled sample with 6 individuals?

LuckyMD commented 6 years ago

Hi, I actually received the vcf file from a collaborator. But I can tell you it was generated with PLINK v1.90.

Unfortunately you have to do some reordering if you have 10X data and use CellRanger though. It seems CellRanger sorts the bam file by string and not integer (e.g. "0, 1, 10, 11,...") while PLINK sorts it via integer (e.g. 0,1,2,3,...). I did the reordering by manually sorting the header and then using vcftools to make the file match the header.

I hope that helps.