statgen / demuxlet

Genetic multiplexing of barcoded single cell RNA-seq
Apache License 2.0
117 stars 25 forks source link

Excessive read filtering #42

Closed plantformatics closed 5 years ago

plantformatics commented 5 years ago

I am having an issue (or lack of understanding) with the read filtering. The documentation states that the RD.TOTL specifies counts of reads overlapping variants and RD.PASS indicates reads passing quality filters. However, when I set all quality thresholds to floor values (as well as a range of other integers/floats where noted), an absurd amount of reads are tossed from the analysis. Here is a snap shot of the *.best file:

BARCODE RD.TOTL RD.PASS RD.UNIQ N.SNP
AAACGAAGTAACGGAC 1528 1 1 1 AAACTCGAGAGGTCCA 30 2 2 1 AAACTCGGTAACGGTG 650 2 2 2 AAACTGCAGTCACGCC 106578 2 2 2 AAACTGCAGTTGCTTG 1024 2 2 1 AAACTGCCAGCTGATT 13348 1 1 1

The parameters set were: demuxlet --sam alignments.bam --tag-group CB --tag-UMI UB --vcf genotypeData.vcf --field GT --geno-error 0.00001 --min-mac 1 --min-callrate 0.0001 --min-BQ 0 --min-MQ 0 --out test --min-total 0 --min-uniq 0 --min-snp 0

I should point out that I am analyzing paired-end scATAC-seq data. To get the UMI error to go away, I added a fake tag to the bam file (UB:Z:xxxxxxxx) with a random 8mer for each read. I suspect the issue is with the bam file. I have more than 9 million variants well-covered (75% of reads) by the PE data. I am also unclear what the 'callrate' refers to. Changing the genotype error, minor allele frequency, and read/variant filtering parameters have negligible effects. I have two genotypes in the pooled bam file. Any advice or suggestions would be fantastic.

Thank you.

plantformatics commented 5 years ago

Update: I think the issue may be with the VCF file. Very few SNPs are valid according to the summary output (which I never saw because its buried in the middle of STDOUT). Will post back here if someone else has a similar issue.

plantformatics commented 5 years ago

Update (2): Changing VCF format to bare bones (no INFO) and phased genotype (I have homozygous samples, so the new genotype format was changed to: 0|0 and 1|1) fixed the issue. The problem was almost no variants were being recognized.