phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes
Apache License 2.0

QC module not failing samples where positive and negative kmers are detected for the same genome position #98

Closed glabbe closed 5 years ago

glabbe commented 5 years ago

A QC module fix is required to flag datasets as “mixed” when both a positive kmer and a negative kmer are detected for the same genome position, within the set kmer frequency thresholds.

-In MTB strain ERR163996, the kmers 62657-4.1 and negative62657-4.1 are both detected in the fastq file, at 44X and 11X coverage respectively. The strain subtype result is 4.1.2 with QC PASS using min_kmer_coverage=8 (default) and scheme tb_speciation v1.0.5. This sample should be flagged as a mixed sample because of the conflicting positive and negative kmers at the same position, so why does it pass QC? When running the same WGS dataset with min_kmer_coverage=3, the sample does FAIL QC, with several positive kmers of subtype 4.3.4.2.1 detected at 3-4X coverage.

-Similarly, in MTB strain ERR182041, the kmers 4260268-4.6.1 and negative4260268-4.6.1 are both detected in the fastq file, at 54X and 8X coverage respectively. The strain subtype result is 4.6.1.1 with QC PASS using min_kmer_coverage=8 (default) and scheme tb_speciation v1.0.5. This sample should also be flagged as a mixed sample because of the conflicting positive and negative kmers at the same position. When running the same WGS dataset with min_kmer_coverage=3, the sample does FAIL QC, with several positive kmers of subtypes 4.3.4.2.1 and 1.1.3 detected at 3-7X coverage.

-Similarly, in MTB strain ERR221649, the kmers 3977226-4.3.4 and negative3977226-4.3.4 are both detected, at 115X and 14X coverage respectively, and the kmers 1132368-4.3.4.2 and negative1132368-4.3.4.2 are both detected, at 114X and 8X respectively. The strain subtype result is 4.3.4.2.1 with QC PASS at default QC settings. This sample should also be flagged as a mixed sample because of the conflicting positive and negative kmers at the same position. When running the same WGS dataset with min_kmer_coverage=3, the sample does FAIL QC, with a positive kmer of subtype 4.9 detected at 5X coverage.

-There are other examples: strain ERR228033 for kmers 764995-4.3 and negative764995-4.3 at 71X and 14X coverage, result=subtype 4.3.4.2 and QC PASS with default settings; and strain ERR2515140 for kmers 3273107-3 and negative3273107-3 at 101X and 8X coverage, result=subtype 3.1 and QC PASS with default settings. Both failed QC with multiple conflicting subtypes detected when using min_kmer_coverage=3.
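The check requested above can be sketched roughly as follows. This is only an illustration, not biohansel's actual QC code: the helper names `parse_kmer_name` and `find_mixed_positions` and the plain dict of kmer name to coverage are assumptions made for the example, and only the minimum coverage threshold is modelled.

```python
from collections import defaultdict

NEGATIVE_PREFIX = "negative"


def parse_kmer_name(name):
    """Split a kmer name like 'negative62657-4.1' into (position, subtype, is_positive)."""
    is_positive = not name.startswith(NEGATIVE_PREFIX)
    core = name if is_positive else name[len(NEGATIVE_PREFIX):]
    position, subtype = core.split("-", 1)
    return int(position), subtype, is_positive


def find_mixed_positions(kmer_coverage, min_kmer_coverage=8):
    """Return positions where both the positive and the negative kmer pass the threshold.

    kmer_coverage is a hypothetical mapping of kmer name -> observed coverage.
    """
    seen = defaultdict(dict)
    for name, coverage in kmer_coverage.items():
        if coverage < min_kmer_coverage:
            continue  # below the threshold: treated as not detected
        position, subtype, is_positive = parse_kmer_name(name)
        seen[(position, subtype)]["positive" if is_positive else "negative"] = coverage
    # A conflict is a (position, subtype) key with both kinds of kmer detected.
    return {key: calls for key, calls in seen.items() if len(calls) == 2}


# Coverage values reported above for ERR163996.
observed = {"62657-4.1": 44, "negative62657-4.1": 11}
print(find_mixed_positions(observed))
# → {(62657, '4.1'): {'positive': 44, 'negative': 11}}
```

With the ERR182041 values (54X and 8X), the same sketch also reports a conflict at the default threshold, since 8X is not below min_kmer_coverage=8.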

tb_speciation_scheme_v1.0.5.txt

DarianHole commented 5 years ago

Looking into this, it seems that the QC module only checks the final subtype's kmers for cases where both the positive and negative kmers match. At least, that's what I believe I'm seeing.

I'll keep looking at it and see if that's true.

Edit:

After a bit more digging, I strongly believe that in this case the subtype is flagged as having consistent subtype kmer calls (since 4, 4.1, and 4.1.2 follow a consistent hierarchy), and the other function that should catch this, get_conflicting_kmers, only looks for conflicts on the final subtype call. That means the sample gets a QC FAIL only if negative 4.1.2 kmers are detected.
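To make the suspected gap concrete, here is a small sketch contrasting a check limited to the final subtype with one that walks the whole hierarchy. It does not reproduce the real get_conflicting_kmers; the function names and the list of detected negative-kmer subtypes are hypothetical and only mirror the behaviour described above.

```python
def lineage(subtype):
    """Expand a subtype into its hierarchy, e.g. '4.1.2' -> ['4', '4.1', '4.1.2']."""
    parts = subtype.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def conflicts_final_subtype_only(final_subtype, negative_kmer_subtypes):
    """Suspected current behaviour: only negative kmers of the final subtype count."""
    return [s for s in negative_kmer_subtypes if s == final_subtype]


def conflicts_whole_lineage(final_subtype, negative_kmer_subtypes):
    """Broader check: negative kmers of any subtype in the called hierarchy count."""
    ancestors = set(lineage(final_subtype))
    return [s for s in negative_kmer_subtypes if s in ancestors]


# ERR163996: called 4.1.2, with negative62657-4.1 detected at 11X.
# '4.3' stands in for an expected negative kmer outside the called hierarchy.
detected_negative = ["4.1", "4.3"]
print(conflicts_final_subtype_only("4.1.2", detected_negative))  # → []
print(conflicts_whole_lineage("4.1.2", detected_negative))       # → ['4.1']
```

Under these assumptions the narrow check finds nothing for ERR163996, which would explain the QC PASS, while the hierarchy-wide check picks up the 4.1 conflict.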

Solutions I can think of trying are: