phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes
Apache License 2.0
25 stars 7 forks source link

Purpose of max kmer frequency? #142

Open peterk87 opened 3 years ago

peterk87 commented 3 years ago

It's not clear why this threshold was implemented and what kinds of situations it's supposed to be help with.

It seems like it would accidentally exclude certain kmers from subtype calling if the frequency of those kmers is "too high":

https://github.com/phac-nml/biohansel/blob/d20a00be813e72a678c3e8448ca783268b71c3cd/bio_hansel/subtyper.py#L287

glabbe commented 3 years ago

@peterk87: The max frequency threshold could be useful when analyzing metagenomics datasets. For example, Salmonella schemes k-mers are generally present at low frequency (~10-100 fold genome coverage) in the metagenomics samples that we have tested so far. However, we have found that a handful of the S. Enteritidis scheme k-mers were present at 3,000-6,000 fold coverage in some metagenomics samples, and we suspect that these match E. coli genome sequences. By excluding these high coverage k-mers, we get a more accurate average genome coverage estimate (represented by the avg_kmer_coverage value in the BioHansel results) for the pathogen targeted by the scheme. I would recommend increasing the default value to a very high number, in order to avoid accidental exclusion of valid k-mers in most situations.

peterk87 commented 3 years ago

Hi @glabbe

What about using median kmer frequency instead to exclude these extremely high frequency outliers? Kind of like how median household income is more informative than the mean value since the 0.1% will drag up the mean value substantially.

glabbe commented 3 years ago

That sounds good to me @peterk87. If we implement the median frequency calculation, we could keep an option to implement a user-defined maximum frequency threshold, but to have it "off" by default?

peterk87 commented 3 years ago

I think off by default would be a good idea and convenient for viral amplicon sequencing data especially.

Based on the Salmonella example, it sounds like having a QC message warning of extremely high frequency kmers rather than excluding those kmers from the subtype calling may be more appropriate.

glabbe commented 3 years ago

Yes, agreed! That would be ideal, and most informative to the user.