phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes
Apache License 2.0

Add Aho-Corasick run mode for faster subtyping #23

Closed peterk87 closed 6 years ago

peterk87 commented 6 years ago

With the Aho-Corasick Automaton from the pyahocorasick Python library, it's possible to subtype reads or contigs much more quickly than with BLAST or Jellyfish.
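To illustrate why this is fast, here is a minimal pure-Python sketch of Aho-Corasick matching: every scheme k-mer goes into one automaton, and each sequence is scanned once regardless of how many k-mers there are. This is not the code in this PR and not pyahocorasick itself (which implements the same idea in C). The function names, the short k-mers, and the subtype labels are made up for the example.

```python
from collections import deque


def build_automaton(kmers):
    """Build an Aho-Corasick automaton from a dict of k-mer -> label."""
    goto = [{}]   # goto[state][char] -> next state (the trie)
    fail = [0]    # failure link for each state
    out = [[]]    # (k-mer, label) pairs recognised at each state
    for kmer, label in kmers.items():
        state = 0
        for ch in kmer:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((kmer, label))
    # Breadth-first pass to fill in failure links (depth-1 states fail to root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_kmers(seq, automaton):
    """Yield (start, k-mer, label) for every k-mer found in seq, in one pass."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(seq):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for kmer, label in out[state]:
            yield i - len(kmer) + 1, kmer, label


# Hypothetical toy scheme; real scheme k-mers are much longer.
scheme = {"ACGT": "subtype-1", "GTTA": "subtype-2"}
automaton = build_automaton(scheme)
hits = list(find_kmers("TTACGTTAGG", automaton))
# hits == [(2, "ACGT", "subtype-1"), (4, "GTTA", "subtype-2")]
```

Note that overlapping k-mers are both reported without rescanning the sequence; with pyahocorasick the equivalent steps are `add_word`, `make_automaton`, and `iter`.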

In this PR, I've also added support for gzipped FASTA or FASTQ files.
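As a rough sketch of what transparent gzip handling can look like (not necessarily how this PR does it), the snippet below checks the two-byte gzip magic number instead of trusting the file extension, then yields records from either FASTA or FASTQ. The names `open_maybe_gzip` and `read_sequences` are hypothetical, and the FASTQ branch assumes strict four-line records.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, decompressing on the fly if it is gzipped."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_sequences(path):
    """Yield (header, sequence) tuples from a FASTA or FASTQ file."""
    with open_maybe_gzip(path) as fh:
        first = fh.readline()
        if not first:
            return
        if first.startswith(">"):
            # FASTA: sequence may span several lines.
            header, chunks = first[1:].strip(), []
            for line in fh:
                line = line.strip()
                if line.startswith(">"):
                    yield header, "".join(chunks)
                    header, chunks = line[1:], []
                elif line:
                    chunks.append(line)
            yield header, "".join(chunks)
        elif first.startswith("@"):
            # FASTQ: assumes exactly four lines per record.
            header = first[1:].strip()
            while True:
                seq = fh.readline().strip()
                fh.readline()  # '+' separator line
                fh.readline()  # quality line
                yield header, seq
                nxt = fh.readline()
                if not nxt.strip():
                    break
                header = nxt[1:].strip()
        else:
            raise ValueError("not a FASTA or FASTQ file: %s" % path)
```

Sniffing the magic bytes means a compressed file without a `.gz` suffix is still read correctly.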

I've added tests for the new AC run mode and cleaned up the existing tests.

By default, AC will be used unless the user provides the --slow command-line argument.

The only downside to the AC method is that, since it doesn't compute counts for all k-mers, you cannot calculate the min k-mer coverage threshold automatically using the method currently in bio_hansel.