phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes
Apache License 2.0

Degenerate Base Expansion addition #91

Closed DarianHole closed 5 years ago

DarianHole commented 5 years ago

Addressing Issue #60:

Added:

DarianHole commented 5 years ago

The previous bugfix should probably be implemented before this one.

DarianHole commented 5 years ago

I currently run the kmer check before the automaton creation, but as an isolated step, so the scheme fasta is parsed twice. This is because of how I keep track of the number of kmers.

I assume I should move the check into the automaton creation instead, to lower run times and make the code cleaner. I'll see if I can come up with a solution for this.

DarianHole commented 5 years ago

I'll throw the code here; let me know which way is wanted. This method can run the check during automaton creation, so we don't have to parse the scheme twice. However, the tradeoff is that we make the kmers as we go and then cut off if the number gets too high. I'm not sure which way is faster.

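import logging

# Assumes `d` maps each base code to the list of bases it can expand to
# and `kmers` holds the kmer sequences from the scheme fasta; both are
# defined elsewhere.

def check_total_kmers(kmer, total):
    # A degenerate kmer expands to the product of the options at each position
    kmer_number = 1
    for char in kmer:
        kmer_number *= len(d[char])
    total += kmer_number
    if total > 150:
        logging.error('Did it work?')
    return total  # always return the count so the caller can keep adding

total = 0
for kmer in kmers:
    total = check_total_kmers(kmer, total)
    if total > 150:
        break  # cut off once the limit is passed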
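
For comparison, here is a minimal, self-contained sketch of the single-pass idea: count each degenerate kmer's expansions, stop as soon as the running total passes the cap, and otherwise yield the expanded kmers. The `IUPAC` table holds the standard nucleotide ambiguity codes, but the names `IUPAC`, `MAX_KMERS`, `TooManyKmersError`, `count_expansions` and `expand_kmers` are illustrative assumptions, not biohansel's actual API. The 150 limit is taken from the snippet above.

```python
import itertools

# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}

MAX_KMERS = 150  # same cutoff as the snippet above


class TooManyKmersError(ValueError):
    """Hypothetical error raised once the expanded kmer total passes the cap."""


def count_expansions(kmer):
    """Return how many concrete kmers a degenerate kmer expands to."""
    n = 1
    for base in kmer:
        n *= len(IUPAC[base])
    return n


def expand_kmers(kmers, max_kmers=MAX_KMERS):
    """Yield every concrete kmer in one pass, stopping once the cap is exceeded."""
    total = 0
    for kmer in kmers:
        # Check the count first so an oversized kmer is never materialised.
        total += count_expansions(kmer)
        if total > max_kmers:
            raise TooManyKmersError(
                f'{total} expanded kmers exceeds the limit of {max_kmers}'
            )
        # Cartesian product over the allowed bases at each position.
        for combo in itertools.product(*(IUPAC[base] for base in kmer)):
            yield ''.join(combo)
```

Because the count is a cheap product of per-position options, checking it before expanding each kmer avoids generating sequences that would be thrown away, and the scheme only needs to be read once. Whether this is actually faster than a separate pre-check pass would still need to be measured.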