refresh-bio / KMC

Fast and frugal disk based k-mer counter

enable amino acid kmer matching through translation to canonicalized codons #142

Open notestaff opened 4 years ago

notestaff commented 4 years ago

Add support for matching amino acid kmers. An amino acid kmer can be represented as a nucleotide kmer in which each amino acid is mapped to a canonical (e.g. lexicographically smallest) codon, so an amino acid kmer of length k becomes a nucleotide kmer of length 3k. An amino acid FASTA file can then be mapped on the fly to a nucleotide file from which kmers are gathered as normal. tblastn/tblastx-like matching can also be enabled by adding options to do three- or six-frame translation of each input nucleotide sequence, then representing the resulting amino acid sequences as nucleotide sequences with canonical codons as above before extracting kmers; this would again be done on the fly. So the only change is to the code that extracts kmers from FASTAs. @marekkokot
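To make the three-/six-frame idea concrete, here is a minimal Python sketch, not KMC code. It assumes the standard genetic code (NCBI translation table 1) and takes "canonical" to mean the lexicographically smallest synonymous codon. The function names (`recode_frame`, `recode_all_frames`) are hypothetical, and emitting `NNN` for codons that cannot be translated is an assumption of this sketch.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Canonical codon per amino acid: the lexicographically smallest synonymous codon.
AA_TO_CODON = {}
for codon, aa in CODON_TO_AA.items():
    if aa not in AA_TO_CODON or codon < AA_TO_CODON[aa]:
        AA_TO_CODON[aa] = codon

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def recode_frame(dna, frame):
    """Replace every codon in one reading frame with its canonical synonym.

    Codons that cannot be translated (e.g. containing N) become NNN;
    that placeholder is an assumption of this sketch.
    """
    out = []
    for i in range(frame, len(dna) - 2, 3):
        aa = CODON_TO_AA.get(dna[i:i + 3])
        out.append(AA_TO_CODON[aa] if aa is not None else "NNN")
    return "".join(out)


def recode_all_frames(dna, six_frames=False):
    """Return the recoded sequence for 3 forward frames, plus 3 reverse frames if requested."""
    dna = dna.upper()
    strands = [dna] + ([reverse_complement(dna)] if six_frames else [])
    return [recode_frame(strand, frame) for strand in strands for frame in range(3)]
```

For example, `recode_all_frames("ATGAAGTGG")[0]` gives `"ATGAAATGG"`: the lysine codon `AAG` is replaced by its smaller synonym `AAA`, while `ATG` and `TGG` have no synonyms and stay as they are. Each recoded frame would then be handed to the kmer extractor as an ordinary nucleotide sequence.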

notestaff commented 4 years ago

(Or the same mapping could be applied to the kmers after extracting them.)

ritah-nabunje commented 4 years ago

How about skipping the translation step and taking amino acid sequences as input? Would KMC help?

notestaff commented 4 years ago

I’m guessing the same approach could work: back-translating the input AA sequences to canonical codons, for a DNA-based representation inside KMC. @marekkokot?
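A rough sketch of what that back-translation could look like as a preprocessing step outside KMC, in Python. It assumes the standard genetic code and the lexicographically smallest codon as the canonical one. The names `back_translate` and `back_translate_fasta` are hypothetical, and mapping unrecognised residues (X, B, Z, ...) to `NNN` is an assumption of the sketch, not something KMC defines.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Canonical codon per amino acid: the lexicographically smallest synonymous codon.
AA_TO_CODON = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    AA_TO_CODON[aa] = min(codon, AA_TO_CODON.get(aa, codon))


def back_translate(protein):
    """Map each residue to its canonical codon; unknown residues become NNN (assumption)."""
    return "".join(AA_TO_CODON.get(aa, "NNN") for aa in protein.upper())


def back_translate_fasta(lines):
    """Yield FASTA lines with sequence lines back-translated; headers pass through unchanged.

    Working line by line is safe because every residue maps independently
    to exactly three bases, so line wrapping does not affect the result.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(">"):
            yield line
        else:
            yield back_translate(line)
```

Usage would be along the lines of `for out in back_translate_fasta(open("proteins.fa")): print(out)`, with `proteins.fa` as a placeholder file name. Note that in the resulting DNA, a nucleotide kmer of length 3k corresponds to an amino acid kmer of length k only when it starts on a codon boundary, so the extraction step would need to account for that.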