Open notestaff opened 4 years ago
(Or could do the same mapping on kmers after extracting them.)
How about skipping the translation step and taking amino acid sequences as input? Would KMC help?
I’m guessing the same approach could work: back-translating the input AA sequences to canonical codons, for DNA-based representation inside KMC. @marekkokot ?
Add support for matching amino acid kmers. An amino acid kmer can be represented as a nucleotide kmer in which each amino acid is mapped to a canonical (e.g. lexicographically smallest) codon. An amino acid FASTA file can then be mapped on-the-fly to a nucleotide file from which kmers can be gathered as normal. tblastn/tblastx-like matching can also be enabled by adding options to do three- or six-frame translations of each input nucleotide sequence, then representing the resulting amino acid sequences as nucleotide sequences with canonical codons as above, before extracting kmers; this would again be done on-the-fly. So the only change is to the code that extracts kmers from FASTAs. @marekkokot
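To make the proposal concrete, here is a minimal Python sketch of the mapping, not KMC code. It assumes the standard genetic code (NCBI translation table 1), picks the lexicographically smallest codon per amino acid, and maps stop codons to their smallest codon as well. All names (`back_translate`, `translate`, `canonicalized_frames`) are hypothetical, and ambiguous residues or bases (e.g. `X`, `N`) are not handled.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Canonical codon = lexicographically smallest codon encoding each amino acid.
CANONICAL_CODON = {}
for codon, aa in CODON_TO_AA.items():
    if aa not in CANONICAL_CODON or codon < CANONICAL_CODON[aa]:
        CANONICAL_CODON[aa] = codon

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def back_translate(protein):
    """Map an amino acid sequence to its canonical-codon nucleotide sequence.

    Raises KeyError for residues outside the standard 20 amino acids and '*'.
    """
    return "".join(CANONICAL_CODON[aa] for aa in protein.upper())


def translate(dna):
    """Translate a nucleotide sequence in frame 0, dropping a trailing partial codon."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def canonicalized_frames(dna, six_frame=True):
    """Yield the canonical-codon form of each three- or six-frame translation."""
    dna = dna.upper()
    strands = [dna, reverse_complement(dna)] if six_frame else [dna]
    for strand in strands:
        for offset in range(3):
            yield back_translate(translate(strand[offset:]))


if __name__ == "__main__":
    print(back_translate("MKL"))
    for frame in canonicalized_frames("ATGAAGTTG"):
        print(frame)
```

With this mapping, an amino acid kmer of length k corresponds to a nucleotide kmer of length 3k, so synonymous codons such as `AAG` and `AAA` (both lysine) collapse to the same nucleotide kmer.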