jeff-k / bio-seq

Bit packed and well-typed biological sequences
MIT License
21 stars 2 forks source link

Add support for multiple genetic codes #3

Closed Benjamin-Lee closed 9 months ago

Benjamin-Lee commented 1 year ago

I've been looking all over for a Rust library that supports multiple genetic codes. This library seems to be the best translation library, so I hope this is on your radar. Thanks!

jeff-k commented 1 year ago

Hey @Benjamin-Lee, this is the type of use-case I'd like to cover with this. Thanks!

I'm thinking that the current From<Kmer<Dna, 3>> for Amino isn't the way to go. For safety and generality it would be cool if

So with:


trait TranslationTable<A: Codec> {
    fn to_amino(self, codon: Kmer<A, 3>) -> Amino;
    fn to_codon(self, amino: Amino) -> Option<Kmer<A, 3>>;
}

There can be a Standard struct that implements these methods the same way that translation is implemented now with zero cost (I'm not a fan of the all-caps name, clippy is):

const STANDARD: Standard = Standard;

But then users can define their own Custom translation table:

impl Custom {
    pub fn from_vec(table: Vec<Amino>) -> Self {
        Custom { table }
    }
}

impl<A: Codec> TranslationTable<A> for Custom {
    fn to_amino(self, codon: Kmer<A, 3>) -> Amino {
        self.table[Into::<usize>::into(codon)]
    }

I've pushed these changes but they aren't ready yet. The old test cases have been updated:

       let seq: Seq<Dna> =
            dna!("AATTTGTGGGTTCGTCTGCGGCTCCGCCCTTAGTACTATGAGGACGATCAGCACCATAAGAACAAA");
        let aminos: Seq<Amino> = seq
            .kmers()
            .map(|kmer| STANDARD.to_amino(kmer))
            .collect::<Seq<Amino>>();

and the usage looks very similar to the original. That type annotation on collect is probably unnecessary now.

What I'm mulling over now is how to, or whether to support fallible translations. This is tempting because I want to support IUPAC to amino acid sequences. A lot of the triples of iupac ambiguity codes unambiguously represent amino acids (and vice-versa!) so we'd be able to get rust's type system to cover some error prone uses.

The goal is that this library could be something someone uses to implement a tool like FACIL or Codetta in rust.

jeff-k commented 9 months ago

I'm somewhat satisfied with this for now. There are details at the bottom of the README now, and lots of test cases in the translation module:

pub trait TranslationTable<A: Codec, B: Codec> {
    fn to_amino(&self, codon: &SeqSlice<A>) -> B;
    fn to_codon(&self, amino: B) -> Result<Seq<A>, TranslationError>;
}

/// A partial translation table where not all triples of characters map to amino acids
pub trait PartialTranslationTable<A: Codec, B: Codec> {
    fn try_to_amino(&self, codon: &SeqSlice<A>) -> Result<B, TranslationError>;
    fn try_to_codon(&self, amino: B) -> Result<Seq<A>, TranslationError>;
}

The standard genetic code is provided in the translation::standard module:

use crate::prelude::*;
use crate::translation::standard::STANDARD;
use crate::translation::TranslationTable;

let seq: Seq<Dna> =
    dna!("AATTTGTGGGTTCGTCTGCGGCTCCGCCCTTAGTACTATGAGGACGATCAGCACCATAAGAACAAA");

let aminos: Seq<Amino> = seq
    .windows(3)
    .map(|codon| STANDARD.to_amino(&codon))
    .collect::<Seq<Amino>>();

assert_eq!(
    aminos,
    amino!("NIFLCVWGGVFSRVSLCARGALSPRAPPLL*SVYTLYM*ERGDTRDISQSAHTPHI*KRENTQK")
);

Instantiate a custom translation table from a type that implements Into<HashMap<Seq<A>, B>>:

let codon_mapping: [(Seq<Dna>, Amino); 6] = [
    (dna!("AAA"), Amino::A),
    (dna!("ATG"), Amino::A),
    (dna!("CCC"), Amino::C),
    (dna!("GGG"), Amino::E),
    (dna!("TTT"), Amino::D),
    (dna!("TTA"), Amino::F),
];

let table = CodonTable::from_map(codon_mapping);

let seq: Seq<Dna> = dna!("AAACCCGGGTTTTTATTAATG");
let mut amino_seq: Seq<Amino> = Seq::new();

for codon in seq.chunks(3) {
    amino_seq.push(table.try_to_amino(codon).unwrap());
}

assert_eq!(amino_seq, amino!("ACEDFFA"));

Implementing the TranslationTable trait directly:

struct Mitochondria;

impl TranslationTable<Dna, Amino> for Mitochondria {
    fn to_amino(&self, codon: &SeqSlice<Dna>) -> Amino {
        if *codon == dna!("AGA") {
            Amino::X
        } else if *codon == dna!("AGG") {
            Amino::X
        } else if *codon == dna!("ATA") {
            Amino::M
        } else if *codon == dna!("TGA") {
            Amino::W
        } else {
            Amino::unsafe_from_bits(Into::<u8>::into(codon))
        }
    }

    fn to_codon(&self, _amino: Amino) -> Result<Seq<Dna>, TranslationError> {
        unimplemented!()
    }
}