amino acid kmer utility

jianshu93 commented 3 years ago

Hello Jeff,

It's really nice that you have this bit by bit encoding of sequences for DNA. You said on todo list that amino acid support will also be available. I think we need at least 8 bit right instead of 2 or 4 for AGCT, and IUAC dna (16) because there are 20 amino acid and more for IUAC rules. I am doing amino acid kmer counting for a huge database of bacterial genomes in amino acid format and I cannot find any bit by bit encoding and decoding of amino acid sequences except this one. Will amino acid kmer available soon? I will be very happy to work on this with you or whatever. email is: jianshu.zhao@gatech.edu

Many thanks,

Jianshu

jeff-k commented 3 years ago

Hi Jianshu, I've created a branch to work this out in: https://github.com/jeff-k/bio-seq/blob/amino/src/alphabet/amino.rs

I don't have a huge amount of experience working with protein sequences myself but your use-case sounds like a good fit for this project. I would be very interested in understanding better how to design a bit-string representation for amino acids that's useful.

From my naive perspective I can think of:

a 5 bit (log2(21) mapping to the IUPAC codes + Stop)
a 6 bit code that conveniently matches the size of 3 bases in the 2-bit encoding
8 bits, effectively just using the ASCII representation of the IUPAC codes

I've gone with the 6-bit code for now. I'm interested in the rules you mentioned, though.

There will also have to be consideration that this just covers the standard code.

In addition to the encoding we choose, I've noted what I think might be some useful conversions to implement in the README. For example, Seq<Dna> to Seq<Amino>; Seq<Amino> to Seq<Iupac>; and Seq<Iupac> to an iterator of Seq<Dna>.

Are there any particular operations on the Amino sequences that would be useful?

jianshu93 commented 3 years ago

Hello Jeff,

See an interesting one here: https://github.com/jean-pierreBoth/kmerutils/blob/master/src/base/alphabet.rs

It's only for DNA (AGCT) using 2 bit. I think we can encode 5 bit (which is enough for bioinformatic predicted gene in aa format, 20 only) for amino acid. I think 8 bit is only for IUAC rule amino acid where experimental testing of amino acid is not clear (e.g., can be either 2 different aa), but for bioinformatically predicted gene in aa format according to their genomes, 20 is good (stop Condon is also not included)

See an attached amino acid gene file predicted by the famous bacteria gene prediction software prodigal (attached)

pico127_pico127.002-1.faa.zip

What do you think? Compression is a bit more complicated for 5 bit.

Jianshu

jeff-k commented 2 years ago

I've made a procedural macro that makes it a lot nicer to define custom encodings: https://github.com/jeff-k/bio-seq/blob/main/src/codec/amino.rs

And the Kmer struct is now generalised for codecs too, so we can do:

for kmer in amino!("SSLMNHKKL").kmers::<3>() { }

Note that kmers are now backed by usize, so K is limited to 10 in the case of these 6-bit coded amino acids. However I think it makes sense to have a bigk::Kmer<K> type too that's backed by an arbitrary length bitvector.

What do you think?

I still have to figure out ownership stuff and conversions between DNA and amino acid types.

jeff-k commented 2 years ago

Hello @jianshu93, it's been a while since you posted this but your use case has been a helpful feature suggestion.

If you're still interested, I'm experimenting with another library that opens fastq/fasta files as bio-seq structs: https://github.com/jeff-k/bio-streams

Loading a fasta as amino acid sequences in the 6-bit encoding looks like:

let faa: Fasta<BufReader<File>, Seq<Amino>> = Fasta::new(BufReader::new(File::open(&path).unwrap()));

And then k-mers for the contigs in this file can be iterated over:

for contig in faa {
    for kmer in contig.seq.kmers::<K>() {
        ....

There's an example that collects some basic stats here: https://github.com/jeff-k/bio-seq/blob/main/examples/aminokmers.rs, your attached file was very helpful for testing this. It can be run from the bio-seq repo with:

$ cargo build --release --examples
$ target/release/examples/aminokmers pico127_pico127.002-1.faa
...
LAGL    132
ALGL    140
AALL    136
LALL    101
AAAV    104
AALS    105
Read 2800 contigs. Showing counts of 4-mers that occur more than 100 times (813864 total). 16648208/16777216 of possible 4-mers have a count of 0.

jeff-k / bio-seq

amino acid kmer utility #1