jeff-k / bio-seq

Bit packed and well-typed biological sequences
MIT License

Masked sequences #5

Open J-Wall opened 2 weeks ago

J-Wall commented 2 weeks ago

Thanks for sharing this crate with the community, Jeff. It looks really clean!

I have some suggestions for additional codecs to support masked sequences. Of course, I can easily implement these in my own application thanks to your Codec derive macro, but following the philosophy of this crate, it might make sense to implement them once in case others want to use them.

Often, we deal with DNA sequences which can be soft-masked, represented by lowercase, e.g.

ACGTaaatACGT
    |--| <- soft-masked region

or hard-masked, represented by N/n. Sometimes both at the same time. Therefore we have an alphabet of 10 characters: A, C, G, T, N, a, c, g, t, and n. Four bits give 2⁴ = 16 codes, which covers that with 6 to spare, so you could throw in gap (-) and padding (.) characters:

use bio_seq::codec::Codec;
use bio_seq_derive::Codec;

#[derive(Clone, Copy, Debug, PartialEq, Codec)]
#[bits(4)]
#[repr(u8)]
pub enum MaskedDna {
    A = 0b0000,
    C = 0b0001,
    G = 0b0010,
    T = 0b0011,
    #[alt(0b1010, 0b1001)]
    N = 0b1000,
    #[display('a')]
    ASoftMasked = 0b0100,
    #[display('c')]
    CSoftMasked = 0b0101,
    #[display('g')]
    GSoftMasked = 0b0110,
    #[display('t')]
    TSoftMasked = 0b0111,
    #[display('n')]
    #[alt(0b1110, 0b1101)]
    NSoftMasked = 0b1100,
    #[display('-')]
    Gap = 0b1011,
    #[display('.')]
    Pad = 0b1111,
}
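
One property of the layout above is that each lowercase code differs from its uppercase counterpart only in bit 2 (0b0100), including the alternate codes for N and n. A minimal sketch of what that allows, working on raw u8 codes rather than the derived enum; the function and constant names are hypothetical and not part of bio-seq, and Gap and Pad are special-cased because they do not follow the pattern:

```rust
/// Bit 2 separates upper- from lowercase codes in the proposed 4-bit layout.
const SOFT_MASK_BIT: u8 = 0b0100;
/// Gap and Pad codes copied from the enum above; they are not bases.
const GAP: u8 = 0b1011;
const PAD: u8 = 0b1111;

/// True for a, c, g, t, n and the alternate n codes.
/// Pad has bit 2 set but is not a base, so it is excluded explicitly.
fn is_soft_masked(code: u8) -> bool {
    code != PAD && (code & SOFT_MASK_BIT) != 0
}

/// Lowercase a base by setting the mask bit; Gap and Pad pass through unchanged.
fn soft_mask(code: u8) -> u8 {
    match code {
        GAP | PAD => code,
        _ => code | SOFT_MASK_BIT,
    }
}

/// Uppercase a base by clearing the mask bit; Gap and Pad pass through unchanged.
fn unmask(code: u8) -> u8 {
    match code {
        GAP | PAD => code,
        _ => code & !SOFT_MASK_BIT,
    }
}
```

With these, soft_mask(0b0000) gives 0b0100 (A to a), and unmask(0b1110) gives 0b1010, which is one of the alternate codes for N.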

The obvious extension (and what I actually have a use-case for) is masked IUPAC sequences. Something like

#[derive(Clone, Copy, Debug, PartialEq, Codec)]
#[bits(5)]
#[repr(u8)]
pub enum MaskedIupac {
    A = 0b01000,
    C = 0b00100,
    G = 0b00010,
    T = 0b00001,
    R = 0b01010,
    Y = 0b00101,
    S = 0b00110,
    W = 0b01001,
    K = 0b00011,
    M = 0b01100,
    B = 0b00111,
    D = 0b01011,
    H = 0b01101,
    V = 0b01110,
    N = 0b01111,
    #[display('a')]
    ASoftMasked = 0b11000,
    #[display('c')]
    CSoftMasked = 0b10100,
    #[display('g')]
    GSoftMasked = 0b10010,
    #[display('t')]
    TSoftMasked = 0b10001,
    #[display('r')]
    RSoftMasked = 0b11010,
    #[display('y')]
    YSoftMasked = 0b10101,
    #[display('s')]
    SSoftMasked = 0b10110,
    #[display('w')]
    WSoftMasked = 0b11001,
    #[display('k')]
    KSoftMasked = 0b10011,
    #[display('m')]
    MSoftMasked = 0b11100,
    #[display('b')]
    BSoftMasked = 0b10111,
    #[display('d')]
    DSoftMasked = 0b11011,
    #[display('h')]
    HSoftMasked = 0b11101,
    #[display('v')]
    VSoftMasked = 0b11110,
    #[display('n')]
    NSoftMasked = 0b11111,
    #[display('-')]
    Gap = 0b00000,
    #[display('.')]
    Pad = 0b10000,
}
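
In this 5-bit layout the low four bits act as a set of bases (A = 0b1000, C = 0b0100, G = 0b0010, T = 0b0001) and the top bit acts as the soft-mask flag, so ambiguity codes are bitwise unions: R is A | G. A rough sketch of the operations this gives for free, again on raw u8 codes with hypothetical helper names that are not part of bio-seq. Note that Pad (0b10000) has the flag bit set with an empty base set, so a plain flag test reports it as masked:

```rust
/// Top bit of the 5-bit code flags a soft-masked (lowercase) symbol.
const SOFT_MASK_BIT: u8 = 0b10000;
/// Low four bits hold the set of bases the symbol may stand for.
const BASE_BITS: u8 = 0b01111;

/// True when the flag bit is set (this includes Pad, 0b10000).
fn is_soft_masked(code: u8) -> bool {
    (code & SOFT_MASK_BIT) != 0
}

/// The set of bases a code stands for, ignoring masking.
fn base_set(code: u8) -> u8 {
    code & BASE_BITS
}

/// Two symbols are compatible when their base sets overlap.
/// Gap and Pad have empty base sets, so they are compatible with nothing.
fn compatible(x: u8, y: u8) -> bool {
    (base_set(x) & base_set(y)) != 0
}

/// Union of two symbols. Assumption: the result is masked if either input is.
fn union(x: u8, y: u8) -> u8 {
    x | y
}
```

For example, union(0b01000, 0b00010) is 0b01010 (A and G give R), and compatible(0b01111, 0b10001) is true because N overlaps with a masked t.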

Finally, one could add 3-bit representations of HardMaskedDna (which allows N, and maybe -/., but not lowercase letters) and 3-bit SoftMaskedDna (which allows acgt, but not N or -/.).

jeff-k commented 2 weeks ago

This is great and it would fit in perfectly with this project.

Maybe with the 6 extra codes afforded by the 4-bit representation of MaskedDna we can build in some clever encoding that gives us cheap complement and/or reverse operations. I'm thinking that bitwise-xor or reversing the bitstrings could perform these in one bitwise operation. For example, if A is 1000, a is 0111, T is 0001, and t is 1110, then reversing the bits complements a base (and reverse complements a Seq<MaskedDna>), and xoring the bits with 1111 toggles the mask. It works with G and C too, but will require some thinking about the gaps and Ns. For some workloads this encoding could be faster than the 2-bit encoding.
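
To make the idea concrete, here is a small sketch of that encoding on raw u8 codes. The codes for A, a, T, and t are the ones given above; G = 0100 and C = 0010 (and their inverted lowercase forms) are one assumed assignment that satisfies the same property, and all function names are hypothetical. Gaps and Ns are left out, since their codes are still undecided:

```rust
// Codes from the comment above.
const A: u8 = 0b1000;
const T: u8 = 0b0001;
const A_SOFT: u8 = 0b0111;
const T_SOFT: u8 = 0b1110;
// Assumed assignment for G and C; the thread only says "it works with G and C too".
const G: u8 = 0b0100;
const C: u8 = 0b0010;
const G_SOFT: u8 = 0b1011;
const C_SOFT: u8 = 0b1101;

/// Complement one 4-bit code by reversing its bit order.
fn complement(code: u8) -> u8 {
    // reverse_bits flips all 8 bits, so shift the nibble back down.
    code.reverse_bits() >> 4
}

/// Switch between upper- and lowercase by inverting all four bits.
fn toggle_mask(code: u8) -> u8 {
    code ^ 0b1111
}

/// Pack four codes into a u16, first base in the lowest nibble.
fn pack4(codes: [u8; 4]) -> u16 {
    codes
        .iter()
        .enumerate()
        .fold(0u16, |word, (i, &c)| word | (((c & 0xF) as u16) << (4 * i)))
}

/// Unpack a u16 into four codes, lowest nibble first.
fn unpack4(word: u16) -> [u8; 4] {
    let mut out = [0u8; 4];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = ((word >> (4 * i)) & 0xF) as u8;
    }
    out
}

/// Reverse complement four packed bases with a single bit reversal:
/// it reverses the nibble order and the bits inside each nibble at once.
fn reverse_complement4(word: u16) -> u16 {
    word.reverse_bits()
}
```

Under these assumptions, packing ACgt and calling reverse_complement4 yields acGT, with the masking carried along unchanged for each base.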

I suppose there could be something clever to do with MaskedIupac but it might require 6 bits. The 5-bit encoding you have is quite appealingly a complete 32 characters.

If you want to add this codec in then a pull request is certainly welcome.

The one API thing to consider is that maybe we want a masked module, and could call these masked::Dna and masked::Iupac. That would fit in with the idea I had to use text::{Dna,Iupac,Amino} for the 8-bit encodings. I'm open to feedback on this design.