jeff-k / bio-seq

Bit packed and well-typed biological sequences
MIT License
21 stars 2 forks source link

Masked sequences #5

Open J-Wall opened 2 months ago

J-Wall commented 2 months ago

Thanks for sharing this crate with the community Jeff. It looks really clean!

I have some suggestions for additional codecs to support masked sequences. Of course, I can easily just implement these in my own application thanks to your Codec derive macro, but following the philosophy of this crate, it might make sense to just implement them once for if others want to use them.

Often, we deal with DNA sequences which can be soft-masked, represented by lowercase, e.g.

ACGTaaatACGT
    |--| <- soft-masked region

or hard-masked, represented by N/n. Sometimes both at the same time. Therefore we have an alphabet of 10 characters; A, C, G, T, N, a, c, g, t, and n. 2⁴ would cover that with 6 spare, so you could throw in gaps - and padding . characters:

use bio_seq::codec::Codec;
use bio_seq_derive::Codec;

#[derive(Clone, Copy, Debug, PartialEq, Codec)]
#[bits(4)]
#[repr(u8)]
pub enum MaskedDna {
    A = 0b0000,
    C = 0b0001,
    G = 0b0010,
    T = 0b0011,
    #[alt(0b1010, 0b1001)]
    N = 0b1000,
    #[display('a')]
    ASoftMasked = 0b0100,
    #[display('c')]
    CSoftMasked = 0b0101,
    #[display('g')]
    GSoftMasked = 0b0110,
    #[display('t')]
    TSoftMasked = 0b0111,
    #[display('n')]
    #[alt(0b1110, 0b1101)]
    NSoftMasked = 0b1100,
    #[display('-')]
    Gap = 0b1011,
    #[display('.')]
    Pad = 0b1111,
}

The obvious extension (and what I actually have a use-case for) is masked IUPAC sequences. Something like

#[derive(Clone, Copy, Debug, PartialEq, Codec)]
#[bits(5)]
#[repr(u8)]
pub enum MaskedIupac {
    A = 0b01000,
    C = 0b00100,
    G = 0b00010,
    T = 0b00001,
    R = 0b01010,
    Y = 0b00101,
    S = 0b00110,
    W = 0b01001,
    K = 0b00011,
    M = 0b01100,
    B = 0b00111,
    D = 0b01011,
    H = 0b01101,
    V = 0b01110,
    N = 0b01111,
    #[display('a')]
    ASoftMasked = 0b11000,
    #[display('c')]
    CSoftMasked = 0b10100,
    #[display('g')]
    GSoftMasked = 0b10010,
    #[display('t')]
    TSoftMasked = 0b10001,
    #[display('r')]
    RSoftMasked = 0b11010,
    #[display('y')]
    YSoftMasked = 0b10101,
    #[display('s')]
    SSoftMasked = 0b10110,
    #[display('w')]
    WSoftMasked = 0b11001,
    #[display('k')]
    KSoftMasked = 0b10011,
    #[display('m')]
    MSoftMasked = 0b11100,
    #[display('b')]
    BSoftMasked = 0b10111,
    #[display('d')]
    DSoftMasked = 0b11011,
    #[display('h')]
    HSoftMasked = 0b11101,
    #[display('v')]
    VSoftMasked = 0b11110,
    #[display('n')]
    NSoftMasked = 0b11111,
    #[display('-')]
    Gap = 0b00000,
    #[display('.')]
    Pad = 0b10000,
}

Finally, one could add 3-bit representations of HardMaskedDna (which allows N, and maybe -/. but not lowercase letters), and 3-bit SoftMaskedDna, which allows acgt, but not N or -/.)

jeff-k commented 2 months ago

This is great and it would fit in perfectly with this project.

Maybe with the 6 extra codes afforded by the 4-bit representation of MaskedDna we can build in some clever encoding that gives us cheap complement and/or reverse operations. I'm thinking that bitwise-xor or reversing the bitstrings could perform these in one bitwise operation, like if A is 1000, a is 0111, T is 0001, and t is 1110. Then reversing the bits complements them (and reverse complements a Seq<MaskedDna>) and xoring the bits masks them. It works with G and C too, but will require some thinking about the gaps and Ns. For some workloads this encoding could be faster than the 2-bit encoding.

I suppose there could be something clever to do with MaskedIupac but it might require 6 bits. The 5-bit encoding you have is quite appealingly a complete 32 characters.

If you want to add this codec in then a pull request is certainly welcome.

The one API thing to consider is that maybe we want a masked module, and could call these masked::Dna and masked::Iupac. That would fit in with the idea I had to use text::{Dna,Iupac,Amino} for the 8-bit encodings. I'm open to feedback on this design.

jeff-k commented 2 months ago

Hey @J-Wall I've made a commit that has a masked encoding module. Here's what usage looks like:

#[cfg(test)]
mod tests {
    use crate::codec::masked;
    use crate::prelude::*;

    #[test]
    fn mask_sequence() {
        let seq = Seq::<masked::Dna>::try_from("A.TCGCgtcataN--A").unwrap();

        assert_eq!(seq.mask().to_string(), "a.tcgcGTCATAn--a".to_string());
    }

    #[test]
    fn masked_revcomp() {
        let seq = Seq::<masked::Dna>::try_from("A.TCGCgtcataN--A").unwrap();

        assert_eq!(seq.revcomp().to_string(), "T--NtatgacGCGA.T".to_string());
    }
}

If instead of sequential bit sequences we choose these patterns, we get efficient revcomp and mask operations by reversing or inverting the Seq<masked::Dna> bitvecs too:

IMG_2833

It's quite lucky: there are 2 groups of bit sequences with the symmetry that satisfies the square relationships we need for 4 nucleotides, one maskable group that's its own complement, N, and two groups that aren't maskable and equal to their own complements (-, .):

    N = 0b0000,
    #[display('n')]
    NMasked = 0b1111,

    #[display('-')]
    #[alt(0b0011)]
    Gap = 0b1100,

    #[display('.')]
    #[alt(0b0101)]
    Pad = 0b1010,

which leaves us with two extra symbols that complement to themselves but mask to eachother:

    #[display('?')]
    Unknown1 = 0b0110,

    #[display('!')]
    Unknown2 = 0b1001,

Any ideas what we could use them for?

I tried to have a think about masked IUPAC encodings but I got a headache. I think the 5-bit version makes the most sense. Have you tried it out?

J-Wall commented 2 months ago

Hi @jeff-k, this looks really cool.

Any ideas what we could use them for?

I tried to have a look online if there were any standards/commonly use characters. I think ? makes sense, maybe with X?

Regarding the API, I think maybe rename Seq<codec::masked::Dna>::mask to invert_mask, or something similar? Then .mask() and .unmask() could be one-direction conversions

I tried to have a think about masked IUPAC encodings but I got a headache. I think the 5-bit version makes the most sense. Have you tried it out?

I wrote the relationships out, and with a 5-bit representation, there are 6 "square-relationship groups" and 4 "maskable self-complementing groups". Unfortunately the IUPAC alphabet would require 7 "square-relationship groups" and (at least) 2 "maskable self-complementing groups" (N/n & -/.), so it doesn't really fit with simple bit-negation and reversing. It's very compact though, providing you're happy to disallow either softmasked ns or gaps so it all fits.

I also wrote out the 6-bit relationships as well. There you have 13 "square groups" and 6 "maskable self-complement groups". So you easily fit it in to that with enough spare to also include U (with an RNA version of A) and anything else which is in the masked DNA codec

Reduced IUPAC

This exercise got me thinking about if there was an even more compact representation of the masked IUPAC alphabet. This is a bit lossy and might be a bit specific to my application, but bear with me: I think some software treats the 3-fold degenerate bases (BDHV) as N. If you do that, you only need to represent ACGTRYSWKMN-. Now (in my application at least, not sure if others would find this useful), the degenerate bases only occur in soft-masked regions, so for my purposes, if I'm happy to drop the 3-fold degenerate bases, the alphabet ACGTNacgtryswkm- (16 symbols) would be sufficient, which means I could fit it in 4 bits. Obviously I wouldn't get the bitwise operation optimisations on revcomp and masking, but in my mind space is more expensive than compute (and it's not like these are particularly expensive transformations in the first place).

jeff-k commented 2 months ago

Yeah, I guess inverting the mask isn't actually as useful as setting it or unsetting it over a region. So we'll want to implement mask for mutable slices, too.

What's appealing about the 5-bit alphabet is that the leading bit sets the mask, so implementing mask/unmask is just a matter of flipping it and we don't need a lookup table or 32 arm match statement. I can merge it in if you have a branch?

I really like the idea of the degenerate encodings. Actually, I've been playing with a 1-bit encoding of S (G/C) and W (A/T) for minimiser/species identification stuff (not to mention quite an efficient complement operation!) I wouldn't mind having a feature gate for some experimental encodings. I'm sure other people could have fun with these.

Similarly for amino acids. If we encode amino acids into 8 categories (hydrophobic/positive charge etc.) that's also 1-bit per base. Less than 1-bit with 4 categories. I think some crystallography data only resolves residues at that level of detail.

Anyway, a big motivation for this crate is to be able to swap out encodings at leisure so I'm all for showing off the possibilities.

jeff-k commented 2 months ago

For the 5-bit masked IUPAC encoding, if we put the mask bit in the centre of the sequence we'll a cheap reverse complement operation:

A C mask flag G T
A 1 0 0 0 0
C 0 1 0 0 0
G 0 0 0 1 0
T 0 0 0 0 1
a 1 0 1 0 0
c 0 1 1 0 0
g 0 0 1 1 0
t 0 0 1 0 1
S 0 1 0 1 0
W 1 0 0 0 1

although I guess the appropriate thing to do would be to set up a proper benchmark suite before going further into this kind of optimisation