Open J-Wall opened 2 weeks ago
This is great and it would fit in perfectly with this project.
Maybe with the 6 extra codes afforded by the 4-bit representation of `MaskedDna` we can build in some clever encoding that gives us cheap complement and/or reverse operations. I'm thinking that XOR or reversing the bitstrings could perform these in one bitwise operation. For example, if `A` is `1000`, `a` is `0111`, `T` is `0001`, and `t` is `1110`, then reversing the bits complements them (and reverse complements a `Seq<MaskedDna>`) and XORing the bits with `1111` masks them. It works with `G` and `C` too, but will require some thinking about the gaps and `N`s. For some workloads this encoding could be faster than the 2-bit encoding.
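
As a rough sketch of the idea (not code from the crate), here is how the bit tricks could look on plain integers. The `A` and `T` codes are the ones from the example; the `C` and `G` assignments, the function names, and the choice to pack the first base into the lowest nibble of a `u64` are all assumptions made for illustration.

```rust
// Hypothetical 4-bit codes. A and T follow the example above; the C and G
// values are an assumption, chosen so that they are bit-reversals of each other.
const A: u8 = 0b1000;
const C: u8 = 0b0100;
const G: u8 = 0b0010;
const T: u8 = 0b0001;

/// Complement a single 4-bit code by reversing its four bits.
fn complement(code: u8) -> u8 {
    // `reverse_bits` reverses all 8 bits, leaving the nibble in the high half.
    code.reverse_bits() >> 4
}

/// Toggle soft-masking (upper <-> lower case) by flipping all four bits.
fn toggle_mask(code: u8) -> u8 {
    code ^ 0b1111
}

/// Pack up to 16 codes into a u64, first base in the lowest nibble.
fn pack(codes: &[u8]) -> u64 {
    assert!(codes.len() <= 16);
    codes
        .iter()
        .enumerate()
        .fold(0u64, |word, (i, &c)| word | (u64::from(c) << (4 * i)))
}

/// Reverse-complement `len` packed codes with one bit reversal and one shift.
fn reverse_complement(word: u64, len: usize) -> u64 {
    assert!(len <= 16);
    if len == 0 {
        return 0;
    }
    // Reversing the whole word reverses both the base order and the bits
    // inside each nibble; the shift moves the result back to the low end.
    word.reverse_bits() >> (64 - 4 * len)
}

fn main() {
    let g = toggle_mask(G); // soft-masked g
    let c = toggle_mask(C); // soft-masked c
    assert_eq!(complement(A), T);
    assert_eq!(complement(g), c);

    // AACg reverse-complements to cGTT.
    let fwd = pack(&[A, A, C, g]);
    let rev = pack(&[c, G, T, T]);
    assert_eq!(reverse_complement(fwd, 4), rev);
    println!("{:016b} -> {:016b}", fwd, rev);
}
```

With these assignments, eight of the sixteen codes are still free. The palindromic pair `0110`/`1001` might be a natural candidate for `N`/`n`, since bit reversal leaves both unchanged and flipping all bits swaps them, but that is only one option.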
I suppose there could be something clever to do with `MaskedIupac`, but it might require 6 bits. The 5-bit encoding you have is quite appealing in that it uses a complete 32 characters.
If you want to add this codec in then a pull request is certainly welcome.
The one API thing to consider is that maybe we want a `masked` module, and could call these `masked::Dna` and `masked::Iupac`. That would fit in with the idea I had to use `text::{Dna,Iupac,Amino}` for the 8-bit encodings. I'm open to feedback on this design.
Thanks for sharing this crate with the community, Jeff. It looks really clean!
I have some suggestions for additional codecs to support masked sequences. Of course, I can easily implement these in my own application thanks to your `Codec` derive macro, but following the philosophy of this crate, it might make sense to implement them once in case others want to use them.

Often, we deal with DNA sequences which can be soft-masked, represented by lowercase letters, or hard-masked, represented by `N`/`n`. Sometimes both at the same time. Therefore we have an alphabet of 10 characters: `A`, `C`, `G`, `T`, `N`, `a`, `c`, `g`, `t`, and `n`. 2⁴ would cover that with 6 spare, so you could throw in gap (`-`) and padding (`.`) characters.

The obvious extension (and what I actually have a use case for) is masked IUPAC sequences.
Finally, one could add 3-bit representations of `HardMaskedDna` (which allows `N`, and maybe `-`/`.`, but not lowercase letters) and `SoftMaskedDna` (which allows `acgt`, but not `N` or `-`/`.`).
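
To check the bit widths suggested above, here is a small sketch that counts the symbols in each proposed alphabet and works out the minimum code width and the number of spare codes. The helper names `bits_needed` and `spare_codes` and the alphabet strings are assumptions for illustration (for example, I'm assuming `HardMaskedDna` would include both `-` and `.`); none of this is part of the crate.

```rust
/// Smallest number of bits that gives each of `n` symbols a distinct code.
fn bits_needed(n: u32) -> u32 {
    assert!(n > 0);
    // ceil(log2(n)) is the bit length of n - 1.
    u32::BITS - (n - 1).leading_zeros()
}

/// Number of unused codes left over at that bit width.
fn spare_codes(n: u32) -> u32 {
    (1u32 << bits_needed(n)) - n
}

fn main() {
    // Hypothetical alphabets, written out as the symbols each would accept.
    let alphabets = [
        ("MaskedDna", "ACGTNacgtn"),
        ("MaskedDna with gap and padding", "ACGTNacgtn-."),
        ("HardMaskedDna", "ACGTN-."),
        ("SoftMaskedDna", "ACGTacgt"),
    ];
    for (name, symbols) in alphabets {
        let n = symbols.len() as u32;
        println!(
            "{name}: {n} symbols, {} bits, {} spare",
            bits_needed(n),
            spare_codes(n)
        );
    }
}
```

Under those assumptions, the 10-character alphabet needs 4 bits with 6 codes spare, while both 3-bit variants fit, with `SoftMaskedDna` using all 8 codes.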