neherlab / nextalign

🧬 Viral genome reference alignment
MIT License

Use integer-based strings for sequences #9

Closed ivan-aksamentov closed 3 years ago

ivan-aksamentov commented 3 years ago

This PR converts plain string sequences to a new custom string type in which every character is encoded using these tables:

https://github.com/neherlab/nextalign/blob/90840676937206feb9f39f72ebdf6681e5326bf4/packages/nextalign/src/alphabet/nucleotides.h#L8-L27

https://github.com/neherlab/nextalign/blob/90840676937206feb9f39f72ebdf6681e5326bf4/packages/nextalign/src/alphabet/aminoacids.h#L12-L42
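As a rough illustration of the idea (not the code from this PR), the sketch below encodes a plain string into a vector of small integer codes and decodes it back. The alphabet, its ordering, and the names `kAlphabet`, `EncodedSequence`, `encode` and `decode` are all hypothetical; the real tables are the ones linked above, and the real custom string type may differ from a `std::vector`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical alphabet, for illustration only. The actual tables are in
// nucleotides.h and aminoacids.h and may use different characters and order.
constexpr std::array<char, 6> kAlphabet = {'A', 'C', 'G', 'T', 'N', '-'};
constexpr std::uint8_t kInvalid = 0xFF;

// Assumed stand-in for the custom string type: one small integer per character.
using EncodedSequence = std::vector<std::uint8_t>;

// Builds a 256-entry table mapping each possible char to its code (or kInvalid).
std::array<std::uint8_t, 256> makeEncodeTable() {
  std::array<std::uint8_t, 256> table{};
  table.fill(kInvalid);
  for (std::size_t code = 0; code < kAlphabet.size(); ++code) {
    table[static_cast<unsigned char>(kAlphabet[code])] = static_cast<std::uint8_t>(code);
  }
  return table;
}

// Converts a plain string to codes once, up front, so that later lookups
// need no per-character conversion.
EncodedSequence encode(const std::string& seq) {
  static const std::array<std::uint8_t, 256> table = makeEncodeTable();
  EncodedSequence out;
  out.reserve(seq.size());
  for (char c : seq) {
    const std::uint8_t code = table[static_cast<unsigned char>(c)];
    if (code == kInvalid) {
      throw std::invalid_argument("unknown character in sequence");
    }
    out.push_back(code);
  }
  return out;
}

// Converts codes back to a plain string, e.g. for output.
std::string decode(const EncodedSequence& seq) {
  std::string out;
  out.reserve(seq.size());
  for (std::uint8_t code : seq) {
    out.push_back(kAlphabet.at(code));  // .at() throws on an out-of-range code
  }
  return out;
}
```

In this sketch the conversion cost is paid once per sequence at input and output time, rather than on every comparison.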

This puts all used characters in a contiguous range, so character codes can be used directly as array indices when looking up match scores in these matrices:

https://github.com/neherlab/nextalign/blob/90840676937206feb9f39f72ebdf6681e5326bf4/packages/nextalign/src/matchNuc.cpp#L10-L30

https://github.com/neherlab/nextalign/blob/90840676937206feb9f39f72ebdf6681e5326bf4/packages/nextalign/src/matchAa.cpp#L10-L41

This should be much faster because it avoids a character conversion on every lookup; however, this still needs to be benchmarked.
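A minimal sketch of what such a lookup could look like once sequences are already integer-encoded. The codes (`NUC_A` etc.), the score values in `kNucScores`, and the function names are assumptions made for illustration; the real matrices are in the linked `matchNuc.cpp` and `matchAa.cpp`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical contiguous codes; the real encoding is defined in nucleotides.h.
constexpr std::uint8_t NUC_A = 0;
constexpr std::uint8_t NUC_C = 1;
constexpr std::uint8_t NUC_G = 2;
constexpr std::uint8_t NUC_T = 3;
constexpr std::uint8_t NUC_N = 4;
constexpr std::size_t NUC_COUNT = 5;

// Hypothetical score matrix: 1 for a match, 0 for a mismatch, with N treated
// as matching anything. These are not the values used by nextalign.
constexpr int kNucScores[NUC_COUNT][NUC_COUNT] = {
    /* A */ {1, 0, 0, 0, 1},
    /* C */ {0, 1, 0, 0, 1},
    /* G */ {0, 0, 1, 0, 1},
    /* T */ {0, 0, 0, 1, 1},
    /* N */ {1, 1, 1, 1, 1},
};

// The codes index the matrix directly: no char-to-index conversion here.
// Callers must pass codes smaller than NUC_COUNT.
inline int lookupNucScore(std::uint8_t x, std::uint8_t y) {
  return kNucScores[x][y];
}

// Sums position-wise scores over the common prefix of two encoded sequences.
int sumNucScores(const std::vector<std::uint8_t>& query,
                 const std::vector<std::uint8_t>& ref) {
  const std::size_t n = std::min(query.size(), ref.size());
  int total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    total += lookupNucScore(query[i], ref[i]);
  }
  return total;
}
```

With plain `char` sequences, each call would first have to map both characters to matrix indices; with pre-encoded sequences that step disappears from the inner loop. Whether the difference is measurable in practice is exactly what the benchmark would need to show.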