ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
136 stars 17 forks source link

Base level alignment #26

Open ksahlin opened 2 years ago

ksahlin commented 2 years ago

Note to developer:

The extension step (nucleotide level alignment) is the bottleneck in strobealign. There are different three ways to reduce this:

  1. Direction 1 (change the alignment module):
    1. Change to base level alignment with WFA (WFA publ) as is done in Accelalign
  2. Direction 2 Speedup the current module used (SSW):
    1. By using 8bit slots in alignment matrix?
    2. By not computing alignment twice - is ssw does this?
  3. Direction 3 (partitioned SSW)
    1. Finish implementing partitioned SW (split alignment into several small hamming or SW alignments) if seeds are in middle.
ksahlin commented 1 year ago

I think this is worth exploring WFA as we are currently relying on SSW which has its issues being a local aligner as mentioned in issue #54. The method seems to have great performance, see table2 in WFA paper for timings to e.g. SeqAn and ksw2 in Table 2. Furthermore, the maturity of the implementation is here with WFA2 in terms of providing different penalty models (including dual gap cost penalties!), traceback cigar etc,

Also in strobealign the extension step is over 50% of the runtime for many biological datasets, see 'aln' field for BIO150 and BIO250 in attached figure for extension with SSW.

image
ekg commented 1 year ago

This will also make it easy to scale to longer sequences, provided your seed chaining can do it. BiWFA uses order divergence space and is consequently very cache coherent and actually fast even for ~500bp sequences.

ksahlin commented 1 year ago

That's a good point. As mentioned in https://github.com/ksahlin/strobealign/issues/24, extending to alignments of long reads is one objective. Strobealign's seeds should be very suitable for long reads, so BiWFA may be the way to go already at start. We are currently exploring WFA (https://github.com/ksahlin/strobealign/pull/229), but we have yet to find a way to use the library efficiently compared to SSW.