A few questions about my application (changing attributes, ultralow endsfree, endsfree API, max_score heuristics)

feldroop commented 6 months ago

Hi!

First of all, thank you a lot for providing this intriguing and well-documented piece of software.

I am working on an experimental longread aligner and currently testing WFA2 as the alignment backend. While doing so, I encountered a few issues.

As recommended, I want to reuse the wavefront_aligner_t objects. However, I need to change some of the attributes before each alignment invocation. Which, if any, of the attributes are safe to change on the wavefront_aligner_t object (i.e. after calling wavefront_aligner_new)? The most important one for me are the _begin/end_free values of the alignment_span. Next would be the alignment_scope, but here I could also store two aligners instead of changing. Finally, I was thinking about varying the memory_mode depending on the size of the input sequences. From looking at the code of wavefront_aligner_new, it seems like the _begin/end_free might be safe to change after construction, the other mentioned attributes not. Is this correct?
The longreads in my data sets are up to 600K bases long and I need ends free alignment. I saw in another issue that you recommended the ultralow memory mode for input sequences of this size. There is no urgency whatsoever, but it would be nice for me to have the ultralow mode available for ends free alignments.
What I want to do is a glocal alignment, like the below example from your README. For my application, I need two pieces of information.
- I need the index of the text, where the actual alignment begins, without free end gaps. In the example, it would be 13. I initially expexted the value wf_aligner->cigar->begin_offset to be exactly this, but it is 0 in the example (end_offset accordingly is 50). So my question is, what exactly are the begin/end_offset values and how do I (conveniently) get the value I want? Is there an easier option than looking at the cigar and doing the offset calculations?
- I need the CIGAR, also excluding free end insertions. In the example, it would be "9=1X12=", but the API returns "13I9=1X12=15I". It is of course possible to trim the cigar by hand, but it would be nicer to have something like cigar_sprint_SAM_CIGAR_endsfree. I found the cigar_maxtrim_gap_linear function, but it only trims the gaps from the end.

    PATTERN    -------------AATTTAAGTCTAGGCTACTTTC---------------
                            ||||||||| ||||||||||||
    TEXT       ACGACTACTACGAAATTTAAGTATAGGCTACTTTCCGTACGTACGTACGT

I need the optimal alignment, but I have an upper bound on the number of allowed errors (=maximum score in edit distance). I did not fully investigate the heuristic options yet, but I believe a banded alignment would be fine for this case, maybe even adaptive? Again, there is no urgency whatsoever, but it would be lovely to have a convenience config attribute like max_score. It would automatically use all heuristics that preserve the optimal score below the configured max_score. Just an idea from lazy me :D

feldroop commented 5 months ago

Hey, sorry for closing without a message. I was just about to write this explanation. The scope of my project shifted a bit and for now, I don't need to use this library anymore. But I'm still interested in its development and happy to discuss the above points.

smarco commented 5 months ago

Hi, @feldroop.

Sorry for not answering sooner, and thank you for your nice words. Your experimental long-read aligner sounds really cool. I am familiar with many of those ideas, and I really love the direction you are taking.

If you are still interested, let's go through your list:

Yes, I should document which ones are safe to be changed (TODO). But surely you can change begin/end_free values of the alignment_span. However, you cannot switch from score-only to alignment nor change the memory-mode (this requires a new wavefront_aligner_t object).
Ok, noted. Nobody asked before for BiWFA + ends-free. Conceptually, there is no problem with implementing it. I can put that on TODO (if there is no rush and you are still interested).

So my question is, what exactly are the begin/end_offset values and how do I (conveniently) get the value I want? Is there an easier option than looking at the cigar and doing the offset calculations?

(a) wf_aligner->cigar->begin_offset are the offsets to the internal buffer that stores the CIGAR operations (i.e., MXID). So, nothing to do with the offset where the glocal alignment starts/ends. In fact, WFA resulting CIGAR is always described from the beginning of the query/pattern and text to their end. Glocal mode (aka ends-free or semi-global) just adjusts the algorithm's alignment-model penalties, so leading & trailing deletions/insertions come for free (according to the user's configuration).

but it would be nicer to have something like cigar_sprint_SAM_CIGAR_endsfree. I found the cigar_maxtrim_gap_linear function, but it only trims the gaps from the end.

(b) There is no such function implemented at the moment. Sorry. But it should be easy to implement such function (whether on the WFA-lib side or user's code side).
Yes, I can do that. Using edit-distance, a fixed band (implemented) will comply with your restriction. However, with other heuristics, making such optimality guarantees is difficult (i.e., I have to think).

smarco commented 5 months ago

Hey, sorry for closing without a message. I was just about to write this explanation. The scope of my project shifted a bit and for now, I don't need to use this library anymore. But I'm still interested in its development and happy to discuss the above points.

It is ok. No problem at all :-)

smarco / WFA2-lib

A few questions about my application (changing attributes, ultralow endsfree, endsfree API, max_score heuristics) #93