ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

End coordinate of match in PAF output #333

Closed marcelm closed 7 months ago

marcelm commented 10 months ago

PAF output in strobealign uses these two values for the end position of the match on the query and reference:

The *_prev_hit_startpos attributes describe the start position of the last hit in the NAM. Adding k to that gives the end position of the first strobe of the last strobemer. So apparently, the intention here is to not include the gap between the two strobes in the match. Asked about why this i the case, Kristoffer replied:

This is probably a bug. If there was some intent for this I have forgotten it. Also, I cannot find a reason for it being like that. If you cannot think of any reason either, I vote for changing it to end_pos.

If the attributes are not needed for PAF output, it would be possible (with some other adjustments) to remove them entirely from the Nam struct.

ksahlin commented 10 months ago

IIRC they are used to assure that we are not merging 'non-colinear' matches when going from matches to NAMs. We keep the start position of the last match we merged into the nam on the reference, when we consider the next match it has to 'move forward' on both query and reference from the last match we merged into the NAM. This is guaranteed on the query, because of how we traverse matches n the query, but not guaranteed on the reference.

But the issue is correct in that it is probably a bug in the PAF end coordinate output. Also, for above mentioned reason nam.query_prev_hit_startpos could probably be removed (unless it has another purpose in the code).

marcelm commented 10 months ago

IIRC they are used to assure that we are not merging 'non-colinear' matches when going from matches to NAMs. We keep the start position of the last match we merged into the nam on the reference

Yes, that’s roughly how I understood the code. I played around with this particular part and it makes virtually no difference in accuracy if the comparison is changed to refer to .query_start instead of .query_prev_hit_startpos. But I’m still working on this.