ksahlin / strobemers

A repository for generating strobemers and evalaution
71 stars 11 forks source link

How to adjust the searching window for positions near the end of the sequence #2

Closed shenwei356 closed 3 years ago

shenwei356 commented 3 years ago

Hi, I'm a little confused about how to adjust the searching window for positions near the end of the sequence.

I know that these positions need to be kept for producing an equal number of strobemers as k-mers. But in sequence mapping/searching scenarios, unlike k-mers, the strobemers near the end of the sequences would be different from these in the reference sequence, because of the incomplete searching window.

In the function seq_to_randstrobes2_iter:

window_p_start = p + strobe_w_min_offset if p + strobe_w_max_offset <= len(hash_seq_list) else max( (p + strobe_w_min_offset) -  (p + strobe_w_max_offset - len(hash_seq_list)), p )
window_p_end = min(p + strobe_w_max_offset, len(hash_seq_list))

For positions near the end of the sequence (p + strobe_w_max_offset > len(hash_seq_list)),

max( (p + strobe_w_min_offset) -  (p + strobe_w_max_offset - len(hash_seq_list)), p )

equals to

max( len(hash_seq_list) - (strobe_w_max_offset - strobe_w_in_offset), p)

As I understand, it keeps the size of the searching window and moves the window to the left (box A in the figure below), am I right? Have you tried the way in box B?

Screenshot_20210413_230241

Besides, for order 3, the windows of m2 and m3 would have some duplicated regions, is this OK?

Screenshot_20210413_230706

shenwei356 commented 3 years ago

For computing the searching window of m2, I'm not sure if there's a bug.

https://github.com/ksahlin/strobemers/blob/main/modules/indexing.py#L189

window_p1_start = p + strobe_w_min_offset if p + 2*strobe_w_max_offset <= len(hash_seq_list) else max(p,  len(hash_seq_list)  + 2*(strobe_w_min_offset - strobe_w_max_offset))

Does the condition if p + 2*strobe_w_max_offset <= len(hash_seq_list) should be if p + strobe_w_max_offset <= len(hash_seq_list).

shenwei356 commented 3 years ago

My brain is down, would you please help to mark the correct windows in the illustration.xlsx. Thank you so much.

ksahlin commented 3 years ago

Hehe sure. Overall I don't know if it matters much, but nevertheless, it is good to be consistent.

I implemented alternative A in your first illustration (up to any offset bugs that you may find). That is, first the offset (w_min) decreases, then the window gets narrower at the very end.

For three or more, I implemented (and suggested in the paper) to scale down all segments equally with floor operator. With a "segment" here, I mean the range between the previous w_max to current w_max (i.e., a segment is w_max bases).

In practice, this leads to what I've drawn out in the figure below in the highlighted box. Note that as soon as there is not sufficient room, all the windows are divided. the number to the left shows the total number of positions to the last valid strobe position. there are two segments, to we get 9//2 = 4, 8//2 =4, 7//2=3, 6//2=3.

However, a final remark. If all this "extra checking" is detrimental to runtime, I think speed should be prioritized. One could consider stopping to generate strobemers whenever when one needs rescaling (or any other heuristic solution). Essentially, I don't think my particular scaling is significantly better than any other solution.

Screen Shot 2021-04-13 at 7 11 34 PM

shenwei356 commented 3 years ago

Thank you for the details.

ksahlin commented 3 years ago

@shenwei356 - It is amazing how I missed your main concern in this issue: "the strobemers near the end of the sequences would be different from these in the reference sequence, because of the incomplete searching window.", which can matter a lot for short sequences. This is a very good point that was also just raised by another user @lutfia95 which made me revisit this.

My opinion is now that one should pick whichever strategy that maximizes potential matches in the end, or alternatively stop before any modification is needed. Given this, your strategy B would be preferable.

Thanks for pointing this out and sorry for my initial answer which is not a good answer for read mapping applications.