ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Multi-context seeds plus fixes and optimized parameters #426

Open marcelm opened 1 month ago

marcelm commented 1 month ago

I realized only now that #388 is incomplete and that @ksahlin’s updates were made in a separate branch. We need a place to review this mcs-optimized-parameters branch and someplace where we have a press "Merge", so this is a separate PR that supersedes #388.

I’m making quite some changes to this branch (squashing commits, changing commit messages etc.). If anyone wants the original branch without my modifications, it is available as mcs-optimized-parameters-backup.

Original parameter optimization was done with the commit that has the description "Fix so that partial rescue hits are added properly". The commit hash hash changed due to history rewriting.

To Do

marcelm commented 1 month ago

Commit 21599d9f0d1bd26223d43ffc6403bb1e838d74bf "Possibly fix redundant alignment sites for symmetric multi context seeds." also breaks a test. After it, the rescuable.43 read in the tiny test dataset is no longer rescuable. Shortened diff of read 1:

-rescuable.43   83  NC_001422.1  3137  60  120S14=1X4=1X9=1X4=1X4=1X9=1X4=1X4=1X4=1X9=1X19=1X4=1X4=1X4=1X4=1X4=1X4=1X9=1X4=1X9=1X4=1X9=1X4=1X4=1X4=1X      =       2955    -362    (sequence)       (qual)       NM:i:25 AS:i:120        RG:Z:1
+rescuable.43   69  NC_001422.1  2955  0  *  =  2955  0       (sequence)  (qual)  RG:Z:1
marcelm commented 3 weeks ago

Commit 21599d9 "Possibly fix redundant alignment sites for symmetric multi context seeds." also breaks a test. After it, the rescuable.43 read in the tiny test dataset is no longer rescuable. [...]

I looked into this. The problem is that this heuristic check now returns false: https://github.com/ksahlin/strobealign/blob/3a97f6b817824235a36ca8f9c5710ee533d3ad56/src/aln.cpp#L482

This happens because $k$ was increased from 22 to 25. With $k=22$, there were shared $k$-mers, but there are no longer any with $k=25$.

I think the easiest fix for now is to pass -k 22 on the command-line when running that particular test, but maybe it also makes sense to reconsider how the heuristic works (perhaps it could depend less on $k$). But that would be something independent of this PR.

ksahlin commented 3 weeks ago

Yes, sounds good to reconsider the heuristic or make it independent of k.