Open ksahlin opened 3 months ago
In the above comment, I had benchmarked with asymmetric seeds and concluded no significant advantage. Asymmetric seeds currently have other issues, which doesn't give a clean benchmark.
Therefore, I decided to implement the above described 'slack' based off the main branch (which uses symmetric seeds).
Below I provide a benchmark on a small chrX+Y dataset using strobealign-symmetric
(current main) vs strobealign-symmetric-no-rescue-slack
which:
if (nams.empty())
instead of if (nams.empty() || nonrepetitive_fraction < 0.7)
In single end mode I also include strobealign-symmetric-no-rescue
with only includes the no-rescue 'feature'.
I observe quite drastic accuracy improvements on all but the 50nt reads with the strobealign-symmetric-no-rescue-slack
. In addition, the no-rescue version is always faster.
It should be carefully noted that this is a small dataset. The rescue might do a better job on a full human genome(?).
tool,ref,%-aligned,accuracy,time(sec),Mem(MB)
chm13_chrX_Y,50,strobealign-symmetric,align,99.929,**78.192**,0,14.09,1192.62
chm13_chrX_Y,50,strobealign-symmetric-no-rescue-slack,align,99.9245,78.1155,0,7.62,1192.88
chm13_chrX_Y,75,strobealign-symmetric,align,100.0,79.8775,0,12.2,1193.8
chm13_chrX_Y,75,strobealign-symmetric-no-rescue-slack,align,99.9995,**79.979**,0,6.21,1194.99
chm13_chrX_Y,125,strobealign-symmetric,align,100.0,82.0875,0,11.99,1196.45
chm13_chrX_Y,125,strobealign-symmetric-no-rescue-slack,align,100.0,82.085,0,7.99,1201.92
chm13_chrX_Y,150,strobealign-symmetric,align,100.0,82.7435,0,13.02,1198.98
chm13_chrX_Y,150,strobealign-symmetric-no-rescue-slack,align,100.0,**82.891**,0,10.02,1205.21
chm13_chrX_Y,250,strobealign-symmetric,align,99.999,85.882,0,20.06,1209.32
chm13_chrX_Y,250,strobealign-symmetric-no-rescue-slack,align,99.999,**86.1335**,0,19.2,1220.66
chm13_chrX_Y,500,strobealign-symmetric,align,99.998,87.9965,0,37.28,969.72
chm13_chrX_Y,500,strobealign-symmetric-no-rescue-slack,align,99.998,**88.4225**,0,32.64,987.56
tool,ref,%-aligned,accuracy,time(sec),Mem(MB)
chm13_chrX_Y,50,strobealign-symmetric_SE,align,48.9205,**36.306**,0,3.31,1187.66
chm13_chrX_Y,50,strobealign-symmetric-no-rescue_SE,align,48.9205,36.2685,0,1.75,1187.69
chm13_chrX_Y,50,strobealign-symmetric-no-rescue-slack_SE,align,48.9205,36.2665,0,1.72,1187.57
chm13_chrX_Y,75,strobealign-symmetric_SE,align,49.8825,38.312,0,3.31,1187.1
chm13_chrX_Y,75,strobealign-symmetric-no-rescue_SE,align,49.8825,**38.3335**,0,1.74,1188.0
chm13_chrX_Y,75,strobealign-symmetric-no-rescue-slack_SE,align,49.8825,38.323,0,1.74,1188.52
chm13_chrX_Y,125,strobealign-symmetric_SE,align,49.9745,39.5835,0,3.72,1190.74
chm13_chrX_Y,125,strobealign-symmetric-no-rescue_SE,align,49.9745,39.5495,0,1.99,1190.32
chm13_chrX_Y,125,strobealign-symmetric-no-rescue-slack_SE,align,49.9745,**39.632**,0,2.02,1190.86
chm13_chrX_Y,150,strobealign-symmetric_SE,align,49.995,39.985,0,4.57,1192.77
chm13_chrX_Y,150,strobealign-symmetric-no-rescue_SE,align,49.995,39.9595,0,2.28,1195.56
chm13_chrX_Y,150,strobealign-symmetric-no-rescue-slack_SE,align,49.995,**40.004**,0,2.39,1195.71
chm13_chrX_Y,250,strobealign-symmetric_SE,align,49.999,41.1665,0,6.74,1197.88
chm13_chrX_Y,250,strobealign-symmetric-no-rescue_SE,align,49.999,41.2015,0,4.49,1210.02
chm13_chrX_Y,250,strobealign-symmetric-no-rescue-slack_SE,align,49.999,**41.2145**,0,4.81,1204.21
chm13_chrX_Y,500,strobealign-symmetric_SE,align,49.999,42.716,0,17.1,943.99
chm13_chrX_Y,500,strobealign-symmetric-no-rescue_SE,align,49.999,**42.8755**,0,12.23,945.01
chm13_chrX_Y,500,strobealign-symmetric-no-rescue-slack_SE,align,49.999,42.8645,0,12.53,958.21
(CC @Itolstoganov)
The idea of joining nearby NAMs for increased score/power and to reduce redundant alignments seem logical. However, it hasn’t worked as effectively as expected. I found one reason, that I describe in this comment: https://github.com/ksahlin/strobealign/commit/3d9632189c483be06b00eb421728f0d61f9e6920#commitcomment-140411641
Even after fixing as above, the idea is still not giving a great benefit on a smaller benchmark, but there could be remaining issues with the logic, or the testing I did is not representative of what happens on the main branch implementation (described in comment).