ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Joining nearby NAMs for combined score #415

Open ksahlin opened 3 months ago

ksahlin commented 3 months ago

The idea of joining nearby NAMs for increased score/power and to reduce redundant alignments seem logical. However, it hasn’t worked as effectively as expected. I found one reason, that I describe in this comment: https://github.com/ksahlin/strobealign/commit/3d9632189c483be06b00eb421728f0d61f9e6920#commitcomment-140411641

Even after fixing as above, the idea is still not giving a great benefit on a smaller benchmark, but there could be remaining issues with the logic, or the testing I did is not representative of what happens on the main branch implementation (described in comment).

ksahlin commented 3 months ago

In the above comment, I had benchmarked with asymmetric seeds and concluded no significant advantage. Asymmetric seeds currently have other issues, which doesn't give a clean benchmark.

Therefore, I decided to implement the above described 'slack' based off the main branch (which uses symmetric seeds).

Below I provide a benchmark on a small chrX+Y dataset using strobealign-symmetric (current main) vs strobealign-symmetric-no-rescue-slack which:

  1. enters find_NAMs_rescue() only rescue if there is no NAM. That is if (nams.empty()) instead of if (nams.empty() || nonrepetitive_fraction < 0.7)
  2. sets slack = read_length/3.

In single end mode I also include strobealign-symmetric-no-rescuewith only includes the no-rescue 'feature'.

I observe quite drastic accuracy improvements on all but the 50nt reads with the strobealign-symmetric-no-rescue-slack . In addition, the no-rescue version is always faster.

It should be carefully noted that this is a small dataset. The rescue might do a better job on a full human genome(?).

PAIRED END READS

tool,ref,%-aligned,accuracy,time(sec),Mem(MB)
chm13_chrX_Y,50,strobealign-symmetric,align,99.929,**78.192**,0,14.09,1192.62
chm13_chrX_Y,50,strobealign-symmetric-no-rescue-slack,align,99.9245,78.1155,0,7.62,1192.88

chm13_chrX_Y,75,strobealign-symmetric,align,100.0,79.8775,0,12.2,1193.8
chm13_chrX_Y,75,strobealign-symmetric-no-rescue-slack,align,99.9995,**79.979**,0,6.21,1194.99

chm13_chrX_Y,125,strobealign-symmetric,align,100.0,82.0875,0,11.99,1196.45
chm13_chrX_Y,125,strobealign-symmetric-no-rescue-slack,align,100.0,82.085,0,7.99,1201.92

chm13_chrX_Y,150,strobealign-symmetric,align,100.0,82.7435,0,13.02,1198.98
chm13_chrX_Y,150,strobealign-symmetric-no-rescue-slack,align,100.0,**82.891**,0,10.02,1205.21

chm13_chrX_Y,250,strobealign-symmetric,align,99.999,85.882,0,20.06,1209.32
chm13_chrX_Y,250,strobealign-symmetric-no-rescue-slack,align,99.999,**86.1335**,0,19.2,1220.66

chm13_chrX_Y,500,strobealign-symmetric,align,99.998,87.9965,0,37.28,969.72
chm13_chrX_Y,500,strobealign-symmetric-no-rescue-slack,align,99.998,**88.4225**,0,32.64,987.56

SINGLE END READS

tool,ref,%-aligned,accuracy,time(sec),Mem(MB)
chm13_chrX_Y,50,strobealign-symmetric_SE,align,48.9205,**36.306**,0,3.31,1187.66
chm13_chrX_Y,50,strobealign-symmetric-no-rescue_SE,align,48.9205,36.2685,0,1.75,1187.69
chm13_chrX_Y,50,strobealign-symmetric-no-rescue-slack_SE,align,48.9205,36.2665,0,1.72,1187.57

chm13_chrX_Y,75,strobealign-symmetric_SE,align,49.8825,38.312,0,3.31,1187.1
chm13_chrX_Y,75,strobealign-symmetric-no-rescue_SE,align,49.8825,**38.3335**,0,1.74,1188.0
chm13_chrX_Y,75,strobealign-symmetric-no-rescue-slack_SE,align,49.8825,38.323,0,1.74,1188.52

chm13_chrX_Y,125,strobealign-symmetric_SE,align,49.9745,39.5835,0,3.72,1190.74
chm13_chrX_Y,125,strobealign-symmetric-no-rescue_SE,align,49.9745,39.5495,0,1.99,1190.32
chm13_chrX_Y,125,strobealign-symmetric-no-rescue-slack_SE,align,49.9745,**39.632**,0,2.02,1190.86

chm13_chrX_Y,150,strobealign-symmetric_SE,align,49.995,39.985,0,4.57,1192.77
chm13_chrX_Y,150,strobealign-symmetric-no-rescue_SE,align,49.995,39.9595,0,2.28,1195.56
chm13_chrX_Y,150,strobealign-symmetric-no-rescue-slack_SE,align,49.995,**40.004**,0,2.39,1195.71

chm13_chrX_Y,250,strobealign-symmetric_SE,align,49.999,41.1665,0,6.74,1197.88
chm13_chrX_Y,250,strobealign-symmetric-no-rescue_SE,align,49.999,41.2015,0,4.49,1210.02
chm13_chrX_Y,250,strobealign-symmetric-no-rescue-slack_SE,align,49.999,**41.2145**,0,4.81,1204.21

chm13_chrX_Y,500,strobealign-symmetric_SE,align,49.999,42.716,0,17.1,943.99
chm13_chrX_Y,500,strobealign-symmetric-no-rescue_SE,align,49.999,**42.8755**,0,12.23,945.01
chm13_chrX_Y,500,strobealign-symmetric-no-rescue-slack_SE,align,49.999,42.8645,0,12.53,958.21

(CC @Itolstoganov)