ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Experiment: Asymmetric randstrobe hashes #405

Open marcelm opened 3 months ago

marcelm commented 3 months ago

I’ve written about this by e-mail, but thought I’d show the code as well and a table that shows a comparison of accuracies.

This PR includes two commits.

The first commit 7ca428d changes the formula for computing randstrobe hashes from syncmer1_hash + syncmer2_hash to 2 * syncmer1_hash - syncmer2_hash. Factor 2 is used to avoid a hash of zero when the two syncmer hashes are identical. This is the case at the end of the query when the syncmer is paired with itself due to there not being any downstream syncmers.

The second commit 1960bef introduces a "randstrobe rescue" step where (if no hits can be found), syncmers are paired up with all downstream syncmers between w_min and w_max to produce a set of hashes to look up in the index.

Here are the measured differences in accuracy, with some notable rows highlighted.

dataset d420610 → 7ca428d d420610 → 1960bef 7ca428d → 1960bef
ecoli50-150-se -0.0161 -0.0557 -0.0396
drosophila-50-se -0.3709 +0.0190 +0.3899
drosophila-75-se -0.1181 +0.0280 +0.1461
drosophila-100-se -0.0397 +0.0453 +0.0850
drosophila-150-se -0.0811 -0.0233 +0.0578
drosophila-200-se -0.0008 +0.0158 +0.0166
drosophila-300-se -0.0386 -0.0433 -0.0047
drosophila-500-se -0.0075 -0.0075 +0.0000
maize-50-se -0.1577 -0.0169 +0.1408
maize-75-se -0.1429 -0.0998 +0.0431
maize-100-se -0.1776 -0.1009 +0.0767
maize-150-se +0.0154 +0.0084 -0.0070
maize-200-se +0.0098 +0.0183 +0.0085
maize-300-se +0.0300 +0.0302 +0.0002
maize-500-se +0.0116 +0.0040 -0.0076
CHM13-50-se -0.2992 +0.0183 +0.3175
CHM13-75-se -0.0809 +0.0153 +0.0962
CHM13-100-se -0.1053 -0.0227 +0.0826
CHM13-150-se -0.0910 -0.0567 +0.0343
CHM13-200-se -0.0844 -0.0769 +0.0075
CHM13-300-se -0.1574 -0.1423 +0.0151
CHM13-500-se -0.2339 -0.2445 -0.0106
ecoli50-150-pe -0.0187 -0.0409 -0.0222
drosophila-50-pe -0.0283 +0.0085 +0.0368
drosophila-75-pe -0.0117 -0.0386 -0.0268
drosophila-100-pe -0.0137 -0.0131 +0.0006
drosophila-150-pe -0.0294 +0.0089 +0.0383
drosophila-200-pe -0.0372 -0.0232 +0.0140
drosophila-300-pe -0.0218 -0.0276 -0.0058
drosophila-500-pe -0.0312 -0.0294 +0.0018
maize-50-pe +0.0389 -0.0417 -0.0806
maize-75-pe -0.0629 -0.1999 -0.1370
maize-100-pe -0.0647 -0.1508 -0.0862
maize-150-pe -0.0206 -0.0398 -0.0192
maize-200-pe +0.0066 +0.0002 -0.0063
maize-300-pe -0.0037 +0.0070 +0.0106
maize-500-pe -0.0178 -0.0267 -0.0090
CHM13-50-pe -0.0413 -0.0369 +0.0044
CHM13-75-pe -0.0695 -0.0795 -0.0099
CHM13-100-pe -0.1190 -0.1280 -0.0090
CHM13-150-pe -0.1320 -0.1410 -0.0090
CHM13-200-pe -0.1575 -0.1429 +0.0146
CHM13-300-pe -0.2974 -0.3209 -0.0235
CHM13-500-pe -0.3394 -0.3449 -0.0054
ksahlin commented 3 months ago

Update: The below comment is wrong because I misinterpreted the rightmost column.

This is a great analysis, because I think it sets rough expectations on how much we could hope multi-contexct seeds can improve when we switch to asymmetrical seeds.

I would expect asymmetrical seeds to be performing as well as in the rightmost column (as it uses a related type of rescuing mechanism), but this is not what we saw in our initial implementation of asymmetrical multi-context seeds (CC @Itolstoganov). We should keep this reference benchmark in mind if/when we revert to asymmetrical multi-context seeds.

marcelm commented 3 months ago

I’ve been working on this today. I looked at the set of reads whose alignment becomes incorrect when switching from symmetric to asymmetric randstrobes (using CHM13-500-se). I found that about one third of them are aligned to chrY (and 17% on chr1, 14% on chr9). I believe you mentioned something about PAR/chrX/chrY, so you may have expected this.

Also, NAM rescue was done for ~70% of these reads, while the percentage is ~8% for the full dataset. NAM rescue is done when no NAMs have been found or when the nonrepetitive_fraction is less than 0.7. The distribution of the nonrepetiteveness values is also skewed a lot towards the repetitive end for these reads.

So this points to repetitiveness and repetitiveness filtering being the main issue. I’ll investigate further in that direction.

ksahlin commented 3 months ago

I have a more detailed picture of why this happens and a potential fix(!)

Why it happens: One of the directions (say fw) has many repetitive seeds (for the sake of example, say a nonrepetitive fraction of 0.5 for fw direction). Let's further say that the reverse direction has no repetitive seeds (nonrepetitive_fraction of 1.0). Assuming the sides have the same number of seeds, this leads to a total nonrepetitive_fraction of 0.75, and it passes the threshold.

Consequently, the read will most likely have the highest-scoring NAMs in the reverse direction. If the forward direction was the correct location, we would get a mismapped read. We could probably overcome this by keeping two separate nonrepetitive_fractions (nonrepetitive_fraction_fw and nonrepetitive_fraction_rc) in find_nams. We should then call rescue on either both sides* or one of the sides if one of the sides does not pass the threshold.

(*) to be able to fairly compare the NAM scores between FW and RC as they are different in find_nams and find_nams_rescue.

ksahlin commented 3 months ago

I also notice that I misinterpreted the last column in the table (i.e., https://github.com/ksahlin/strobealign/commit/7ca428d940d7ca8fe9d5648e6d7372961f0e7fcahttps://github.com/ksahlin/strobealign/commit/1960befe6213a675db4aec49653bba127e1e9820).

I previously thought the numbers in this column are the percent point accuracy difference to the baseline (https://github.com/ksahlin/strobealign/commit/d42061067564028a2c91bc4b885a5686c32671e9) when both 'features' are implemented, but I now see it's the difference between the two other columns.

This is a great analysis, because I think it sets rough expectations on how much we could hope multi-contexct seeds can improve when we switch to asymmetrical seeds.

Then, this comment is wrong (from post https://github.com/ksahlin/strobealign/pull/405#issuecomment-1999684363).

(I looked back at the table to check if my potential insight and the "fix" I wrote above relate to the table you posted. But now I realize they are not related and cannot be compared.)

ksahlin commented 3 months ago

Two comments on my proposed fix:

  1. An initial test to check if I am in the right direction with my guess is to put (nams.empty() || nonrepetitive_fraction < 0.9999) to always invoke rescue NAMs if there is any repetitive seed.
  2. My proposed fix is still not guaranteed to produce the same accuracy as symmetric seeds because we still have the hard seed threshold of 1000 which may give the same bias/error as the find_nams function - the reverse direction could have more seeds passing the hard threshold (< 1000) than the forward direction.

Perhaps we can think further on a solution for 2 if we observe that it causes a problem.

ksahlin commented 3 months ago

I had a deeper look into this using simulated reads only from CHM13 ChrX+Y. It is quite easy finding many examples of reads that turn bad when changing from symmetric to asymmetric randstrobe hashes. Many reads seem to suffer from what I predicted above, namely that NAMs from correct region is short (and gets low score) because many surrounding hits are repetitive (masked). In symmetric mode it is fine because all repetitive regions are masked, regardless of direction.

An example read below where symmetric randstrobe hash top-list (the top NAM creates a perfect match (500=):

N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 58217301..58217505, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 37485087..37485291, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 45364578..45364782, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 59060140..59060344, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 45359741..45359945, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 59933869..59934073, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 60531253..60531457, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 57090442..57090646, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 57119469..57119673, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 59999155..59999359, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 59089171..59089375, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 37480250..37480454, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 56669570..56669774, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 58246332..58246536, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 59996735..59996939, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 58793991..58794195, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 40066776..40066980, score=1836)
N Hits in NAM:9 Nam(ref_id=1, query: 163..367, ref: 58045514..58045718, score=1836)

In asymmetric mode, some inversions don't have the repetitive (masked) seeds, and are thus able to create longer and higher scoring NAMs, which makes the true region fall lower in the list and don't make the top M (20) candidates .

The NAM score top-list for the same read with asymmetric randstrobe hashes:

N Hits in NAM:16 Nam(ref_id=1, query: 3..469, ref: 29233156..29233622, score=7456)
N Hits in NAM:15 Nam(ref_id=1, query: 29..489, ref: 61148412..61148872, score=6900)
N Hits in NAM:17 Nam(ref_id=1, query: 75..469, ref: 61401918..61402312, score=6698)
N Hits in NAM:13 Nam(ref_id=1, query: 3..483, ref: 29235577..29236057, score=6240)
N Hits in NAM:13 Nam(ref_id=1, query: 8..483, ref: 29238002..29238477, score=6175)
N Hits in NAM:14 Nam(ref_id=1, query: 8..430, ref: 29218659..29219081, score=5908)
N Hits in NAM:14 Nam(ref_id=1, query: 8..430, ref: 29230746..29231168, score=5908)
N Hits in NAM:12 Nam(ref_id=1, query: 29..489, ref: 28674017..28674476, score=5496
...

First I thought we could revise the NAM score or implement separate counters for tried and thresholds max_tried to separately treat if hamming_align or SSW align were called (which would solve the above case). But these suggestions can only patch the symptom (although implementing separate counters would still be a a good idea, in the general case).

I don't have a solution yet, some more thinking is required. Another line of thought is to revisit why we wanted asymmetric seeds in the first place(?) - was it only to remove the false reverse hits? If so, maybe this can be dealt with in a different way, using symmetric seeds. My implementation of joining nearby NAMs (https://github.com/ksahlin/strobealign/issues/415) seems to work regardless of false reverse hits or not.

Read used in above example pasted below.

@simulated.146/1
TAGATAACATAATACACATTTATATGTATTATGATATATAACAAAGGTACTATATCATATATGATATATAATATATATATTTGTAGTACATATTATAATTTTTATATATTATAATTTATATAATGATAAAATTTTTTATAGAATATAAATAATTATATATGATTATATAATTCTACATATTATAAATGAAAATATGTAGAATTACATGTTTATCTAGTTATATGTATAAAAATATAATTATATATATAGTTATGTATATCACATAATAGACAACATACATATTATATAATGTGTAATATATATTATAGAATATATGATACATGATATATATGATATTTAATATAACACAATATAATCTATATTATAATATTATGTATGTTACTTATATTGGGTGATATGTAATATATATTATGTAATATGAAATAATATAATATATATTATATTATGATATTTTATGTAAATTATGTTATAAAAGTATATATAACATAATATATAGTTAT
+
IHHIHIIHHIIHIIIIIIHIHHHHHIIGHIIGHHHIIHHHGHHHIHGIIIHIHHIIHIIIIGHIGHGIIIIIIIHGHIIGHGIIHGIHIHIHIHIIIIIIIHIIIIGIIIIDEFHGFHAIECGIIIGHIGHFIDFEIIHIGHIIIGDCEGIIIFIDIIIGFIIIIIIIIIIIIIIGFHIIDIGIIEGIGCIGGHFIHIEIFIEHFIIIIDIDDIIIIIGIIIGGBFI@H7EFDIIBIIBIEI?GHDFDEGIHAIGIIIIGEIGIGIIIHIDII;IICIII;=IAHI@IIII>IIII?FGIIAI@C@IBAIDEDII<HHIIIIIFIGIICIDD3BGIGIHIICEIAIIDI@EBB=IHGIIHIIIC@GHECIIIIEIBIC9IIEFGI>HIIGBIEIIICIHICIHI=HIIFDIIICGAII=FIDI@IIII?IIE?IDIIIIIDFI?IIIIIFIEIGEHH>?IIGFIIIIII<B@IIIDIIIBCG/<IH7ICF;HHI>GHII?