ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
142 stars 17 forks source link

Ensure syncmer and randstrobe iterators yield same no. of items #310

Closed marcelm closed 1 year ago

marcelm commented 1 year ago

See #307

I measured mapping runtime only on drosophila 50 and 100, where it increased by ~2%, which I understand because more strobemers are generated per read. ~~But it also increases accuracy, which is nice, but I don’t understand quite why.

The extra randstrobes are just the hash of one syncmer. I would expect that they are only found in the reference if the syncmer in the reference also does not have a partner. Is there some other reason why the accuracy increases or does this (that a syncmer in the reference does not have a partner) just happen often enough?~~

Edit: I measured mapping rate, not accuracy.

Single-end accuracy mapping rate

dataset before (e0764b6) this PR (a7d0972)
drosophila-50 94.4233 94.453 (+0.0297)
drosophila-100 99.4656 99.4696 (+0.0040)
drosophila-200 99.7226 99.7225 (-0.0001)
drosophila-300 99.6863 99.6863 (+0.0000)
maize-50 95.3226 95.6045 (+0.2819)
maize-100 99.4867 99.5624 (+0.0757)
maize-200 99.7433 99.7444 (+0.0011)
maize-300 99.6957 99.6957 (+0.0000)
CHM13-50 94.2214 94.5568 (+0.3354)
CHM13-100 99.4495 99.4839 (+0.0344)
CHM13-200 99.732 99.7357 (+0.0037)
CHM13-300 99.693 99.6931 (+0.0001)
rye-50 94.4464 96.2626 (+1.8162)
rye-100 99.2582 99.6555 (+0.3973)

Paired-end accuracy mapping rate

dataset e0764b6 a7d0972
drosophila-50 99.4863 99.48805 (+0.0018)
drosophila-100 99.80925 99.80935 (+0.0001)
drosophila-200 99.73955 99.73925 (-0.0003)
drosophila-300 99.6903 99.6903 (+0.0000)
maize-50 99.4366 99.4832 (+0.0466)
maize-100 99.8021 99.8055 (+0.0034)
maize-200 99.7529 99.7529 (+0.0000)
maize-300 99.6946 99.6946 (+0.0000)
CHM13-50 99.3389 99.3783 (+0.0394)
CHM13-100 99.80495 99.806 (+0.0011)
CHM13-200 99.74855 99.74865 (+0.0001)
CHM13-300 99.69325 99.69325 (+0.0000)
rye-50 99.2548 99.5159 (+0.2611)
rye-100 99.80015 99.8126 (+0.0125)
ksahlin commented 1 year ago

Nice, let's discuss this tomorrow. Why is accuracy so high for e.g., rye and maize? Aren't they usually around 60-70% for 50nt reads?

marcelm commented 1 year ago

Hmm, let me check, maybe I'm actually reporting mapping rate (I finally wrote a script to generate tables like the above, maybe it is still buggy)

ksahlin commented 1 year ago

I had a bit more time this afternoon to look at this. I am not sure why this accuracy(/or mapping rate) increase happens. Is it because return strobe1_start + w_min < string_hashes.size(); should actually be return strobe1_start + w_min <= string_hashes.size(); (i.e. off-by-one error)? That is, the possibility that the second strobe is the first syncmer in the last window. If so it is a bug.

If not an off-by-one error, I don't see why using the last syncmer in the read as a seed would match any reference sequence unless it is in the very end of each chromosome, which should be very rare.

marcelm commented 1 year ago

Ok, it was mapping rate. This is accuracy.

Single end

dataset e0764b6 a7d0972
drosophila-50 82.018 82.0289 (+0.0109)
drosophila-100 89.1866 89.1873 (+0.0007)
drosophila-200 91.9617 91.9509 (-0.0108)
drosophila-300 93.2196 93.2322 (+0.0126)
maize-50 47.1753 47.1568 (-0.0185)
maize-100 70.4836 70.4523 (-0.0313)
maize-200 86.6856 86.6417 (-0.0439)
maize-300 91.8642 91.8621 (-0.0021)
CHM13-50 81.1589 81.157 (-0.0019)
CHM13-100 90.5471 90.5344 (-0.0127)
CHM13-200 93.2054 93.1893 (-0.0161)
CHM13-300 94.1524 94.1474 (-0.0050)
rye-50 44.4281 44.441 (+0.0129)
rye-100 69.2452 69.2054 (-0.0398)

Paired end

dataset e0764b6 a7d0972
drosophila-50 90.10385 90.0976 (-0.0063)
drosophila-100 92.3704 92.37985 (+0.0094)
drosophila-200 93.5246 93.52665 (+0.0020)
drosophila-300 95.3644 95.36025 (-0.0041)
maize-50 71.2862 71.27655 (-0.0096)
maize-100 87.135 87.09965 (-0.0353)
maize-200 92.90125 92.847 (-0.0543)
maize-300 96.71285 96.6977 (-0.0152)
CHM13-50 90.48895 90.4824 (-0.0066)
CHM13-100 93.2231 93.20945 (-0.0137)
CHM13-200 94.4238 94.4007 (-0.0231)
CHM13-300 95.629 95.6172 (-0.0118)
rye-50 68.87565 68.8712 (-0.0044)
rye-100 85.55535 85.42355 (-0.1318)
ksahlin commented 1 year ago

Interesting, let us then try to understand the behaviour tomorrow before we make a decision. Although I am already now curious to see what happens when this PR is combined with commit d8dc256 (syncmerhash)

ksahlin commented 1 year ago

As discussed on meeting, this should not be merged. The seeds in ends are bad and confuses the aligner. We shouls get exact number of reads as N_references *(N_syncmers - w_min)

marcelm commented 1 year ago

More from the the meeting: