Closed marcelm closed 1 year ago
Nice, let's discuss this tomorrow. Why is accuracy so high for e.g., rye and maize? Aren't they usually around 60-70% for 50nt reads?
Hmm, let me check, maybe I'm actually reporting mapping rate (I finally wrote a script to generate tables like the above, maybe it is still buggy)
I had a bit more time this afternoon to look at this. I am not sure why this accuracy(/or mapping rate) increase happens. Is it because return strobe1_start + w_min < string_hashes.size();
should actually be return strobe1_start + w_min <= string_hashes.size();
(i.e. off-by-one error)? That is, the possibility that the second strobe is the first syncmer in the last window. If so it is a bug.
If not an off-by-one error, I don't see why using the last syncmer in the read as a seed would match any reference sequence unless it is in the very end of each chromosome, which should be very rare.
Ok, it was mapping rate. This is accuracy.
dataset | e0764b6 | a7d0972 |
---|---|---|
drosophila-50 | 82.018 | 82.0289 (+0.0109) |
drosophila-100 | 89.1866 | 89.1873 (+0.0007) |
drosophila-200 | 91.9617 | 91.9509 (-0.0108) |
drosophila-300 | 93.2196 | 93.2322 (+0.0126) |
maize-50 | 47.1753 | 47.1568 (-0.0185) |
maize-100 | 70.4836 | 70.4523 (-0.0313) |
maize-200 | 86.6856 | 86.6417 (-0.0439) |
maize-300 | 91.8642 | 91.8621 (-0.0021) |
CHM13-50 | 81.1589 | 81.157 (-0.0019) |
CHM13-100 | 90.5471 | 90.5344 (-0.0127) |
CHM13-200 | 93.2054 | 93.1893 (-0.0161) |
CHM13-300 | 94.1524 | 94.1474 (-0.0050) |
rye-50 | 44.4281 | 44.441 (+0.0129) |
rye-100 | 69.2452 | 69.2054 (-0.0398) |
dataset | e0764b6 | a7d0972 |
---|---|---|
drosophila-50 | 90.10385 | 90.0976 (-0.0063) |
drosophila-100 | 92.3704 | 92.37985 (+0.0094) |
drosophila-200 | 93.5246 | 93.52665 (+0.0020) |
drosophila-300 | 95.3644 | 95.36025 (-0.0041) |
maize-50 | 71.2862 | 71.27655 (-0.0096) |
maize-100 | 87.135 | 87.09965 (-0.0353) |
maize-200 | 92.90125 | 92.847 (-0.0543) |
maize-300 | 96.71285 | 96.6977 (-0.0152) |
CHM13-50 | 90.48895 | 90.4824 (-0.0066) |
CHM13-100 | 93.2231 | 93.20945 (-0.0137) |
CHM13-200 | 94.4238 | 94.4007 (-0.0231) |
CHM13-300 | 95.629 | 95.6172 (-0.0118) |
rye-50 | 68.87565 | 68.8712 (-0.0044) |
rye-100 | 85.55535 | 85.42355 (-0.1318) |
Interesting, let us then try to understand the behaviour tomorrow before we make a decision. Although I am already now curious to see what happens when this PR is combined with commit d8dc256 (syncmerhash)
As discussed on meeting, this should not be merged. The seeds in ends are bad and confuses the aligner. We shouls get exact number of reads as N_references *(N_syncmers - w_min)
More from the the meeting:
See #307
I measured mapping runtime only on drosophila 50 and 100, where it increased by ~2%, which I understand because more strobemers are generated per read. ~~But it also increases accuracy, which is nice, but I don’t understand quite why.
The extra randstrobes are just the hash of one syncmer. I would expect that they are only found in the reference if the syncmer in the reference also does not have a partner. Is there some other reason why the accuracy increases or does this (that a syncmer in the reference does not have a partner) just happen often enough?~~
Edit: I measured mapping rate, not accuracy.
Single-end
accuracymapping ratePaired-end
accuracymapping rate