marcelm commented 1 year ago

See #307

I measured mapping runtime only on drosophila 50 and 100, where it increased by ~2%, which I understand because more strobemers are generated per read. ~~But it also increases accuracy, which is nice, but I don’t understand quite why.

The extra randstrobes are just the hash of one syncmer. I would expect that they are only found in the reference if the syncmer in the reference also does not have a partner. Is there some other reason why the accuracy increases or does this (that a syncmer in the reference does not have a partner) just happen often enough?~~

Edit: I measured mapping rate, not accuracy.

Single-end accuracy mapping rate

dataset	before (e0764b6)	this PR (a7d0972)
drosophila-50	94.4233	94.453 (+0.0297)
drosophila-100	99.4656	99.4696 (+0.0040)
drosophila-200	99.7226	99.7225 (-0.0001)
drosophila-300	99.6863	99.6863 (+0.0000)
maize-50	95.3226	95.6045 (+0.2819)
maize-100	99.4867	99.5624 (+0.0757)
maize-200	99.7433	99.7444 (+0.0011)
maize-300	99.6957	99.6957 (+0.0000)
CHM13-50	94.2214	94.5568 (+0.3354)
CHM13-100	99.4495	99.4839 (+0.0344)
CHM13-200	99.732	99.7357 (+0.0037)
CHM13-300	99.693	99.6931 (+0.0001)
rye-50	94.4464	96.2626 (+1.8162)
rye-100	99.2582	99.6555 (+0.3973)

Paired-end accuracy mapping rate

dataset	e0764b6	a7d0972
drosophila-50	99.4863	99.48805 (+0.0018)
drosophila-100	99.80925	99.80935 (+0.0001)
drosophila-200	99.73955	99.73925 (-0.0003)
drosophila-300	99.6903	99.6903 (+0.0000)
maize-50	99.4366	99.4832 (+0.0466)
maize-100	99.8021	99.8055 (+0.0034)
maize-200	99.7529	99.7529 (+0.0000)
maize-300	99.6946	99.6946 (+0.0000)
CHM13-50	99.3389	99.3783 (+0.0394)
CHM13-100	99.80495	99.806 (+0.0011)
CHM13-200	99.74855	99.74865 (+0.0001)
CHM13-300	99.69325	99.69325 (+0.0000)
rye-50	99.2548	99.5159 (+0.2611)
rye-100	99.80015	99.8126 (+0.0125)

ksahlin commented 1 year ago

Nice, let's discuss this tomorrow. Why is accuracy so high for e.g., rye and maize? Aren't they usually around 60-70% for 50nt reads?

marcelm commented 1 year ago

Hmm, let me check, maybe I'm actually reporting mapping rate (I finally wrote a script to generate tables like the above, maybe it is still buggy)

ksahlin commented 1 year ago

I had a bit more time this afternoon to look at this. I am not sure why this accuracy(/or mapping rate) increase happens. Is it because return strobe1_start + w_min < string_hashes.size(); should actually be return strobe1_start + w_min <= string_hashes.size(); (i.e. off-by-one error)? That is, the possibility that the second strobe is the first syncmer in the last window. If so it is a bug.

If not an off-by-one error, I don't see why using the last syncmer in the read as a seed would match any reference sequence unless it is in the very end of each chromosome, which should be very rare.

marcelm commented 1 year ago

Ok, it was mapping rate. This is accuracy.

Single end

dataset	e0764b6	a7d0972
drosophila-50	82.018	82.0289 (+0.0109)
drosophila-100	89.1866	89.1873 (+0.0007)
drosophila-200	91.9617	91.9509 (-0.0108)
drosophila-300	93.2196	93.2322 (+0.0126)
maize-50	47.1753	47.1568 (-0.0185)
maize-100	70.4836	70.4523 (-0.0313)
maize-200	86.6856	86.6417 (-0.0439)
maize-300	91.8642	91.8621 (-0.0021)
CHM13-50	81.1589	81.157 (-0.0019)
CHM13-100	90.5471	90.5344 (-0.0127)
CHM13-200	93.2054	93.1893 (-0.0161)
CHM13-300	94.1524	94.1474 (-0.0050)
rye-50	44.4281	44.441 (+0.0129)
rye-100	69.2452	69.2054 (-0.0398)

Paired end

dataset	e0764b6	a7d0972
drosophila-50	90.10385	90.0976 (-0.0063)
drosophila-100	92.3704	92.37985 (+0.0094)
drosophila-200	93.5246	93.52665 (+0.0020)
drosophila-300	95.3644	95.36025 (-0.0041)
maize-50	71.2862	71.27655 (-0.0096)
maize-100	87.135	87.09965 (-0.0353)
maize-200	92.90125	92.847 (-0.0543)
maize-300	96.71285	96.6977 (-0.0152)
CHM13-50	90.48895	90.4824 (-0.0066)
CHM13-100	93.2231	93.20945 (-0.0137)
CHM13-200	94.4238	94.4007 (-0.0231)
CHM13-300	95.629	95.6172 (-0.0118)
rye-50	68.87565	68.8712 (-0.0044)
rye-100	85.55535	85.42355 (-0.1318)

ksahlin commented 1 year ago

Interesting, let us then try to understand the behaviour tomorrow before we make a decision. Although I am already now curious to see what happens when this PR is combined with commit d8dc256 (syncmerhash)

ksahlin commented 1 year ago

As discussed on meeting, this should not be merged. The seeds in ends are bad and confuses the aligner. We shouls get exact number of reads as N_references *(N_syncmers - w_min)

marcelm commented 1 year ago

ksahlin / strobealign

Ensure syncmer and randstrobe iterators yield same no. of items #310

Single-end accuracy mapping rate

Paired-end accuracy mapping rate

Single end

Paired end