ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Use xxh64 to hash the s-mer in syncmers #313

Closed marcelm closed 1 year ago

marcelm commented 1 year ago

Single-end accuracy

dataset main (37f30e5) this PR (6c67bfc)
drosophila-50 82.018 82.5118 (+0.4938)
drosophila-100 89.1866 89.2585 (+0.0719)
drosophila-300 93.2196 93.2169 (-0.0027)
CHM13-50 81.1589 81.691 (+0.5321)
CHM13-100 90.5471 90.6392 (+0.0921)
CHM13-150 92.377 92.399 (+0.0220)
CHM13-200 93.2054 93.2258 (+0.0204)
CHM13-300 94.1524 94.1503 (-0.0021)
maize-50 47.1753 47.4174 (+0.2421)
maize-100 70.4836 70.5 (+0.0164)
maize-200 86.6856 86.701 (+0.0154)
rye-50 44.4281 44.6869 (+0.2588)
rye-100 69.2452 69.3358 (+0.0906)
ksahlin commented 1 year ago

Awesome, approved!

marcelm commented 1 year ago

While measuring memory usage for #314, I also got numbers for how much memory increases due to this PR (XXH64). Here are the numbers for completeness.

Change in memory usage (MB)

dataset before (e0764b6) after (e32475e) difference
drosophila-50 709.043 737.812 +29
drosophila-100 736.922 770.031 +33
drosophila-200 779.059 817.492 +38
drosophila-300 841.98 863.266 +21
maize-50 9661.88 10008.3 +346
maize-100 9670.18 10017.2 +347
maize-200 9677.71 10025.3 +348
maize-300 9680.55 10025.9 +345
CHM13-50 13977.4 14694.7 +717
CHM13-100 13968.8 14685.5 +717
CHM13-200 14005.1 14717.6 +712
CHM13-300 14012.8 14789 +776
rye-50 32938.5 34110.6 +1172
rye-100 32981.6 34155.2 +1174

It’s not that much, but I feel a little bit bad because we’re again eating up some of the memory savings from #278. It’ll be a bit weird in the changelog: "We reduced memory usage from 23 GiB to 13 GiB, but then made other changes and it’s now back at 14.7 GiB" ...

ksahlin commented 1 year ago

I don't feel bad about it :) PR https://github.com/ksahlin/strobealign/pull/278 allows us to

(i) change to 64-bit pointers (which robin hood couldn't do) which explains part of the re-increase. (ii) Allows us to have an 'unbiased' syncmer sampling protocol near expected density that also improves mapping rate and accuracy.

marcelm commented 1 year ago

I only wrote "a little bit bad" ... :smile: No I think it’s fine.

Answering here regarding b tipping over: No, it’s 28 in both cases. It appears to be just the size of the randstrobes vector: For CHM13-100, it contains 576889641 randstrobes before and 623250666 after. With 16 bytes/entry, that’s 740 MB more, so that should explain it.

ksahlin commented 1 year ago

Ah makes sense!

If we are out for saving memory we could always experiment with density (e.g., k-s = 6 instead of 4) but I think it's fine as is. The mapping rate would go down for the shortest reads with more thinning.