ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers

MIT License

128 stars 16 forks source link

Use xxh64 to hash the s-mer in syncmers #313

Closed marcelm closed 1 year ago

marcelm commented 1 year ago

Single-end accuracy

dataset	main (37f30e5)	this PR (6c67bfc)
drosophila-50	82.018	82.5118 (+0.4938)
drosophila-100	89.1866	89.2585 (+0.0719)
drosophila-300	93.2196	93.2169 (-0.0027)
CHM13-50	81.1589	81.691 (+0.5321)
CHM13-100	90.5471	90.6392 (+0.0921)
CHM13-150	92.377	92.399 (+0.0220)
CHM13-200	93.2054	93.2258 (+0.0204)
CHM13-300	94.1524	94.1503 (-0.0021)
maize-50	47.1753	47.4174 (+0.2421)
maize-100	70.4836	70.5 (+0.0164)
maize-200	86.6856	86.701 (+0.0154)
rye-50	44.4281	44.6869 (+0.2588)
rye-100	69.2452	69.3358 (+0.0906)

ksahlin commented 1 year ago

Awesome, approved!

marcelm commented 1 year ago

While measuring memory usage for #314, I also got numbers for how much memory increases due to this PR (XXH64). Here are the numbers for completeness.

Change in memory usage (MB)

dataset	before (e0764b6)	after (e32475e)	difference
drosophila-50	709.043	737.812	+29
drosophila-100	736.922	770.031	+33
drosophila-200	779.059	817.492	+38
drosophila-300	841.98	863.266	+21
maize-50	9661.88	10008.3	+346
maize-100	9670.18	10017.2	+347
maize-200	9677.71	10025.3	+348
maize-300	9680.55	10025.9	+345
CHM13-50	13977.4	14694.7	+717
CHM13-100	13968.8	14685.5	+717
CHM13-200	14005.1	14717.6	+712
CHM13-300	14012.8	14789	+776
rye-50	32938.5	34110.6	+1172
rye-100	32981.6	34155.2	+1174

It’s not that much, but I feel a little bit bad because we’re again eating up some of the memory savings from #278. It’ll be a bit weird in the changelog: "We reduced memory usage from 23 GiB to 13 GiB, but then made other changes and it’s now back at 14.7 GiB" ...

ksahlin commented 1 year ago

I don't feel bad about it :) PR https://github.com/ksahlin/strobealign/pull/278 allows us to

(i) change to 64-bit pointers (which robin hood couldn't do) which explains part of the re-increase. (ii) Allows us to have an 'unbiased' syncmer sampling protocol near expected density that also improves mapping rate and accuracy.

marcelm commented 1 year ago

I only wrote "a little bit bad" ... :smile: No I think it’s fine.

Answering here regarding b tipping over: No, it’s 28 in both cases. It appears to be just the size of the randstrobes vector: For CHM13-100, it contains 576889641 randstrobes before and 623250666 after. With 16 bytes/entry, that’s 740 MB more, so that should explain it.

ksahlin commented 1 year ago

Ah makes sense!

If we are out for saving memory we could always experiment with density (e.g., k-s = 6 instead of 4) but I think it's fine as is. The mapping rate would go down for the shortest reads with more thinning.