Closed marcelm closed 1 year ago
Awesome, approved!
While measuring memory usage for #314, I also got numbers for how much memory increases due to this PR (XXH64). Here are the numbers for completeness.
| dataset | before (e0764b6) | after (e32475e) | difference |
|---|---:|---:|---:|
| drosophila-50 | 709.043 | 737.812 | +29 |
| drosophila-100 | 736.922 | 770.031 | +33 |
| drosophila-200 | 779.059 | 817.492 | +38 |
| drosophila-300 | 841.98 | 863.266 | +21 |
| maize-50 | 9661.88 | 10008.3 | +346 |
| maize-100 | 9670.18 | 10017.2 | +347 |
| maize-200 | 9677.71 | 10025.3 | +348 |
| maize-300 | 9680.55 | 10025.9 | +345 |
| CHM13-50 | 13977.4 | 14694.7 | +717 |
| CHM13-100 | 13968.8 | 14685.5 | +717 |
| CHM13-200 | 14005.1 | 14717.6 | +712 |
| CHM13-300 | 14012.8 | 14789 | +776 |
| rye-50 | 32938.5 | 34110.6 | +1172 |
| rye-100 | 32981.6 | 34155.2 | +1174 |
It’s not that much, but I feel a little bit bad because we’re again eating up some of the memory savings from #278. It’ll be a bit weird in the changelog: "We reduced memory usage from 23 GiB to 13 GiB, but then made other changes and it’s now back at 14.7 GiB" ...
I don't feel bad about it :) PR https://github.com/ksahlin/strobealign/pull/278 allows us to
(i) change to 64-bit pointers (which robin hood couldn't do), which explains part of the re-increase, and (ii) have an 'unbiased' syncmer sampling protocol near the expected density, which also improves mapping rate and accuracy.
I only wrote "a little bit bad" ... :smile: No, I think it’s fine.
Answering here regarding b tipping over: No, it’s 28 in both cases. It appears to be just the size of the randstrobes vector: for CHM13-100, it contains 576889641 randstrobes before and 623250666 after. With 16 bytes/entry, that’s about 740 MB more, so that should explain it.
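
For reference, the arithmetic behind that estimate can be reproduced with a few lines of Python. This is only a sketch: the two randstrobe counts and the 16 bytes/entry figure are taken from the comment above, while the function and constant names are made up for illustration and are not part of strobealign.

```python
# Entry size quoted in the comment above (bytes per randstrobe).
BYTES_PER_RANDSTROBE = 16


def randstrobe_vector_growth(before: int, after: int,
                             bytes_per_entry: int = BYTES_PER_RANDSTROBE) -> int:
    """Return the extra bytes needed when the vector grows from
    `before` to `after` entries (hypothetical helper)."""
    return (after - before) * bytes_per_entry


# CHM13-100 counts quoted in the comment above.
extra_bytes = randstrobe_vector_growth(576_889_641, 623_250_666)

print(f"{extra_bytes} bytes")
print(f"{extra_bytes / 1e6:.0f} MB (decimal)")
print(f"{extra_bytes / 2**20:.0f} MiB (binary)")
```

Either way of counting megabytes lands in the same range as the +717 measured for CHM13-100 in the table, which supports the explanation that the larger randstrobes vector accounts for the increase.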
Ah makes sense!
If we want to save memory, we could always experiment with density (e.g., k-s = 6 instead of 4), but I think it's fine as is. The mapping rate would go down for the shortest reads with more thinning.
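
To get a rough feel for what such a change would mean, here is a back-of-the-envelope sketch. It relies on two assumptions that are not stated in this thread: that the seeds are open syncmers with an expected density of about 1/(k-s+1) on random sequence, and that the number of randstrobes scales linearly with the number of syncmers. The function names are hypothetical, and real genomes will deviate from this idealized estimate.

```python
def expected_syncmer_density(k_minus_s: int) -> float:
    """Expected fraction of positions sampled as open syncmers,
    assuming random sequence (idealized model, an assumption here)."""
    return 1.0 / (k_minus_s + 1)


def relative_seed_count(old_k_minus_s: int, new_k_minus_s: int) -> float:
    """Ratio of expected seed counts after changing k-s,
    assuming seed count is proportional to syncmer density."""
    return expected_syncmer_density(new_k_minus_s) / expected_syncmer_density(old_k_minus_s)


# Going from k-s = 4 to k-s = 6, as suggested in the comment above.
ratio = relative_seed_count(4, 6)
print(f"expected seed count relative to k-s = 4: {ratio:.2f}")
```

Under these assumptions the randstrobes vector would shrink to roughly five sevenths of its current size, at the cost of the lower mapping rate for short reads mentioned above.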
Single-end accuracy