ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Large references #305

Closed marcelm closed 10 months ago

marcelm commented 1 year ago

Here is my to-do list for making strobealign work with references with more than $2^{32}$ strobemers. This will resolve #277 and #285.

Much of the above is done already. The question for me is whether we want to release a strobealign version now that unconditionally uses 64-bit bucket start indices, because I see that it is a bit of work to switch to deciding dynamically which index size to use.

For CHM13, the index vector currently needs 1 GiB ($2^{28}$ entries times 4 bytes per entry). That size would double with 64-bit indices, so 2 GiB. Overall memory usage would increase from 13 GiB to 14 GiB. But then, memory usage was 22 GiB before merging #278, so the savings are still huge.
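The arithmetic above can be checked with a short Python sketch. The function and variable names are made up for illustration and are not taken from the strobealign code base; the only inputs are the figures quoted in this comment ($2^{28}$ entries, 4 or 8 bytes per entry, 13 GiB total today). The `fits_in_u32` helper rests on the assumption that a bucket start index is an offset into the strobemer vector and can be as large as the number of strobemers.

```python
GIB = 2**30  # bytes per GiB


def index_vector_bytes(bits: int, bytes_per_entry: int) -> int:
    """Size in bytes of an index vector with 2**bits entries."""
    return (1 << bits) * bytes_per_entry


def fits_in_u32(n_strobemers: int) -> bool:
    """Assumption: start indices are offsets into the strobemer vector
    and may be as large as n_strobemers itself."""
    return n_strobemers <= 2**32 - 1


# CHM13 figures quoted in the comment: 2**28 entries
size_32 = index_vector_bytes(28, 4)  # 32-bit bucket start indices
size_64 = index_vector_bytes(28, 8)  # 64-bit bucket start indices

# Total memory grows by the difference between the two index vectors
total_now_gib = 13
total_64_gib = total_now_gib + (size_64 - size_32) // GIB

print(size_32 // GIB, size_64 // GIB, total_64_gib)
```

A reference with more than $2^{32}$ strobemers makes `fits_in_u32` return `False`, which is the case this issue is about.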

ksahlin commented 1 year ago

I think it's a great idea to unconditionally use 64-bit bucket start indices for now and make a release so that https://github.com/ksahlin/strobealign/issues/277 and https://github.com/ksahlin/strobealign/issues/285 get resolved. Let's do that!

As you say, going from 1 GB to 2 GB in human (or 2 GB to 4 GB in rye) is negligible compared to the memory used by the flat vector.

marcelm commented 10 months ago

When we met yesterday, we decided to close this issue for now, as we think it is quite some work to switch dynamically between 32-bit and 64-bit indices. Always using 64 bits is good enough. Also, we are considering some further memory reductions, which would no longer be possible if the index vector used only 32 bits.