marcelm closed this 1 year ago
Sounds great. But if s-mers with XXhash are implemented (https://github.com/ksahlin/strobealign/issues/216#issuecomment-1592529027), maybe we should add back the small constant C? This assumes that underestimating the number of seeds is really bad, since we would then need to reallocate a larger vector.
Yes, I have that in mind. I also thought about adding a warning in case the estimate turns out to be too low (asking the user to report this to us). But of course with a large enough C that shouldn’t happen.
Sounds good! I think C = square root of the estimate you wrote above should be enough. Adding a warning in case that turns out to be too little also sounds good.
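A minimal sketch of the padding idea discussed above, written in Python for brevity rather than in the project's own language. The function names and the warning text are hypothetical and not taken from strobealign; the margin uses the integer (floor) square root of the estimate as C.

```python
import math
import warnings


def reserved_capacity(estimate: int) -> int:
    """Return the estimate plus a safety margin C = floor(sqrt(estimate))."""
    if estimate < 0:
        raise ValueError("estimate must be non-negative")
    return estimate + math.isqrt(estimate)


def check_capacity(actual: int, capacity: int) -> bool:
    """Return True if the actual count fits; otherwise warn and return False."""
    if actual > capacity:
        # Hypothetical message: a reallocation would be needed in this case.
        warnings.warn(
            f"seed count {actual} exceeds reserved capacity {capacity}; "
            "please report this"
        )
        return False
    return True
```

With an estimate of 10000 this reserves room for 10100 entries, so the warning only fires if the true count overshoots the estimate by more than 1%.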
Not applicable anymore: we need exact counts because of parallel index creation.
Instead, estimate how many there are using $\text{length}/(k-s+1)$.
See https://github.com/ksahlin/strobealign/pull/278#issuecomment-1587238956
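As a rough illustration of the $\text{length}/(k-s+1)$ estimate, here is a small Python sketch. It assumes that `length` is the total sequence length in nucleotides and that `k` and `s` are the k-mer and s-mer sizes; the function name is hypothetical, and integer division is an arbitrary rounding choice.

```python
def estimate_count(length: int, k: int, s: int) -> int:
    """Estimate the count as length / (k - s + 1), rounded down."""
    if not 0 < s <= k:
        raise ValueError("expected 0 < s <= k")
    if length < 0:
        raise ValueError("length must be non-negative")
    return length // (k - s + 1)
```

For example, with `k=20` and `s=16` the divisor is 5, so a sequence of 1000 nucleotides gives an estimate of 200.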