ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Do not count randstrobes #304

Closed by marcelm 1 year ago

marcelm commented 1 year ago

Instead, estimate how many there are as $\text{length}/(k-s+1)$.

See https://github.com/ksahlin/strobealign/pull/278#issuecomment-1587238956
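As a rough illustration of the estimate proposed above, here is a minimal Python sketch. The function name `estimate_randstrobe_count` and the example values for `k` and `s` are assumptions made for illustration; this is not strobealign's actual code, which may round or compute the value differently.

```python
def estimate_randstrobe_count(total_length: int, k: int, s: int) -> int:
    """Estimate the number of randstrobes as length / (k - s + 1).

    Hypothetical helper: rounds up so the result is never below the
    real-valued estimate.
    """
    if not 0 < s <= k:
        raise ValueError("expected 0 < s <= k")
    window = k - s + 1
    # Ceiling division using integers only
    return -(-total_length // window)


# Illustrative values only: 1000 bases with k=20, s=16 gives a divisor of 5
print(estimate_randstrobe_count(1000, k=20, s=16))
```

Rounding up rather than down is a design choice in this sketch: it keeps the estimate on the side of over-allocating, which matches the concern in this thread that an underestimate is the costly case.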

ksahlin commented 1 year ago

Sounds great. But if s-mers with XXhash are implemented (https://github.com/ksahlin/strobealign/issues/216#issuecomment-1592529027), maybe we should add back the small constant C? This assumes that underestimating the number of seeds is really bad, since we would then need to reallocate a larger vector.

marcelm commented 1 year ago

Yes, I have that in mind. I also thought about adding a warning in case the estimate turns out to be too low (asking the user to report this to us). But of course with a large enough C that shouldn’t happen.

ksahlin commented 1 year ago

Sounds good! I think C = the square root of the estimate you wrote above should be enough. Adding a warning in case that turns out to be too small sounds good.
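The padding and warning discussed in this thread could look roughly like the following Python sketch. The names `padded_capacity` and `check_estimate` are hypothetical, and the warning text is made up; the sketch only assumes that C is the integer square root of the estimate and that the actual count becomes known after the seeds have been generated.

```python
import math
import warnings


def padded_capacity(estimate: int) -> int:
    """Return the estimate plus a safety margin C = floor(sqrt(estimate))."""
    return estimate + math.isqrt(estimate)


def check_estimate(capacity: int, actual_count: int) -> bool:
    """Return True if the reserved capacity was sufficient.

    Emits a warning otherwise, since the vector would then have to be
    reallocated with a larger size.
    """
    if actual_count > capacity:
        warnings.warn(
            f"Seed count estimate too low: reserved {capacity}, "
            f"needed {actual_count}. Please report this."
        )
        return False
    return True


capacity = padded_capacity(10_000)  # 10_000 + 100
print(capacity, check_estimate(capacity, 10_050))
```

With an estimate of 10,000 the margin is 100 entries, so the relative overhead of the padding shrinks as the estimate grows.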

ksahlin commented 1 year ago

Not applicable anymore: parallel index creation requires exact counts.