ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Closed syncmers with lower density #429

Open Daniel-Liu-c0deb0t opened 2 weeks ago

Daniel-Liu-c0deb0t commented 2 weeks ago

Hey it's me again :) I briefly corresponded with @ksahlin on a refined version of closed syncmers with 25-30% lower density for realistic values of k. This might be useful for strobealign.

optimal_closed_syncmers

My code for generating these asymptotically optimal density closed syncmers is here: https://github.com/Daniel-Liu-c0deb0t/dlb-kmer-sampling/blob/main/src/lib.rs#L30 and I can explain the algorithm in more detail if needed.

ksahlin commented 2 weeks ago

Thanks, Daniel, very interesting work! This will be interesting to try.

Context to Daniel's post:

A closed syncmer is a k-mer sampled when the first or last s-mer is the smallest in the window. We currently sample a syncmer when the middle s-mer is the smallest (open syncmer).

I have been testing closed syncmers at several times in strobealign - they never perform quite as well as open syncmers when sampling middle s-mer (we use the third s-smer when density is 1/5). This is expected because open syncmers have better spread (garanteed lower distance bound of 3 when the density is 1/5), showed by Shaw & Yu, 2021. Many traditional closed syncmers (upper panel in Daniels plot) are sampled at distance 1 from each other (i.e., not a good spread).

However, open syncmers come at the cost of not having a window guarantee, so some regions might be sparsely sampled. Daniels' plot shows that we can possibly get both a good spread and the window guarantee to ensure that all regions have enough seeds.