ksahlin / strobemers

A repository for generating strobemers and evaluation

Parameters #4

Closed lutfia95 closed 3 years ago

lutfia95 commented 3 years ago

Hi Kristoffer, I'm a little confused about choosing the parameters for hybridstrobes. As far as I know:

Wmin >= k (kmer-size)
Wmax > Wmin

I'm just not sure about choosing the Wmin and Wmax!

Thanks!

ksahlin commented 3 years ago

Hi @lutfia95,

The parameter combination depends on your data. What do you want to do? How long are your sequences? As you can see in the preprint, I used (n=2, k=15, w_min=25, w_max=50) or (n=3, k=10, w_min=25, w_max=50) for simulated data. I have also used w_min=k+1 with w_max set to everything in the interval [w_min+10, 120].

Note 1: In my Python implementation, hybridstrobes are only generated until w_max reaches the end of the sequence, which is different from minstrobes and randstrobes. This is bad for very short sequences, as fewer strobes will be generated, but it does not matter in practice for longer sequences. I'm also writing functions to create hybridstrobes in C++ that do not have this limitation (https://github.com/ksahlin/strobemers/blob/main/strobemers_cpp/index.cpp#L491). Currently, only n=2 and x=3 (window partition for hybridstrobes) are implemented.
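
To make the end-of-sequence behaviour in Note 1 concrete, here is a minimal Python sketch. It is not the repository's implementation. It assumes that an order-2 hybridstrobe splits the window [i+w_min, i+w_max] into x sub-windows, lets the hash of the first strobe pick one of them, and takes the k-mer with the smallest hash there. The names `hybridstrobes_order2` and `kmer_hash`, the CRC32 hash, and the check `k <= w_min < w_max` (taken from the constraints stated in the question) are illustrative assumptions.

```python
import random
import zlib


def kmer_hash(kmer: str) -> int:
    """Deterministic stand-in hash (CRC32); an assumption, not the repo's hash."""
    return zlib.crc32(kmer.encode("ascii"))


def hybridstrobes_order2(seq: str, k: int, w_min: int, w_max: int, x: int = 3):
    """Yield (pos1, pos2) start positions of simplified order-2 hybridstrobes."""
    width = w_max - w_min + 1  # number of candidate start positions per window
    if not (k <= w_min < w_max) or width < x:
        raise ValueError("need k <= w_min < w_max and at least x window positions")
    # Stop as soon as a k-mer starting at i + w_max would run past the sequence end.
    for i in range(len(seq) - w_max - k + 1):
        h1 = kmer_hash(seq[i:i + k])
        r = h1 % x  # sub-window chosen by the first strobe's hash
        lo = i + w_min + (r * width) // x
        hi = i + w_min + ((r + 1) * width) // x
        # Second strobe: the k-mer with the smallest hash inside the chosen sub-window.
        pos2 = min(range(lo, hi), key=lambda j: kmer_hash(seq[j:j + k]))
        yield i, pos2


# Hypothetical usage with the preprint parameters on a random 200 bp sequence.
rng = random.Random(0)
seq = "".join(rng.choices("ACGT", k=200))
strobes = list(hybridstrobes_order2(seq, k=15, w_min=25, w_max=50))
print(len(strobes))  # 200 - 50 - 15 + 1 = 136
```

With these parameters a sequence shorter than w_max + k = 65 bases yields no hybridstrobes at all in this sketch, which is the short-sequence drawback mentioned above.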

Note 2: Randstrobes are better for most applications, and they are also very fast to construct in C/C++ (similar speed to hybridstrobes). So you likely want to use randstrobes if you are using a compiled language. In Python, however, they are relatively slow to generate. My C++ implementation has functions to generate randstrobes of order 2 and 3 that should be relatively optimized for speed.
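
For comparison, a rough Python sketch of order-2 randstrobes follows. It is a simplified illustration under stated assumptions, not the code in this repository: the second strobe is the k-mer in the window that minimizes `(h(m1) + h(mj)) % q`, which is only one possible link function, and the window is assumed to shrink near the end of the sequence rather than stopping early. The names `randstrobes_order2` and `kmer_hash`, the CRC32 hash and the modulus `q` are hypothetical.

```python
import zlib


def kmer_hash(kmer: str) -> int:
    """Deterministic stand-in hash (CRC32); an assumption, not the repo's hash."""
    return zlib.crc32(kmer.encode("ascii"))


def randstrobes_order2(seq: str, k: int, w_min: int, w_max: int, q: int = 2**31 - 1):
    """Yield (pos1, pos2) start positions of simplified order-2 randstrobes."""
    if not (k <= w_min < w_max):
        raise ValueError("need k <= w_min < w_max")
    last_start = len(seq) - k  # last position where a full k-mer still fits
    # Keep going while at least one window position (i + w_min) holds a full k-mer.
    for i in range(last_start - w_min + 1):
        h1 = kmer_hash(seq[i:i + k])
        hi = min(i + w_max, last_start)  # assumed: window shrinks at the sequence end
        # Second strobe: minimize the assumed link function over the window.
        pos2 = min(
            range(i + w_min, hi + 1),
            key=lambda j: (h1 + kmer_hash(seq[j:j + k])) % q,
        )
        yield i, pos2


# Hypothetical usage: a 200 bp sequence gives 200 - 15 - 25 + 1 = 161 randstrobes here.
pairs = list(randstrobes_order2("ACGTTGCA" * 25, k=15, w_min=25, w_max=50))
print(len(pairs))
```

This version rehashes every window k-mer for each position, which is part of why a naive Python loop is slow; hashing each k-mer once up front would be the obvious first optimization.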

lutfia95 commented 3 years ago

Thanks for the answer. I found the C++ implementation from shenwei356 (https://github.com/BGI-Qingdao/strobemer_cpptest); it was nice and fast. I am now interested in hybridstrobes because, when I process huge data sets such as the human reference genome, I think hybridstrobes will be faster than randstrobes. I think I will first try your parameters from the preprint!

Thanks!

lutfia95 commented 3 years ago

Maybe one more question ;) Why is w_max everything in the interval [w_min+10, 120]? What is 120 here?

ksahlin commented 3 years ago

You can set it much larger; I was just writing what I have been testing for biological data. There is no constraint. However, if you set it very large (I'm thinking several thousands of bases), it may be less likely that strobemers match, particularly if there are many indels in the region causing large offsets in the windows between two sequences. This depends on your application and how the sequences differ from each other. I have not explored what the best parametrization is for various data. It may be perfectly fine to set windows of several thousand bases if the sequences mostly differ in SNPs.

lutfia95 commented 3 years ago

Thanks for the answer! I will check it with some ONT data sets. If I find something interesting, I will let you know. I will close the issue. Thanks!