ksahlin / strobemers

A repository for generating strobemers and evalaution
75 stars 12 forks source link

Generating Hybridstrobes #3

Closed lutfia95 closed 3 years ago

lutfia95 commented 3 years ago

Hi, how can I use your index.py to generate Hybridstrobes from a given sequence? For example I have a sequence with the parameters: (2,15,20,70) and I want to construct Hybridstrobes (without output in a file, in the console will be also good)

thanks! Ahmad

ksahlin commented 3 years ago

Hi @lutfia95,

As per the documentation, you can download the file indexing.py present in the modules folder in this repository, and then do:

import indexing
all_mers = defaultdict(list)
for (p1,p2), s in indexing.hybridstrobes_iter(seq, k_size, w_min, w_max, order = 2):
    all_mers[s].append( (p1,p2) ) 

or if you want hybridstrobes of order 3:

for (p1,p2, p3), s in indexing.hybridstrobes_iter(seq, k_size, w_min, w_max, order = 3):
    all_mers[s].append( (p1,p2,p3) ) 

this will give you a hash value s representation of the actual sequence of the strobemer, as well as the positions of the strobe. To reconstruct the actual sequence of the strobemer, you can, e.g., do:

for (p1,p2, p3), s in indexing.hybridstrobes_iter(seq, k_size, w_min, w_max, order = 3):
     hybridstrobe = seq[p1:p1+k_size] + seq[p2:p2+k_size] + seq[p3:p3+k_size]
     # do more stuff
ksahlin commented 3 years ago

The above example assumes that you have a script (such as main.py) with the loops. You can of course put the loops directly in the indexing.py script too. In this case, you don't have to write import indexing. This may however obfuscate your code. It depends on your goal.

lutfia95 commented 3 years ago

Thank you for your answer! I am interesting in the actual sequence of the hybridstrobes, in my test I added this loop to indexing.py (calling by python3 indexing.py :

seq = "AATGCTTATGAACGAACGCTCCAATATGAATCAGCTCGTGATTTTTGCTGATAAA"
k_size = 13
w_min = 13
w_max = 21
w = 35
all_mers = defaultdict(list)
for (p1,p2, p3), s in hybridstrobes_iter(seq, k_size, w_min, w_max, w, order = 2):
    hybridstrobe = seq[p1:p1+k_size] + seq[p2:p2+k_size] + seq[p3:p3+k_size]

running this script are causing an error in the console:

Traceback (most recent call last):
  File "indexing.py", line 659, in <module>
    for (p1,p2, p3), s in hybridstrobes_iter(seq, k_size, w_min, w_max, w, order = 2):
  File "indexing.py", line 453, in hybridstrobes_iter
    yield positions, m
NameError: name 'positions' is not defined
ksahlin commented 3 years ago

I'm assuming you have modified the indexing.py script because line 453 in the error message does not correspond to line 453 in the indexing.py module in the repo. That is one of the reasons not to modify it, but instead, call the original indexing.py from a main.py.

However, yes there was a bug. I have now fixed it in commit 9d05e2f. it should say yield p, m not yield positions, m on line 453 in your script.

Also note that w is the thinning parameter. If you set w this high it will probably create only a couple of hybridstrobes from your example sequence. To generate hybridstrobes without thinning (like how regular kmers are generated), you would have to set w=1.

lutfia95 commented 3 years ago

Thanks