Closed BjornFJohansson closed 1 year ago
Hi Bjorn!
I think suffix arrays could indeed help and I would be glad to spend some time helping on this.
Could you give some concrete example? The best would be to upload a text file with some input-output pairs. But a small toy example could be useful too! In particular, what do you do with overlapping patterns?
Hi and thanks for the quick reply. I pasted the docstring from the module I use for discovering common substrings between two strings longer than a specified limit.
```
Finds all common substrings between stringx and stringy
longer than limit. This function is case sensitive.
The substrings may overlap.

returns a list of tuples describing the substrings
The list is sorted longest -> shortest.

Parameters
----------
stringx : str
stringy : str
limit : int, optional

Returns
-------
list of tuple
    [(startx1, starty1, length1), (startx2, starty2, length2), ...]

    startx1 = start position in x where substring 1 starts
    starty1 = position in y where substring 1 starts
    length1 = length of substring 1

Examples
--------
>>> from pydna.common_sub_strings import common_sub_strings
>>> common_sub_strings("gatgatttcggtagtta", "gtcagtatgtctatctatcgcg", limit=3)
[(1, 6, 3), (7, 17, 3), (10, 4, 3), (12, 3, 3)]

::

    Overlaps   Symbols
    (1, 6, 3)  ---
    (7, 17, 3) +++
    (10, 4, 3) ...
    (12, 3, 3) ===

                ===
    gatgatttcggtagtta   stringx
     ---   +++...

        ...
    gtcagtatgtctatctatcgcg   stringy
       ===---        +++
```
I should perhaps mention that most strings used with this algorithm are rarely longer than 10K chars and that most shared substrings are ~30 to 50 chars long.
Hello, I looked at your algorithm and it is very interesting. I implemented it with a twist in the common_substrings branch that is being tested.

First, one concatenates the two strings with a separator, e.g. `ananas&banana`.

After computing the suffix array using the Skew algorithm and the LCP array using Kasai's algorithm, there is a nice processing step:
For example, the LCP array `1,3,5,3,1,0,0,2,4,2,0,0,0` has 5 segments corresponding to "layers":

```
  5
  -     4
 333    -
 ---   222
11111  ---
-------------
```
Those segments are filtered to remove the ones that are associated with the same endings. For example, the `333` segment corresponds to the 4 occurrences of `ana` and the `222` segment corresponds to the 4 occurrences of `na`. Hence, the `222` segment should be filtered out. This is done with a dictionary that uses the ending position in the string and the length of the segment in the LCP array as keys. It is easy to show that two segments share those two characteristics if and only if they are associated with the same matches. For the LCP array above, we get the segments `11111`, `333` and `5`.
Each segment corresponds to a substring that appears k+1 times, where k is the length of the segment in the LCP array. However, we are only interested in the substrings that appear in both strings. Thus, we have to divide the k+1 occurrences into 2 lists, one for each input string. Finally, we take the product of those two lists and take care of pairs appearing multiple times, when multiple segments overlap: the middle `3` corresponds to the first 3 letters of the `5` segment `anana`. Thus, we use another dictionary to make sure that in case of doubt we only consider the largest substring.
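To make the first two steps concrete, here is a pure-Python sketch (my own illustration, not the branch's Cython code) that builds the suffix array for `ananas&banana` by naive sorting instead of the Skew algorithm, and the LCP array with Kasai's algorithm:

```python
def suffix_array(s):
    # Naive O(n^2 log n) construction, fine for a toy example;
    # the real code uses the Skew algorithm / libdivsufsort.
    return sorted(range(len(s)), key=lambda i: s[i:])


def kasai_lcp(s, sa):
    # Kasai's O(n) algorithm; here lcp[r] is the length of the common
    # prefix of the suffixes at ranks r - 1 and r (and lcp[0] = 0).
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1  # Kasai's invariant: the LCP drops by at most 1
        else:
            h = 0
    return lcp


s = "ananas&banana"
sa = suffix_array(s)
print(sa)                # [6, 12, 10, 8, 0, 2, 4, 7, 11, 9, 1, 3, 5]
print(kasai_lcp(s, sa))  # [0, 0, 1, 3, 5, 3, 1, 0, 0, 2, 4, 2, 0]
```

Up to the indexing convention (and the zero entries contributed by the separator suffix), the values match the `1,3,5,3,1,0,0,2,4,2,0,0,0` array discussed above.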
I reimplemented it, reusing the existing code for computing suffix arrays and the LCP array.
The code is super fast, because libdivsufsort is much faster than any other implementation, but my reimplementation of the above algorithm is also super fast because I used Cython. Furthermore, I used C++ containers like a vector for the stack and I think it really helped with speed.
The original implementation can be improved at step 3. When considering a segment of the LCP array, one does potentially useless work going through that segment to store the matches of the first and second string. This sometimes leads to nothing, if all the matches are in the same string.

There is an easy way to check that in O(1) time: pre-compute (very fast) the cumulative sum `cumsum` of the boolean array `suffix_array < len(s1)`. This boolean array tells us where the matches appear. When considering a span `(start, end)`, we compute `diff = cumsum[end] - cumsum[start]`. If `diff` is either `0` or `end - start`, then all suffixes in the segment belong to the same string and we can skip the whole segment.
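A small sketch of that check (my own toy code, with a naive suffix array standing in for libdivsufsort, assuming the first `len(s1)` positions of the concatenation belong to s1):

```python
import numpy as np

s1, s2 = "banana", "ananas"
s = s1 + "&" + s2

# Naive suffix array, for illustration only.
sa = sorted(range(len(s)), key=lambda i: s[i:])

# cumsum[i] = how many of the first i suffixes (in suffix-array order)
# start inside s1.
in_s1 = np.array([pos < len(s1) for pos in sa], dtype=np.int64)
cumsum = np.concatenate(([0], np.cumsum(in_s1)))


def segment_is_mixed(start, end):
    # O(1) test: does the suffix-array span [start, end) contain
    # suffixes from both input strings?
    diff = cumsum[end] - cumsum[start]
    return 0 < diff < end - start


# The full range obviously mixes both strings...
print(bool(segment_is_mixed(0, len(s))))  # True
# ...while a span of a single suffix never does, so it can be skipped.
print(bool(segment_is_mixed(0, 1)))       # False
```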
I did two tests, one with random strings, and one to illustrate the complexity improvement.
```python
from time import time

import numpy as np

from pydna.common_sub_strings import common_sub_strings
from pydivsufsort import common_substrings


def profile(fun, s1, s2):
    ans = float("inf")
    for _ in range(3):
        t = time()
        solutions = len(fun(s1, s2, limit=15))
        ans = min(ans, time() - t)
    print(fun, solutions, ans)


size = 100_000
s1 = "".join(map(str, np.random.randint(0, 4, size=size, dtype=np.uint8)))
s2 = "".join(map(str, np.random.randint(0, 4, size=size, dtype=np.uint8)))

profile(common_sub_strings, s1, s2)
profile(common_substrings, s1, s2)

reps = 2_000
s1 = "banana" * reps
s2 = "ananas" * reps

profile(common_sub_strings, s1, s2)
profile(common_substrings, s1, s2)
```
```
<function common_sub_strings at 0x110a358b0> 9 1.7738101482391357
<function common_substrings at 0x110aa0280> 9 0.12374401092529297
<function common_sub_strings at 0x110a358b0> 0 1.5403878688812256
<function common_substrings at 0x110aa0280> 0 0.005788087844848633
```
This implementation is about 15x faster than the previous one, and the improvement reaches several hundredfold when the cumsum array is used.
Note that a nice bonus from the cumsum idea is that there is a linear-time algorithm to count the number of common substrings longer than some limit: ignore the segments with value less than the limit and the segments that are included in other segments. Then, for each of the remaining segments, add `diff * (end - start - diff)`.
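A quick sanity check of that formula on a made-up span (the `s1`/`s2` labels below are hypothetical, just marking which input string each suffix in the span comes from):

```python
from itertools import combinations

# A hypothetical suffix-array span of 5 occurrences: 2 from s1, 3 from s2.
span = ["s1", "s2", "s1", "s2", "s2"]
start, end = 0, len(span)
diff = span.count("s1")  # what cumsum[end] - cumsum[start] would give

# Cross pairs (one occurrence in each string), counted two ways:
by_formula = diff * (end - start - diff)
by_brute_force = sum(a != b for a, b in combinations(span, 2))
print(by_formula, by_brute_force)  # 6 6
```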
Thanks! I'll try to run it on my machine right now.
For anyone seeing this exchange: this is not necessary; simply `pip install -U pydivsufsort` should be enough to install all dependencies. See further down.
After some tinkering, I managed to run the code by first pip-installing from GitHub:
```
(bjorn311) bjorn@bjorn-ThinkPad-T450s:~$ pip install 'pydivsufsort @ git+https://github.com/louisabraham/pydivsufsort.git'
Collecting pydivsufsort@ git+https://github.com/louisabraham/pydivsufsort.git
  Cloning https://github.com/louisabraham/pydivsufsort.git to /tmp/pip-install-z195l684/pydivsufsort_7cabef1830ac45fbbfba2b905616102d
  Running command git clone --filter=blob:none --quiet https://github.com/louisabraham/pydivsufsort.git /tmp/pip-install-z195l684/pydivsufsort_7cabef1830ac45fbbfba2b905616102d
  Resolved https://github.com/louisabraham/pydivsufsort.git to commit 1723ca81204ac7b1b5ee1cfc51fbc5d51794054f
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: wheel in ./anaconda3/envs/bjorn311/lib/python3.11/site-packages (from pydivsufsort@ git+https://github.com/louisabraham/pydivsufsort.git) (0.38.4)
Requirement already satisfied: numpy in ./anaconda3/envs/bjorn311/lib/python3.11/site-packages (from pydivsufsort@ git+https://github.com/louisabraham/pydivsufsort.git) (1.24.3)
Building wheels for collected packages: pydivsufsort
  Building wheel for pydivsufsort (pyproject.toml) ... done
  Created wheel for pydivsufsort: filename=pydivsufsort-0.0.7-cp311-cp311-linux_x86_64.whl size=252555 sha256=774ed2b753df61078a9d4ced0928cec339cc5140b05add43dd2b9d686b0f9022
  Stored in directory: /tmp/pip-ephem-wheel-cache-64q45mh7/wheels/97/0d/79/d416fadd80786b875198d517b82c2adec09181b0024704ec41
Successfully built pydivsufsort
Installing collected packages: pydivsufsort
  Attempting uninstall: pydivsufsort
    Found existing installation: pydivsufsort 0.0.6
    Uninstalling pydivsufsort-0.0.6:
      Successfully uninstalled pydivsufsort-0.0.6
Successfully installed pydivsufsort-0.0.7
(bjorn311) bjorn@bjorn-ThinkPad-T450s:~$
```
I then dumped the libdivsufsort and libdivsufsort64 files where dll.py can find them.
I got the files from a Bioconda package (libdivsufsort-2.0.2-hec16e2b_7.tar.bz2) here: https://anaconda.org/bioconda/libdivsufsort/files
There is probably a way to make use of the Bioconda package directly, but I couldn't figure it out yet. Unfortunately, Bioconda only builds for Linux and Mac.
In general, a conda/mamba package would be a good idea, since I suspect many data scientists use Anaconda.
Anyway, it worked very well:
```
runfile('/home/bjorn/Desktop/python_packages/pydna/scripts/LongestCommonSubstring/lcs.py')
Reloaded modules: pydna._pretty, pydna.codon, pydna.utils, pydna, pydna.common_sub_strings
<function common_sub_strings at 0x7f363c217600> 10 1.4096662998199463
<function common_substrings at 0x7f363c214c20> 10 0.1297307014465332
<function common_sub_strings at 0x7f363c217600> 0 0.8529350757598877
<function common_substrings at 0x7f363c214c20> 0 0.006013154983520508
```
A really impressive speed up!
What are you doing??? The whole point of this repository is precisely to save you from all that work by compiling wheels for all platforms. Did you try simply installing with pip? `pip install -U pydivsufsort`
We have extensive testing on a grid of all Python versions from 3.7 to 3.11 and all 3 OSes, with both the 32- and 64-bit versions of Windows.
I think the wheels were not compiled for Python 3.11 before yesterday, but now it should be ok. In any case, `setup.py` is also supposed to let you compile the whole project if you also clone the libdivsufsort git submodule.
Seems that the two functions return the same results in slightly different order. Pydna sorts by length of the shared sequence, longest first.
```python
size = 100_000
s1 = "".join(map(str, np.random.randint(0, 4, size=size, dtype=np.uint8)))
s2 = "".join(map(str, np.random.randint(0, 4, size=size, dtype=np.uint8)))

common_sub_strings(s1, s2, limit=15)
```
```
Out[39]:
[(50477, 91027, 17),
 (5624, 93369, 16),
 (48148, 32086, 16),
 (13103, 86016, 15),
 (35242, 95651, 15),
 (56419, 89541, 15)]
```
```python
common_substrings(s1, s2, limit=15)
```
```
Out[40]:
[(5624, 93369, 16),
 (13103, 86016, 15),
 (35242, 95651, 15),
 (48148, 32086, 16),
 (50477, 91027, 17),
 (56419, 89541, 15)]
```
Did you succeed with the one-line installation?
Indeed. I updated the docstring to match that behavior. You can just add a sort:

```python
substrings.sort(key=lambda x: x[2], reverse=True)
```
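For instance, applying that one-liner to the list from the example above reproduces pydna's longest-first order (Python's sort is stable, so ties keep their original relative order):

```python
# Output of common_substrings from the example above:
substrings = [
    (5624, 93369, 16),
    (13103, 86016, 15),
    (35242, 95651, 15),
    (48148, 32086, 16),
    (50477, 91027, 17),
    (56419, 89541, 15),
]

# Sort by match length, longest first (pydna's convention):
substrings.sort(key=lambda x: x[2], reverse=True)
print(substrings[0])  # (50477, 91027, 17)
```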
Thank you, it worked perfectly! I did the fiddling because the first install was unsuccessful (`pip install pydivsufsort`). On import, dll.py complained about missing files on Linux.
On inspection, the installed directory had no libdivsufsort or libdivsufsort64 files; I'm not sure why.
Now, they are in fact there:
```
(bjorn311) ✔ ~/python_packages/pydna/scripts/LongestCommonSubstring [smallest_rot L|●1…1⚑ 21]
09:50 $ pip install -U pydivsufsort
Collecting pydivsufsort
  Using cached pydivsufsort-0.0.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Requirement already satisfied: wheel in /home/bjorn/anaconda3/envs/bjorn311/lib/python3.11/site-packages (from pydivsufsort) (0.38.4)
Requirement already satisfied: numpy in /home/bjorn/anaconda3/envs/bjorn311/lib/python3.11/site-packages (from pydivsufsort) (1.24.3)
Installing collected packages: pydivsufsort
Successfully installed pydivsufsort-0.0.7
(bjorn311) ✔ ~/python_packages/pydna/scripts/LongestCommonSubstring [smallest_rot L|●1…1⚑ 21]
09:51 $ python
Python 3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:24:40) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pydivsufsort
>>>
[1]+  Stopped                 python
(bjorn311) ✘-TSTP ~/python_packages/pydna/scripts/LongestCommonSubstring [smallest_rot L|●1…1⚑ 21]
09:52 $ python lcs.py
<function common_sub_strings at 0x7f90f1cd0e00> 13 1.3926060199737549
<function common_substrings at 0x7f90f1cf0d60> 13 0.1265397071838379
<function common_sub_strings at 0x7f90f1cd0e00> 0 0.8528392314910889
<function common_substrings at 0x7f90f1cf0d60> 0 0.006813764572143555
(bjorn311) ✔ ~/python_packages/pydna/scripts/LongestCommonSubstring [smallest_rot L|●1…1⚑ 21]
09:52 $
```
Tested with the pydna tests: the first run with the original code and the second with pydivsufsort's common_substrings, no other changes!
```
==================================================================== slowest 10 durations ====================================================================
9.26s call tests/test_module_assembly.py::test_MXblaster1
7.54s call tests/test_module_assembly.py::test_assembly
3.23s call tests/test_module_dseqrecord.py::test_synced
2.39s call tests/test_module_assembly.py::test_marker_replacement_on_plasmid
2.17s call tests/test_module_assembly.py::test_assemble_pGUP1
2.09s call tests/test_module_dseqrecord.py::test_map_pCR_MCT1_HA46
1.90s call tests/test_module_dseqrecord.py::test_features_change_ori
1.70s setup tests/jupyter_test_repr.ipynb::Cell 0
1.53s setup tests/format_sequences.ipynb::Cell 0
1.48s call tests/test_module_dseqrecord.py::test_assemble_YEp24PGK_XK
```
```
==================================================================== slowest 10 durations ====================================================================
1.91s call tests/test_module_dseqrecord.py::test_features_change_ori
1.87s call tests/test_module_assembly.py::test_MXblaster1
1.61s setup tests/format_sequences.ipynb::Cell 0
1.40s setup tests/jupyter_test_repr.ipynb::Cell 0
1.17s setup tests/save_changed_seq.ipynb::Cell 0
1.05s call tests/jupyter_test_repr.ipynb::Cell 25
0.97s call tests/test_module_dseqrecord.py::test_synced
0.94s call tests/test_module_genbankfixer.py::test_pydna_gbtext_clean
0.91s call tests/jupyter_test_repr.ipynb::Cell 29
0.83s call tests/jupyter_test_repr.ipynb::Cell 31
```
> On import, dll.py complained about missing files on Linux.
Yes, that was because I didn't include Python 3.11 in the task that builds the wheels. All good now I guess!
In case you have time, I'm interested to know which part made it faster: the suffix array computation, the compilation of the post-processing, or the cumsum idea. From your measurements, I would guess that the cumsum idea was important, because the order of the tests changed. But I guess other computations are involved, because `test_features_change_ori` didn't change at all.
Btw, I just did a Zenodo release. If you use pydivsufsort in a publication, a citation would be much appreciated!
I think this issue can be closed now!
If there are other string algorithms that are bottlenecks in your code, please post other issues and I will see how I can add them to pydivsufsort.
You can also write me an email!
> Note that a nice bonus from the cumsum idea is that there is a linear-time algorithm to count the number of common substrings more than some limit: ignore the segments with value less than the limit and the segments that are included in other segments. Then for each of the remaining segments, add `diff * (end - start - diff)`.
Thanks to cumsum, the new algorithm is O(n + m) with m the number of matches. Previously it was O(n^2) in some worst-case situations.
Also, in O(n) it is possible to return a list of tuples `(length, count1, count2, pos1)`, storing for each common substring its length, the number of times it appears in s1 and s2, and one occurrence in s1. One could also add the positions in the suffix array, giving a simple algorithm to retrieve all pairs of occurrences if that common substring turns out to be useful to you.
This could be nice for data exploration! What makes the complexity O(n + m) is the enumeration of ALL pairs. The limit parameter sure helps, but then you have to choose its value to control m while keeping useful results.
Hi Björn,
Following your email, I found the bug: I skipped on `if lcp[idx] > 0` instead of `if lcp[idx] == 0` in `repeated_substrings`.
That gave me the occasion to write some automated tests.
This is the code that I tested against:

```python
from collections import defaultdict


def all_common_substrings(s1, s2):
    # Brute force: try every pair of start positions and extend the match.
    ans = defaultdict(int)
    for i in range(len(s1)):
        for j in range(len(s2)):
            for k in range(min(len(s1) - i, len(s2) - j)):
                if s1[i + k] != s2[j + k]:
                    break
            else:
                k += 1
            if k:
                # Key by end positions so extensions of the same match collapse.
                ans[i + k, j + k] = max(ans[i + k, j + k], k)
    return sorted((i - k, j - k, k) for (i, j), k in ans.items())
```
It made me notice that the logic above is flawed in step 2.
This caused errors in your code as well: in the example you sent me by email, the following matches are duplicates.

```
duplicate (9981, 1755, 7710) (9950, 1724, 7741)
duplicate (236, 1991, 49) (0, 1755, 7710)
duplicate (1738, 3493, 49) (0, 1755, 7710)
duplicate (10217, 1991, 49) (9950, 1724, 7741)
duplicate (11719, 3493, 49) (9950, 1724, 7741)
```

That error is caused by the base logic of step 2. To see why it is wrong, let's take a simple example: `cadab` and `cadbd`. pydna returns `[(0, 0, 3), (1, 1, 1), (2, 2, 1), (2, 4, 1), (3, 1, 1), (4, 3, 1)]`. This is wrong because (1, 1, 1) and (2, 2, 1) are implied by (0, 0, 3).
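One way to check that claim mechanically (`implied` is a hypothetical helper of my own, not part of either library): a match is redundant when it lies on the same diagonal as a larger match and fits inside it.

```python
def implied(small, big):
    # (x, y, l) is implied by (X, Y, L) if both matches lie on the same
    # diagonal (x - y == X - Y) and the small one fits inside the big one.
    x, y, l = small
    X, Y, L = big
    return small != big and x - y == X - Y and X <= x and x + l <= X + L


pydna_result = [(0, 0, 3), (1, 1, 1), (2, 2, 1), (2, 4, 1), (3, 1, 1), (4, 3, 1)]
redundant = [m for m in pydna_result if implied(m, (0, 0, 3))]
print(redundant)  # [(1, 1, 1), (2, 2, 1)]
```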
The suffix array is `[3, 1, 7, 9, 4, 0, 6, 10, 2, 8, 5]` and the LCP array is `[1, 2, 0, 1, 0, 3, 0, 1, 1, 0, 0]`.
The smart filtering in step 2 makes us return the ranges `[(5, 7, 3), (0, 3, 1), (3, 5, 1), (7, 10, 1)]`. We ignored the range `(1, 3, 2)`. This initially looks smart, because that range is a match between positions 1 and 7 (hence position 1 of s1) that corresponds to `ad`. This match is not maximal because there is `cad`.
Even if the range `(1, 3, 2)` is ignored, the range `(0, 3, 1)` will still test the match between those positions, but with length 1. However, it is no longer possible to efficiently filter that match on `a` knowing the match on `cad`, because they don't share a beginning or an ending.
I rewrote the function to avoid that filtering at step 2 and added a filtering at step 3 instead.
The new code will be available in version 0.0.8.
It made the algorithm even faster in some cases.
```
<function common_sub_strings at 0x110a358b0> 9 1.7738101482391357
<function common_substrings at 0x110aa0280> 9 0.12374401092529297
<function common_sub_strings at 0x110a358b0> 0 1.5403878688812256
<function common_substrings at 0x110aa0280> 0 0.005788087844848633
```
```
<function common_sub_strings at 0x111febf70> 8 1.790924072265625
<function common_substrings at 0x112003940> 8 0.055631160736083984
<function common_sub_strings at 0x111febf70> 0 1.4228060245513916
<function common_substrings at 0x112003940> 0 0.007861137390136719
```
Hi, I am looking for a faster alternative to a common substrings algorithm for my python bioinformatics package pydna.
I currently use a Python implementation of the Kärkkäinen-Sanders suffix array algorithm, mostly for predicting homologous recombination between DNA molecules. This is surprisingly fast, but profiling revealed that it is still a bottleneck.
My background is not CS, so I wonder if anyone could show me how to use pydivsufsort to get all common substrings between two strings longer than some given cutoff.