louisabraham / pydivsufsort

Python package for string algorithms ➰
MIT License

Common substrings? #42

Closed BjornFJohansson closed 1 year ago

BjornFJohansson commented 1 year ago

Hi, I am looking for a faster alternative to a common substrings algorithm for my python bioinformatics package pydna.

I currently use a Python implementation of the Kärkkäinen-Sanders suffix array algorithm, mostly for predicting homologous recombination between DNA molecules. This is surprisingly fast, but profiling revealed that it is still a bottleneck.

My background is not CS, so I wonder if anyone could show me how to use pydivsufsort to get all common substrings of two strings that are longer than a given cutoff.

louisabraham commented 1 year ago

Hi Bjorn!

I think suffix arrays could indeed help and I would be glad to spend some time helping on this.

Could you give a concrete example? The best would be to upload a text file with some input-output pairs. But a small toy example could be useful too! In particular, what do you do with overlapping patterns?

BjornFJohansson commented 1 year ago

Hi and thanks for the quick reply. I pasted below the docstring of the module I use for discovering common substrings of two strings that are longer than a specified limit.

Finds all common substrings between stringx and stringy
longer than limit. This function is case sensitive.
The substrings may overlap.

returns a list of tuples describing the substrings
The list is sorted longest -> shortest.

Parameters
----------
stringx : str
stringy : str
limit : int, optional

Returns
-------
list of tuple
    [(startx1, starty1, length1),(startx2, starty2, length2), ...]

    startx1 = start position in x, where substring 1 starts
    starty1 = start position in y, where substring 1 starts
    length1 = length of substring

Examples
--------

>>> from pydna.common_sub_strings import common_sub_strings
>>> common_sub_strings("gatgatttcggtagtta", "gtcagtatgtctatctatcgcg", limit=3)
[(1, 6, 3), (7, 17, 3), (10, 4, 3), (12, 3, 3)]

::

    Overlaps   Symbols
    (1, 6,  3)   ---
    (7, 17, 3)   +++
    (10, 4, 3)   ...
    (12, 3, 3)   ===

                ===
    gatgatttcggtagtta           stringx
     ---   +++...

        ...
    gtcagtatgtctatctatcgcg      stringy
       ===---        +++

BjornFJohansson commented 1 year ago

I should perhaps mention that the strings used with this algorithm are rarely longer than 10K characters and that most shared substrings are ~30 to 50 characters long.

louisabraham commented 1 year ago

Hello, I looked at your algorithm and it is very interesting. I implemented it with a twist in the common_substrings branch that is being tested.

Existing algorithm

First, one concatenates the two strings with a separator, e.g. ananas&banana. After computing the suffix array with the Skew algorithm and the LCP array with Kasai's algorithm, a nice post-processing step follows:

  1. The LCP array is read with a stack to divide it into segments (see the sketch after this list).

For example, the LCP array 1,3,5,3,1,0,0,2,4,2,0,0,0 has 5 segments corresponding to "layers":

  5
  -     4
 333    -
 ---   222
11111  ---
-------------

  2. Those segments are filtered to remove the ones associated with the same endings. For example, the 333 segment corresponds to the 4 occurrences of ana and the 222 segment corresponds to the 4 occurrences of na, so the 222 segment should be filtered out (the 4 segment, nana, is filtered out in the same way because it shares its endings with the 5 segment, anana). This is done with a dictionary that uses the ending position in the string and the length of the segment in the LCP array as keys. It is easy to show that two segments share those two characteristics if and only if they are associated with the same matches. For the LCP array above, we keep the segments 11111, 333 and 5.

  3. Each remaining segment corresponds to a substring that appears k+1 times, where k is the length of the segment in the LCP array. However, we are only interested in substrings that appear in both strings, so the k+1 occurrences are split into 2 lists, one per input string. Finally, we take the product of those two lists and take care of pairs appearing multiple times when segments overlap: for instance, some pairs produced by the 333 segment (ana) are just the first 3 letters of pairs already produced by the 5 segment (anana). Another dictionary makes sure that, in case of doubt, only the longest substring is kept.
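
To make step 1 concrete, here is a rough Python sketch of the stack-based segmentation (just an illustration, not the Cython code in the common_substrings branch; lcp_segments is a made-up name):

def lcp_segments(lcp):
    # Split the LCP array into the nested "layers" described in step 1.
    # Returns (start, end, value) triples such that lcp[start:end] >= value;
    # a segment of end - start LCP entries covers end - start + 1 suffixes.
    segments = []
    stack = []  # (start_index, value) pairs, values strictly increasing
    for i, v in enumerate(list(lcp) + [0]):  # the trailing 0 flushes the stack
        start = i
        while stack and stack[-1][1] > v:
            start, value = stack.pop()
            segments.append((start, i, value))
        if v and (not stack or stack[-1][1] < v):
            stack.append((start, v))
    return segments

# The LCP array above yields the five layers 5, 333, 11111, 4 and 222:
# lcp_segments([1, 3, 5, 3, 1, 0, 0, 2, 4, 2, 0, 0, 0])
# -> [(2, 3, 5), (1, 4, 3), (0, 5, 1), (8, 9, 4), (7, 10, 2)]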

Implementation

I reimplemented it, reusing the existing code for computing suffix arrays and the LCP array.

The code is super fast, because libdivsufsort is much faster than any other implementation, but my reimplementation of the above algorithm is also super fast because I used Cython. Furthermore, I used C++ containers like a vector for the stack and I think it really helped with speed.

Modification of the algorithm

The original implementation can be improved at step 3. When considering a segment of the LCP array, it potentially does useless work going through the segment to collect the matches from the first and the second string: this work sometimes yields nothing, namely when all the matches belong to the same string.

There is an easy way to check this in O(1) time: pre-compute (very fast) the cumulative sum cumsum of the boolean array suffix_array < len(s1), which tells us which suffixes come from the first string. When considering a span (start, end), we compute diff = cumsum[end] - cumsum[start]. If diff is either 0 or end - start, then all the suffixes in the segment belong to the same string and the whole segment can be skipped.
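
In code, the pre-computation and the O(1) check look roughly like this (a sketch only; make_mixed_span_check is not part of the package API):

import numpy as np

def make_mixed_span_check(suffix_array, len_s1):
    # suffix_array is over the concatenation s1 + separator + s2, so a suffix
    # belongs to s1 exactly when it starts before len_s1.
    in_s1 = (np.asarray(suffix_array) < len_s1).astype(np.intp)
    cumsum = np.concatenate(([0], np.cumsum(in_s1)))

    def is_mixed(start, end):
        # Number of suffixes from s1 in the half-open span [start, end).
        diff = cumsum[end] - cumsum[start]
        # diff == 0 or diff == end - start means every suffix comes from the
        # same string, so the whole segment can be skipped.
        return 0 < diff < end - start

    return is_mixed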

Performance

I did two tests, one with random strings, and one to illustrate the complexity improvement.

from time import time
import numpy as np

from pydna.common_sub_strings import common_sub_strings
from pydivsufsort import common_substrings

def profile(fun, s1, s2):
    # Report the number of matches and the best wall-clock time over 3 runs.
    ans = float("inf")
    for _ in range(3):
        t = time()
        solutions = len(fun(s1, s2, limit=15))
        ans = min(ans, time() - t)

    print(fun, solutions, ans)

# Test 1: random strings over a 4-letter alphabet.
size = 100_000
s1 = "".join(map(str, np.random.randint(0, 4, size=size, dtype=np.uint8)))
s2 = "".join(map(str, np.random.randint(0, 4, size=size, dtype=np.uint8)))

profile(common_sub_strings, s1, s2)
profile(common_substrings, s1, s2)

# Test 2: highly repetitive strings with no common substring above the limit,
# to illustrate the complexity improvement.
reps = 2_000
s1 = "banana" * reps
s2 = "ananas" * reps
profile(common_sub_strings, s1, s2)
profile(common_substrings, s1, s2)
<function common_sub_strings at 0x110a358b0> 9 1.7738101482391357
<function common_substrings at 0x110aa0280> 9 0.12374401092529297
<function common_sub_strings at 0x110a358b0> 0 1.5403878688812256
<function common_substrings at 0x110aa0280> 0 0.005788087844848633

This implementation is about 15x faster than the previous one on the first test, and the improvement reaches several hundred times on the second test, where the cumsum shortcut applies.

louisabraham commented 1 year ago

Note that a nice bonus of the cumsum idea is a linear-time algorithm to count the number of common substrings longer than some limit: ignore the segments whose value is below the limit and the segments that are included in other segments; then, for each remaining segment, add diff * (end - start - diff).
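
In pseudo-Python, with segments as in the earlier sketch and the cumsum array from the previous comment (again just an illustration, not the package API):

def count_common_substring_pairs(segments, cumsum, limit):
    # segments: (start, end, value) spans, with nested segments that describe
    # the same matches already removed.
    total = 0
    for start, end, value in segments:
        if value < limit:
            continue
        diff = cumsum[end] - cumsum[start]    # occurrences coming from s1
        total += diff * (end - start - diff)  # pair them with the s2 occurrences
    return total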

BjornFJohansson commented 1 year ago

Thanks! I'll try to run it on my machine right now.

BjornFJohansson commented 1 year ago

For anyone reading this exchange: the steps below are not necessary; simply running pip install -U pydivsufsort should be enough to install everything. See further down.

After some tinkering I managed to run the code by first pip installing from GitHub:

(bjorn311) bjorn@bjorn-ThinkPad-T450s:~$ pip install 'pydivsufsort @ git+https://github.com/louisabraham/pydivsufsort.git'
Collecting pydivsufsort@ git+https://github.com/louisabraham/pydivsufsort.git
  Cloning https://github.com/louisabraham/pydivsufsort.git to /tmp/pip-install-z195l684/pydivsufsort_7cabef1830ac45fbbfba2b905616102d
  Running command git clone --filter=blob:none --quiet https://github.com/louisabraham/pydivsufsort.git /tmp/pip-install-z195l684/pydivsufsort_7cabef1830ac45fbbfba2b905616102d
  Resolved https://github.com/louisabraham/pydivsufsort.git to commit 1723ca81204ac7b1b5ee1cfc51fbc5d51794054f
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: wheel in ./anaconda3/envs/bjorn311/lib/python3.11/site-packages (from pydivsufsort@ git+https://github.com/louisabraham/pydivsufsort.git) (0.38.4)
Requirement already satisfied: numpy in ./anaconda3/envs/bjorn311/lib/python3.11/site-packages (from pydivsufsort@ git+https://github.com/louisabraham/pydivsufsort.git) (1.24.3)
Building wheels for collected packages: pydivsufsort
  Building wheel for pydivsufsort (pyproject.toml) ... done
  Created wheel for pydivsufsort: filename=pydivsufsort-0.0.7-cp311-cp311-linux_x86_64.whl size=252555 sha256=774ed2b753df61078a9d4ced0928cec339cc5140b05add43dd2b9d686b0f9022
  Stored in directory: /tmp/pip-ephem-wheel-cache-64q45mh7/wheels/97/0d/79/d416fadd80786b875198d517b82c2adec09181b0024704ec41
Successfully built pydivsufsort
Installing collected packages: pydivsufsort
  Attempting uninstall: pydivsufsort
    Found existing installation: pydivsufsort 0.0.6
    Uninstalling pydivsufsort-0.0.6:
      Successfully uninstalled pydivsufsort-0.0.6
Successfully installed pydivsufsort-0.0.7
(bjorn311) bjorn@bjorn-ThinkPad-T450s:~$ 

I then dumped the libdivsufsort and libdivsufsort64 files where dll.py can find them:

[screenshot: the libdivsufsort shared libraries copied into the installed pydivsufsort directory]

I got the files from a Bioconda package (libdivsufsort-2.0.2-hec16e2b_7.tar.bz2) here: https://anaconda.org/bioconda/libdivsufsort/files

There is probably a way to make use of the Bioconda package directly, but I couldn't figure out how yet. Unfortunately, Bioconda only builds it for Linux and Mac.

In general a conda/mamba package would be a good idea, since I suspect many data scientists use Anaconda.

Anyway, it worked very well:

runfile('/home/bjorn/Desktop/python_packages/pydna/scripts/LongestCommonSubstring/lcs.py')
Reloaded modules: pydna._pretty, pydna.codon, pydna.utils, pydna, pydna.common_sub_strings
<function common_sub_strings at 0x7f363c217600> 10 1.4096662998199463
<function common_substrings at 0x7f363c214c20> 10 0.1297307014465332
<function common_sub_strings at 0x7f363c217600> 0 0.8529350757598877
<function common_substrings at 0x7f363c214c20> 0 0.006013154983520508

A really impressive speed up!

louisabraham commented 1 year ago

What are you doing??? The whole point of this repository is precisely to save you from all that work by compiling wheels for all platforms. Did you try simply installing using pip? pip install -U pydivsufsort

We have extensive testing on a grid of all Python versions from 3.7 to 3.11 and all 3 operating systems, including both the 32- and 64-bit versions of Windows.

I think the wheels had not been compiled for Python 3.11 before yesterday, but now it should be OK. In any case, setup.py is also supposed to let you compile the whole project yourself if you also clone the libdivsufsort git submodule.

BjornFJohansson commented 1 year ago

It seems that the two functions return the same results in a slightly different order. pydna sorts by length of the shared sequence, longest first.

size = 100_000
s1 = "".join(map(str, np.random.randint(0, 4, size=size, dtype=np.uint8)))
s2 = "".join(map(str, np.random.randint(0, 4, size=size, dtype=np.uint8)))

common_sub_strings(s1, s2, limit=15)
Out[39]: 
[(50477, 91027, 17),
 (5624, 93369, 16),
 (48148, 32086, 16),
 (13103, 86016, 15),
 (35242, 95651, 15),
 (56419, 89541, 15)]

common_substrings(s1, s2, limit=15 )
Out[40]: 
[(5624, 93369, 16),
 (13103, 86016, 15),
 (35242, 95651, 15),
 (48148, 32086, 16),
 (50477, 91027, 17),
 (56419, 89541, 15)]

louisabraham commented 1 year ago

Did you succeed with the one-line installation?

Indeed, only the order differs. I updated the docstring to document that behavior. You can just add a sort:

substrings.sort(key=lambda x: x[2], reverse=True)
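
For example, a small drop-in wrapper on the pydna side could look like this (the name and the default limit are just for illustration):

from pydivsufsort import common_substrings

def common_sub_strings_sorted(stringx, stringy, limit=25):
    # Same tuples as pydna's common_sub_strings, sorted longest match first.
    matches = common_substrings(stringx, stringy, limit=limit)
    matches.sort(key=lambda x: x[2], reverse=True)
    return matches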

BjornFJohansson commented 1 year ago

Thank you, it worked perfectly! I did the fiddling because the first install (pip install pydivsufsort) was unsuccessful: on import, dll.py complained about missing files on Linux.

On inspection, the installed directory had no libdivsufsort or libdivsufsort64 files; I'm not sure why.

Now, they are in fact there:

[screenshot: libdivsufsort and libdivsufsort64 shared libraries present in the installed package directory]

(bjorn311) ✔ ~/python_packages/pydna/scripts/LongestCommonSubstring [smallest_rot L|●1…1⚑ 21] 
09:50 $ pip install -U pydivsufsort
Collecting pydivsufsort
  Using cached pydivsufsort-0.0.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Requirement already satisfied: wheel in /home/bjorn/anaconda3/envs/bjorn311/lib/python3.11/site-packages (from pydivsufsort) (0.38.4)
Requirement already satisfied: numpy in /home/bjorn/anaconda3/envs/bjorn311/lib/python3.11/site-packages (from pydivsufsort) (1.24.3)
Installing collected packages: pydivsufsort
Successfully installed pydivsufsort-0.0.7
(bjorn311) ✔ ~/python_packages/pydna/scripts/LongestCommonSubstring [smallest_rot L|●1…1⚑ 21] 
09:51 $ python
Python 3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:24:40) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pydivsufsort
>>> 
[1]+  Stopped                 python
(bjorn311) ✘-TSTP ~/python_packages/pydna/scripts/LongestCommonSubstring [smallest_rot L|●1…1⚑ 21] 
09:52 $ python lcs.py 
<function common_sub_strings at 0x7f90f1cd0e00> 13 1.3926060199737549
<function common_substrings at 0x7f90f1cf0d60> 13 0.1265397071838379
<function common_sub_strings at 0x7f90f1cd0e00> 0 0.8528392314910889
<function common_substrings at 0x7f90f1cf0d60> 0 0.006813764572143555
(bjorn311) ✔ ~/python_packages/pydna/scripts/LongestCommonSubstring [smallest_rot L|●1…1⚑ 21] 
09:52 $ 

BjornFJohansson commented 1 year ago

Tested with the pydna test suite: the first run below uses the original code and the second uses pydivsufsort's common_substrings, with no other changes!

==================================================================== slowest 10 durations ====================================================================
9.26s call     tests/test_module_assembly.py::test_MXblaster1
7.54s call     tests/test_module_assembly.py::test_assembly
3.23s call     tests/test_module_dseqrecord.py::test_synced
2.39s call     tests/test_module_assembly.py::test_marker_replacement_on_plasmid
2.17s call     tests/test_module_assembly.py::test_assemble_pGUP1
2.09s call     tests/test_module_dseqrecord.py::test_map_pCR_MCT1_HA46
1.90s call     tests/test_module_dseqrecord.py::test_features_change_ori
1.70s setup    tests/jupyter_test_repr.ipynb::Cell 0
1.53s setup    tests/format_sequences.ipynb::Cell 0
1.48s call     tests/test_module_dseqrecord.py::test_assemble_YEp24PGK_XK

==================================================================== slowest 10 durations ====================================================================
1.91s call     tests/test_module_dseqrecord.py::test_features_change_ori
1.87s call     tests/test_module_assembly.py::test_MXblaster1
1.61s setup    tests/format_sequences.ipynb::Cell 0
1.40s setup    tests/jupyter_test_repr.ipynb::Cell 0
1.17s setup    tests/save_changed_seq.ipynb::Cell 0
1.05s call     tests/jupyter_test_repr.ipynb::Cell 25
0.97s call     tests/test_module_dseqrecord.py::test_synced
0.94s call     tests/test_module_genbankfixer.py::test_pydna_gbtext_clean
0.91s call     tests/jupyter_test_repr.ipynb::Cell 29
0.83s call     tests/jupyter_test_repr.ipynb::Cell 31

louisabraham commented 1 year ago

> On import, dll.py complained about missing files on Linux.

Yes, that was because I didn't include Python 3.11 in the task that builds the wheels. All good now, I guess!

In case you have time, I'm interested to know which part made it faster: the suffix array computation, the compiled post-processing, or the cumsum idea. From your measurements, I would guess that the cumsum idea matters because the order of the slowest tests changed. But I guess other computations are involved too, because test_features_change_ori didn't change at all.

Btw, I just did a Zenodo release. If you use pydivsufsort in a publication, a citation would be much appreciated!

louisabraham commented 1 year ago

I think this issue can be closed now!

If there are other string algorithms that are bottlenecks in your code, please post other issues and I will see how I can add them to pydivsufsort.

You can also write me an email!

louisabraham commented 1 year ago

> Note that a nice bonus of the cumsum idea is a linear-time algorithm to count the number of common substrings longer than some limit: ignore the segments whose value is below the limit and the segments that are included in other segments; then, for each remaining segment, add diff * (end - start - diff).

Thanks to cumsum, the new algorithm is O(n + m), where m is the number of matches. Previously it was O(n^2) in some worst-case situations.

Also, in O(n) time it is possible to return a list of (length, count1, count2, pos1) tuples, storing for each common substring its length, the number of times it appears in s1 and s2, and one occurrence in s1. One could also add the positions in the suffix array, which would give a simple algorithm to retrieve all pairs of occurrences later, if that common substring turns out to be useful to you.

This could be nice for data exploration! What makes the complexity O(n + m) is the enumeration of ALL pairs. The limit parameter certainly helps, but then you have to choose its value to control m while keeping useful results.

louisabraham commented 1 year ago

Hi Björn,

Following your email, I found the bug: I skipped if lcp[idx] > 0 instead of if lcp[idx] == 0 in repeated_substrings.

That gave me the opportunity to write some automated tests.

This is the code that I tested against:

from collections import defaultdict


def all_common_substrings(s1, s2):
    # Brute-force reference: for every pair of starting positions, extend the
    # match as far as possible and keep, for every pair of end positions, only
    # the longest match ending there.
    ans = defaultdict(int)
    for i in range(len(s1)):
        for j in range(len(s2)):
            for k in range(min(len(s1) - i, len(s2) - j)):
                if s1[i + k] != s2[j + k]:
                    break
            else:
                k += 1
            if k:
                ans[i + k, j + k] = max(ans[i + k, j + k], k)

    # Convert end positions back to (start in s1, start in s2, length) tuples.
    return sorted((i - k, j - k, k) for (i, j), k in ans.items())

It made me notice that the logic above is flawed in step 2.

This caused errors in your code as well: in the example you sent me by email, the following matches are duplicates.

duplicate (9981, 1755, 7710) (9950, 1724, 7741)
duplicate (236, 1991, 49) (0, 1755, 7710)
duplicate (1738, 3493, 49) (0, 1755, 7710)
duplicate (10217, 1991, 49) (9950, 1724, 7741)
duplicate (11719, 3493, 49) (9950, 1724, 7741)

That error is caused by the very logic of step 2. To see why it is wrong, let's take a simple example: cadab and cadbd. pydna returns [(0, 0, 3), (1, 1, 1), (2, 2, 1), (2, 4, 1), (3, 1, 1), (4, 3, 1)]. This is wrong because (1, 1, 1) and (2, 2, 1) are implied by (0, 0, 3).
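
For reference, the brute-force checker above keeps only the maximal matches on this example:

>>> all_common_substrings("cadab", "cadbd")
[(0, 0, 3), (2, 4, 1), (3, 1, 1), (4, 3, 1)]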

The suffix array is [ 3 1 7 9 4 0 6 10 2 8 5]

The LCP array is [1, 2, 0, 1, 0, 3, 0, 1, 1, 0, 0]

The smart filtering in step 2 makes us return the ranges [(5, 7, 3), (0, 3, 1), (3, 5, 1), (7, 10, 1)] and ignore the range (1, 3, 2). At first sight this is smart: that range is a match between positions 1 and 7 of the concatenation (i.e. position 1 of s1 and position 1 of s2), corresponding to ad, and this match is not maximal because it extends to cad.

But even if the range (1, 3, 2) is ignored, the range (0, 3, 1) will still produce a match between those same positions, only with length 1. And it is no longer possible to efficiently filter out that match on a using the knowledge of the cad match, because the two matches share neither a beginning nor an ending.

I rewrote the function to drop the filtering at step 2 and to filter at step 3 instead.

The new code will be available in version 0.0.8

louisabraham commented 1 year ago

It made the algorithm even faster in some cases.

<function common_sub_strings at 0x110a358b0> 9 1.7738101482391357
<function common_substrings at 0x110aa0280> 9 0.12374401092529297
<function common_sub_strings at 0x110a358b0> 0 1.5403878688812256
<function common_substrings at 0x110aa0280> 0 0.005788087844848633
<function common_sub_strings at 0x111febf70> 8 1.790924072265625
<function common_substrings at 0x112003940> 8 0.055631160736083984
<function common_sub_strings at 0x111febf70> 0 1.4228060245513916
<function common_substrings at 0x112003940> 0 0.007861137390136719