louisabraham / pydivsufsort

Python package for string algorithms ➰
MIT License

Performance Concerns #5

Closed seanlaw closed 4 years ago

seanlaw commented 4 years ago

@louisabraham Based on your original Python implementation, I played around with a naive NumPy implementation. For an input array of length 300_000_000, my NumPy implementation takes around 20 minutes, while the libdivsufsort version takes 100 minutes.

Of course, the memory usage of libdivsufsort is very small (almost an order of magnitude less than the NumPy version), but I expected its computation time to be much, much better.

louisabraham commented 4 years ago

Of course it is supposed to be much better. Did you by chance use a random string? My code is specially optimized for that.

louisabraham commented 4 years ago

Do you have a smaller example I can test on? Please include your numpy code.

Also, the memory usage of my code should be linear with a small constant if you use np.unique and not a dict.
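As a rough illustration of that point (a sketch under the assumption that the original pure-Python version keyed values through a dict; this is not the package's actual code), the two keying strategies can be contrasted like this:

```python
import numpy as np

def to_int_keys_dict(l):
    # dict-based keying: builds a Python dict mapping each distinct
    # value to its rank among the sorted distinct values
    index = {v: i for i, v in enumerate(sorted(set(l)))}
    return [index[v] for v in l]

def to_int_keys_np(l):
    # vectorized equivalent: np.unique sorts and deduplicates, and
    # return_inverse yields each element's rank directly, keeping
    # memory linear with a small constant
    return np.unique(l, return_inverse=True)[1]

data = [30, 10, 30, 20]
print(to_int_keys_dict(data))                     # [2, 0, 2, 1]
print(to_int_keys_np(np.array(data)).tolist())    # [2, 0, 2, 1]
```

Both produce the same dense integer keys; the np.unique version avoids per-element Python objects and dict overhead.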

seanlaw commented 4 years ago

I basically modified your example.py with

random_string = np.random.randint(100, size=300_000_000, dtype=np.uint8)

So, it's nothing special but it still took close to 2 hours.

seanlaw commented 4 years ago

The naive NumPy implementation (based on your Python version) is:

import numpy as np
import time

def to_int_keys_np(l):
    indices = np.unique(l, return_inverse=True)[1]
    return indices

def suffix_array_np(s):
    n = len(s)
    k = 1
    line = to_int_keys_np(s)
    tmp_line = np.ones(n, dtype=np.int64)
    while max(line) < n - 1:
        tmp_line[:] = -1
        tmp_line[:-k] = line[k:]

        line[:] = (n + 1) * line + tmp_line + 1
        line[:] = to_int_keys_np(line)

        k <<= 1

    return line

if __name__ == '__main__':
    inp = np.random.randint(100, size=300_000_000)
    start = time.time()
    sa_np = suffix_array_np(inp)
    print(time.time() - start)
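A side note on reading the snippet above (a sketch, assuming that, like the original Python version, suffix_array_np returns the rank line[i] of the suffix starting at position i rather than the sorted order; ranks_to_suffix_array is a hypothetical helper, not part of the package): the suffix array proper is the inverse permutation of the ranks, recoverable with np.argsort.

```python
import numpy as np

# Hypothetical helper: if line[i] is the rank of the suffix starting
# at position i (which the loop condition max(line) < n - 1 suggests),
# then the suffix array -- starting positions listed in sorted order --
# is the inverse permutation, i.e. np.argsort of the ranks.
def ranks_to_suffix_array(line):
    return np.argsort(line)

# Tiny hand-checked example for "banana": the sorted suffixes are
# a, ana, anana, banana, na, nana, starting at 5, 3, 1, 0, 4, 2.
ranks = np.array([3, 2, 5, 1, 4, 0])   # rank of the suffix starting at i
print(ranks_to_suffix_array(ranks))    # [5 3 1 0 4 2]
```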

louisabraham commented 4 years ago

Yeah, my implementation uses a really nice trick: the max(line) < n - 1 check exits as soon as all suffix ranks are distinct, which basically kills the complexity for random strings.

Could you test on a constant string? Try a smaller alphabet; my algorithm will take much more time.
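To see why, one can count how many prefix-doubling rounds run before the loop condition trips (a sketch mirroring the loop in suffix_array_np above; doubling_rounds is a hypothetical helper, and exact counts depend on n and the alphabet):

```python
import numpy as np

# Count how many doubling rounds run before every suffix gets a
# distinct rank -- the moment max(line) < n - 1 stops being true.
def doubling_rounds(s):
    n = len(s)
    k = 1
    rounds = 0
    line = np.unique(s, return_inverse=True)[1].astype(np.int64)
    while line.max() < n - 1:
        tmp = np.full(n, -1, dtype=np.int64)
        tmp[:-k] = line[k:]
        line = np.unique((n + 1) * line + tmp + 1, return_inverse=True)[1]
        k <<= 1
        rounds += 1
    return rounds

rng = np.random.default_rng(0)
# random string over 100 symbols: ranks separate after a few rounds
print(doubling_rounds(rng.integers(0, 100, 10_000)))
# constant string: every round is needed, about log2(n) of them
print(doubling_rounds(np.zeros(10_000, dtype=np.uint8)))
```

On a random string the ranks become distinct after only a handful of rounds, while a constant string forces roughly log2(n) rounds, each touching the full array.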

louisabraham commented 4 years ago

Also, I guess you are using a powerful machine with multiple cores, and NumPy can take advantage of parallelization. You should compile libdivsufsort with OpenMP enabled; to do that, just add the option -DUSE_OPENMP=ON to the cmake call.

louisabraham commented 4 years ago

I really cannot reproduce this. There is a loop in example.py that repeats the computation 100 times. Are you sure that is not the cause of your problem?

seanlaw commented 4 years ago

Well, that's embarrassing! I totally overlooked the for-loop. Indeed, that was the issue! I'm glad that was it and not something else. One run took about 67 seconds. Thanks and sorry for the false alarm 👍

louisabraham commented 4 years ago

That's just how performant libdivsufsort is!