louisabraham / pydivsufsort

Python package for string algorithms ➰
MIT License
38 stars 4 forks source link

Add Function to Automatically Transform Input To Use The Smallest dtype #6

Closed seanlaw closed 4 years ago

seanlaw commented 4 years ago

According to #3

a type that uses k bits will need approximately k more time so try to use the smallest type you can

It would be nice to add a function to handle this automatically behind the scenes

louisabraham commented 4 years ago

The optimal solution (in terms of dtype size) would be to use np.unique. However, a simpler solution that just tests on inp.max() - inp.min() will be faster. Since most users don't have a small number of different values covering a big range, this solution is the best.

seanlaw commented 4 years ago

I would agree with this. I think we just need to show some documentation or a good example that demonstrates how much of a waste it is to have an array like:

x = np.random.randint(0, 2, size=100)
x[0] = 1e30

And one should convert this to integers between 0 and 2 (inclusive).

louisabraham commented 4 years ago
from time import time
from pydivsufsort import divsufsort

n = 1_000_000

random_string = np.random.randint(255, size=n, dtype=np.uint8)

d = time()
divsufsort(random_string)
print(time() - d)

random_string = random_string.astype(np.uint64)

d = time()
divsufsort(random_string)
print(time() - d)

random_string[0] = -1

d = time()
divsufsort(random_string)
print(time() - d)
0.2969942092895508
0.3129754066467285
1.618661880493164

On such a string, np.unique takes about 120 ms so it's not worth it.

louisabraham commented 4 years ago

Sorry, github is currently experiencing problems and I deleted your double comment. It looks like both are gone now.

seanlaw commented 4 years ago

On such a string, np.unique takes about 120 ms so it's not worth it.

I can't tell from this comment if you think this is too fast or slow? IIRC, np.unique sorts the array first but, technically, we don't need sorting

louisabraham commented 4 years ago

120 ms is comparable to the cost of divsufsort on a string with the same dtype. Hence the user will pay in most cases a non negligible cost that there is no way to avoid. There is no way to make np.unique faster or to use it without sorting.

The current version includes an optimization that will work in most cases and doesn't cost much.

Why did you reopen the issue? To remember to put that in an example?

louisabraham commented 4 years ago

I could add a parameter to launch np.unique but just doing it will take the same amount of code, and I want the interface to be simple.

seanlaw commented 4 years ago

Cool. I'm with you and agree that a simple interface is preferred.

Sorry for re-opening. I couldn't comment without re-opening. For issues that aren't closed by an actual commit, I tend to prefer to leave it open for at least 14 days (2 weeks) to allow for discussion and additional thoughts/considerations. After 2 weeks, I usually close and make a comment like "feel free to re-open..."

Not trying to dictate here but this approach might be useful in order to avoid constantly opening and closing issues just to be able to comment. It would reduce the amount of email notifications.