Closed seanlaw closed 4 years ago
The optimal solution (in terms of dtype size) would be to use np.unique. However, a simpler solution that just tests on inp.max() - inp.min()
will be faster.
Since most users don't have a small number of different values covering a big range, this solution is the best.
I would agree with this. I think we just need to show some documentation or a good example that demonstrates how much of a waste it is to have an array like:
x = np.random.randint(0, 2, size=100)
x[0] = 1e30
And one should convert this to integers between 0
and 2
(inclusive).
from time import time
from pydivsufsort import divsufsort
n = 1_000_000
random_string = np.random.randint(255, size=n, dtype=np.uint8)
d = time()
divsufsort(random_string)
print(time() - d)
random_string = random_string.astype(np.uint64)
d = time()
divsufsort(random_string)
print(time() - d)
random_string[0] = -1
d = time()
divsufsort(random_string)
print(time() - d)
0.2969942092895508
0.3129754066467285
1.618661880493164
On such a string, np.unique
takes about 120 ms so it's not worth it.
Sorry, github is currently experiencing problems and I deleted your double comment. It looks like both are gone now.
On such a string, np.unique takes about 120 ms so it's not worth it.
I can't tell from this comment if you think this is too fast or slow? IIRC, np.unique
sorts the array first but, technically, we don't need sorting
120 ms is comparable to the cost of divsufsort on a string with the same dtype. Hence the user will pay in most cases a non negligible cost that there is no way to avoid. There is no way to make np.unique faster or to use it without sorting.
The current version includes an optimization that will work in most cases and doesn't cost much.
Why did you reopen the issue? To remember to put that in an example?
I could add a parameter to launch np.unique but just doing it will take the same amount of code, and I want the interface to be simple.
Cool. I'm with you and agree that a simple interface is preferred.
Sorry for re-opening. I couldn't comment without re-opening. For issues that aren't closed by an actual commit, I tend to prefer to leave it open for at least 14 days (2 weeks) to allow for discussion and additional thoughts/considerations. After 2 weeks, I usually close and make a comment like "feel free to re-open..."
Not trying to dictate here but this approach might be useful in order to avoid constantly opening and closing issues just to be able to comment. It would reduce the amount of email notifications.
According to #3
It would be nice to add a function to handle this automatically behind the scenes