EonYang opened this issue 3 years ago
A working code snippet:
```python
from math import floor
from time import perf_counter
from random import randint, random
from wonderwords import RandomWord

sorted_nouns = sorted(RandomWord()._categories['nouns'], key=len)

def bisect_by_length(arr: list, target_length: int):
    # Leftmost index whose word has length >= target_length.
    l, r = 0, len(arr) - 1
    while l <= r:
        m = l + (r - l) // 2
        if len(arr[m]) < target_length:
            l = m + 1
        else:
            r = m - 1
    return l

def get_word_by_length(min_length: int, max_length: int):
    start = bisect_by_length(sorted_nouns, min_length)
    end = bisect_by_length(sorted_nouns, max_length + 1) - 1  # inclusive
    # end is inclusive, so span end - start + 1 indices; otherwise the
    # last matching word could never be drawn.
    i = start + floor(random() * (end - start + 1))
    return sorted_nouns[i]
```
Evaluate it:
```python
def test_get_word_by_length(number: int = 1000):
    start_ts = perf_counter()
    for _ in range(number):
        min_len = randint(len(sorted_nouns[0]), len(sorted_nouns[-1]))
        max_len = randint(len(sorted_nouns[0]), len(sorted_nouns[-1]))
        if min_len > max_len:
            min_len, max_len = max_len, min_len
        w = get_word_by_length(min_len, max_len)
        if min_len == max_len == 18:
            # no word in the list has length == 18
            pass
        else:
            assert min_len <= len(w) <= max_len, \
                'min: {}, max: {}, actual: {}'.format(min_len, max_len, len(w))
    end_ts = perf_counter()
    t = end_ts - start_ts
    print(f'{number} loops took {round(t, 4)} seconds')
    print(f'each loop took {round(t * 1000 / number, 4)} milliseconds')

test_get_word_by_length(1000000)

# output:
# 1000000 loops took 10.2348 seconds
# each loop took 0.0102 milliseconds
```
Thank you for your thorough analysis and contribution. I agree; there is much to optimize.
`random.choice` requires passing a slice of `arr`, and slicing the array takes O(n) time and O(n) space. A random index takes only O(1) time and O(1) space.
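For illustration, a small sketch of the two approaches on a toy word list (the list and index bounds are my own, not wonderwords' data):

```python
from random import choice, randrange

words = sorted(["ant", "bee", "cat", "lion", "tiger", "giraffe"], key=len)
start, end = 1, 4  # indices bounding the desired length range (assumed known)

# O(n) time and space: the slice copies end - start elements before choosing.
slow = choice(words[start:end])

# O(1) time and space: draw an index directly, then read one element.
fast = words[randrange(start, end)]
```

Both draw uniformly from the same window; only the second avoids the copy.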
I don't have a good solution for regex; it is not very efficient in certain scenarios. Maybe give it its own API and let the user decide to use it when they don't need to worry about performance.
Thanks for the advice. I created a new branch where I'm going to push optimizations. Implementing just the bisect method gave astronomically better results: on my computer it used to take 5.106667959000333 seconds, and now timeit shows 0.06474015199637506.
I just implemented a trie. Currently the system is clunky because, for `ends_with`, it simply builds a second trie over all the words reversed, which means the matching words then have to be reversed back to their original form. I'm thinking of a way to avoid this.
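A minimal sketch of the reversed-trie idea described above (the class and helper names are my own illustration, not the branch's actual code):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words, reverse=False):
    # Insert each word (optionally reversed) one character at a time.
    root = TrieNode()
    for word in words:
        node = root
        for ch in (reversed(word) if reverse else word):
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def _collect(node, prefix):
    # Depth-first walk yielding every word stored under `node`.
    if node.is_word:
        yield prefix
    for ch, child in node.children.items():
        yield from _collect(child, prefix + ch)

def ends_with(suffix, reversed_trie):
    # Walk the suffix back-to-front through the reversed trie; matches come
    # out reversed and must be re-reversed -- the clunky part mentioned above.
    node = reversed_trie
    for ch in reversed(suffix):
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [w[::-1] for w in _collect(node, suffix[::-1])]

trie = build_trie(["cat", "hat", "dog"], reverse=True)
assert sorted(ends_with("at", trie)) == ["cat", "hat"]
```

A suffix query is O(k) in the suffix length plus the size of the matching subtree, instead of scanning every word.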
**Describe the bug**
`RandomWord` is slow.
**To Reproduce**
Run this:
Output:
**Expected behavior**
A thousand calls should finish in 1 second.
**Additional context**
I think the code here is not very efficient.
https://github.com/mrmaxguns/wonderwordsmodule/blob/80dcdf5a716f1a84982cadaf2dededed43788e11/wonderwords/random_word.py#L183
It copies all the words from the category, then removes the ones that don't match the length criteria. Also, removing an item from a list takes O(n) time, where n is the length of the list, which makes the whole operation O(n²), where n is the number of words in the category.
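The quadratic cost of in-place removal can be seen in a small sketch (both helper functions are mine for illustration; neither is wonderwords' actual code):

```python
def filter_by_remove(words, min_len, max_len):
    # Mirrors the linked pattern: copy, then remove() non-matching words.
    # Each remove() shifts the tail of the list (O(n)), so the loop is O(n^2).
    result = list(words)
    for w in list(result):
        if not (min_len <= len(w) <= max_len):
            result.remove(w)
    return result

def filter_by_comprehension(words, min_len, max_len):
    # One pass, no shifting: O(n).
    return [w for w in words if min_len <= len(w) <= max_len]

words = ["ox", "cat", "lion", "tiger", "giraffe"]
assert filter_by_remove(words, 3, 5) == filter_by_comprehension(words, 3, 5)
```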
IMO, using `words = set()` instead of `words = []` could make it O(n) time, but that's still not fast enough. When requesting just one word, we shouldn't even read all the words in the category. It's better to pre-sort the words in every category by length, then use bisect to find the `start_index` and `end_index` that match the length criteria, and generate a random integer to read one specific word. This way the overall time complexity is O(log N).

The above approach doesn't optimize `starts_with` and `ends_with`, which would require introducing a `Trie`.
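As a sketch of the pre-sort-plus-bisect idea (using a toy word list, not wonderwords' data), the stdlib `bisect` module can locate the window of matching lengths; a parallel list of lengths keeps it compatible with any Python version:

```python
from bisect import bisect_left
from random import randrange

# Toy data standing in for a wonderwords category.
nouns = sorted(["ox", "cat", "lion", "tiger", "giraffe", "elephant"], key=len)
lengths = [len(w) for w in nouns]  # parallel list so plain bisect works

def word_of_length(min_len, max_len):
    # O(log n): locate the window of words whose lengths are in range...
    start = bisect_left(lengths, min_len)
    end = bisect_left(lengths, max_len + 1)
    if start >= end:
        return None  # no word in the requested length range
    # ...then O(1): draw one index uniformly inside the window.
    return nouns[randrange(start, end)]

assert 3 <= len(word_of_length(3, 5)) <= 5
```

On Python 3.10+ the `lengths` list could be dropped in favor of `bisect_left(nouns, min_len, key=len)`.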