mrmaxguns / wonderwordsmodule

Generate random words and sentences with ease in Python. Be on the lookout for bugfixes and speed improvements in 2.3
https://wonderwords.readthedocs.io
MIT License

Slow when specifying word length. #14

Open · EonYang opened this issue 3 years ago

EonYang commented 3 years ago

Describe the bug

RandomWord.word() is slow when a word length range is specified.

To Reproduce

run this:

from timeit import timeit
from wonderwords import RandomWord

r = RandomWord()

timeit(lambda: r.word(
    include_parts_of_speech=[
        'nouns',
    ],
    word_min_length=3,
    word_max_length=8), number=100)

Output:

4.55993747399998

Expected behavior

A thousand calls should finish in 1 second.

Additional context

I think the code here is not very efficient.

https://github.com/mrmaxguns/wonderwordsmodule/blob/80dcdf5a716f1a84982cadaf2dededed43788e11/wonderwords/random_word.py#L183

It copies all words from the category, then removes them if they don't match the length criteria. Also, removing an item from a list takes O(n) time, where n is the length of the list, making the whole operation O(n^2), where n is the number of words in the category.

IMO, using words = set() instead of words = [] could make it O(n) time, but it's still not fast enough.
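
For illustration, a rough sketch of the difference (category_words here is a hypothetical stand-in for the real category list, not the actual wonderwords internals):

# Hypothetical stand-ins for a category and the length filter.
category_words = ["cat", "house", "electricity", "dog", "mountain"]
word_min_length, word_max_length = 3, 8

# The pattern described above: copy the category into a list, then remove
# non-matching words one by one. Each list.remove() is O(n), so the whole
# loop is O(n^2) in the worst case.
words = list(category_words)
for w in list(words):
    if not (word_min_length <= len(w) <= word_max_length):
        words.remove(w)

# With a set, each removal is O(1), so the same filtering becomes O(n).
words = set(category_words)
for w in list(words):
    if not (word_min_length <= len(w) <= word_max_length):
        words.discard(w)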

When just requesting one word, we shouldn't even read all the words in the category. It's better to pre-sort the words in each category by length, then use bisect to find the start and end indices that match the length criteria, and generate a random integer to access a specific word. This way the overall time complexity is O(log n).
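
A minimal sketch of that idea using the standard library's bisect module (the sorted_words and lengths names are just for illustration; the real category is much larger):

import random
from bisect import bisect_left, bisect_right

# Pre-sort the category once, by word length, and keep a parallel list
# of lengths so bisect can search it directly.
sorted_words = sorted(["cat", "house", "electricity", "dog", "mountain"], key=len)
lengths = [len(w) for w in sorted_words]

def random_word_by_length(min_length: int, max_length: int) -> str:
    # O(log n): locate the index range whose word lengths are in bounds.
    start = bisect_left(lengths, min_length)
    end = bisect_right(lengths, max_length)  # exclusive
    if start >= end:
        raise ValueError("no word matches the length criteria")
    # O(1): pick a random index inside that range without copying anything.
    return sorted_words[random.randrange(start, end)]

print(random_word_by_length(3, 8))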

The above approach doesn't optimize starts_with and ends_with, which would require introducing a trie.

EonYang commented 3 years ago

A working code snippet:

from math import floor
from time import perf_counter
from random import randint, random
from wonderwords import RandomWord

# Sort the noun category once, by word length.
sorted_nouns = sorted(RandomWord()._categories['nouns'], key=len)

def bisect_by_length(arr: list, target_length: int):
    # Leftmost index whose word has length >= target_length.
    l, r = 0, len(arr) - 1
    while l <= r:
        m = l + (r - l) // 2
        if len(arr[m]) < target_length:
            l = m + 1
        else:
            r = m - 1
    return l

def get_word_by_length(min_length: int, max_length: int):
    # start: first index with length >= min_length
    # end: last index with length <= max_length
    start = bisect_by_length(sorted_nouns, min_length)
    end = bisect_by_length(sorted_nouns, max_length + 1) - 1
    # Random index in [start, end], inclusive (assumes at least one word matches).
    i = start + floor(random() * (end - start + 1))
    return sorted_nouns[i]

Evaluate it:

def test_get_word_by_length(number: int = 1000):
    start_ts = perf_counter()
    shortest, longest = len(sorted_nouns[0]), len(sorted_nouns[-1])
    for _ in range(number):
        # Draw a random min/max pair within the available word lengths.
        min_len = randint(shortest, longest)
        max_len = randint(shortest, longest)
        if min_len > max_len:
            min_len, max_len = max_len, min_len
        w = get_word_by_length(min_len, max_len)
        if min_len == max_len == 18:
            # Skip: no word in the list has length exactly 18.
            ...
        else:
            assert min_len <= len(w) <= max_len, 'min: {}, max: {}, actual: {}'.format(
                min_len, max_len, len(w))
    end_ts = perf_counter()
    t = end_ts - start_ts
    print(f'{number} loops took {round(t, 4)} seconds')
    print(f'each loop took {round(t * 1000 / number, 4)} milliseconds')

test_get_word_by_length(1000000)

# output:
# 1000000 loops took 10.2348 seconds
# each loop took 0.0102 milliseconds
mrmaxguns commented 3 years ago

Thank you for your thorough analysis and contribution. I agree; there is much to optimize.

  1. I agree that the categories should be pre-sorted. This could be done when an instance of the class is created (see the sketch after this list).
  2. Why is generating a random integer index faster than random.choice?
  3. I agree with the idea to implement a trie.
  4. Where would the rest of the categorizations (such as RegEx) fit in?
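
To illustrate point 1, a rough sketch of what sorting at construction time could look like (a hypothetical class, not the actual wonderwords internals):

import random
from bisect import bisect_left, bisect_right

class CategorySampler:
    # Hypothetical example: pay the O(n log n) sorting cost once per
    # category when the instance is created.
    def __init__(self, categories):
        self._words = {name: sorted(ws, key=len) for name, ws in categories.items()}
        self._lengths = {name: [len(w) for w in ws] for name, ws in self._words.items()}

    def word(self, category, min_length, max_length):
        # Each call is then only an O(log n) bisect plus an O(1) index.
        words, lengths = self._words[category], self._lengths[category]
        start = bisect_left(lengths, min_length)
        end = bisect_right(lengths, max_length)
        if start >= end:
            raise ValueError("no word matches the length criteria")
        return words[random.randrange(start, end)]

sampler = CategorySampler({"nouns": ["cat", "house", "electricity", "dog"]})
print(sampler.word("nouns", 3, 8))
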
EonYang commented 3 years ago

random.choice would require passing it a slice of the array, and slicing takes O(n) time and O(n) space. Indexing with a random integer takes O(1) time and O(1) space.
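
For example, to pick from an index range [start, end) found by bisect (the names here are hypothetical):

import random

words = sorted(["cat", "house", "electricity", "dog", "mountain"], key=len)
start, end = 1, 4  # hypothetical index range found by bisect

# random.choice needs a sequence, so choosing from a sub-range means
# slicing first; the slice copies end - start elements: O(n) time and space.
print(random.choice(words[start:end]))

# Generating an index directly touches a single element: O(1) time and space.
print(words[random.randrange(start, end)])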

I don't have a good solution for regex. Regex matching is not very efficient in certain scenarios. Maybe give it its own API and let the user decide to use it when performance isn't a concern.

mrmaxguns commented 3 years ago

Thanks for the advice. I created a new branch where I'm going to push optimizations. Only implementing the bisect method gave astronomically better results.

On my computer, the timeit benchmark used to take 5.106667959000333 seconds; now it shows 0.06474015199637506.

mrmaxguns commented 3 years ago

I just implemented a trie. Currently, the system is clunky because for ends_with it simply generates a second trie for all words in reverse. This means that the matching words will then have to be re-reversed to their original states. I'm thinking of a way to forgo this.
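
For anyone following along, here is a minimal sketch of the reversed-trie idea described above (illustrative only, not the actual code on the optimization branch): words go into one trie as-is for starts_with and into a second trie reversed for ends_with, and suffix matches are reversed back afterwards.

import random

class PrefixTrie:
    # Minimal trie: insert words, then collect every word under a prefix.
    def __init__(self):
        self._root = {}

    def insert(self, word):
        node = self._root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def words_with_prefix(self, prefix):
        node = self._root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        # Depth-first collection of every word below this node.
        results, stack = [], [(node, prefix)]
        while stack:
            node, word = stack.pop()
            if "$" in node:
                results.append(word)
            stack.extend((child, word + ch) for ch, child in node.items() if ch != "$")
        return results

words = ["cat", "category", "catalog", "dog", "dialog"]
prefix_trie, reversed_trie = PrefixTrie(), PrefixTrie()
for w in words:
    prefix_trie.insert(w)          # used for starts_with
    reversed_trie.insert(w[::-1])  # used for ends_with

print(random.choice(prefix_trie.words_with_prefix("cat")))
# Suffix query: walk the reversed trie with the reversed suffix, then
# re-reverse the matches to recover the original words.
matches = [m[::-1] for m in reversed_trie.words_with_prefix("log"[::-1])]
print(random.choice(matches))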