example is 12x slower than regex

vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.

MIT License

5.6k stars 601 forks source link

Closed kootenpv closed 7 years ago

kootenpv commented 7 years ago

This code is actually 12x faster:

import re
reg = re.compile("Big Apple|New York")
for line in big_text: # wikipedia dump
    reg.findall(line)

vi3k6i5 commented 7 years ago

Hi @kootenpv Thanks for pointing this out. I should have clarified that Algorithm is slow for small corpus/small number of keywords.

Only when you have large corpus of Text plus multi thousand keywords does the algorithm makes sense.

Author of Gensim @RadimRehurek found that this algorith is 28x faster than Regex for his corpus. https://twitter.com/RadimRehurek/status/904989624589803520

Here is a link to a gist where I ran a regex on Alice In wonderland text:

Let me know if this clarifies your doubt. Thanks :)

vi3k6i5 commented 7 years ago

For Alice in wonderland Text corpus: Doc Length: 8471 characters No of keywords: 1487 words

Regex: 100 loops, best of 3: 17.8 ms per loop FlashText: 100 loops, best of 3: 2.95 ms per loop

the difference will only grow as the corpus size increases and no of keywords increases. This is a linearly growing algorithm.

kootenpv commented 7 years ago

Yea, you're right... it should have been obvious to me! Thanks for pointing it out.

I think that the only thing that matters is dictionary size, not so much text length, because of avoiding backtracking :)

I'll make sure to remember this implementation the next time I need to find/replace a lot of strings!