Closed kootenpv closed 7 years ago
Hi @kootenpv Thanks for pointing this out. I should have clarified that Algorithm is slow for small corpus/small number of keywords.
Only when you have large corpus of Text plus multi thousand keywords does the algorithm makes sense.
Author of Gensim @RadimRehurek found that this algorith is 28x faster than Regex for his corpus. https://twitter.com/RadimRehurek/status/904989624589803520
Here is a link to a gist where I ran a regex on Alice In wonderland text:
https://gist.github.com/vi3k6i5/88ef0612ea3075a49580b1bd40dfdddc
Let me know if this clarifies your doubt. Thanks :)
For Alice in wonderland Text corpus: Doc Length: 8471 characters No of keywords: 1487 words
Regex: 100 loops, best of 3: 17.8 ms per loop FlashText: 100 loops, best of 3: 2.95 ms per loop
the difference will only grow as the corpus size increases and no of keywords increases. This is a linearly growing algorithm.
Yea, you're right... it should have been obvious to me! Thanks for pointing it out.
I think that the only thing that matters is dictionary size, not so much text length, because of avoiding backtracking :)
I'll make sure to remember this implementation the next time I need to find/replace a lot of strings!
This code is actually 12x faster: