vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License
5.58k stars 598 forks source link

Flashtext seem inferior to hash table - am I missing something? #106

Open Make42 opened 4 years ago

Make42 commented 4 years ago

Maybe I missed something, but I do not understand what the advantage of using the tie dictionary is over using a hashing dictionary is. Can you tell me what I am missing?

Let K denote the number of keywords in the corpus and Q the number of query words for which you request the check for.

I am guessing that https://www.freecodecamp.org/news/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f/ reflects the time per query. So we have query time complexity of O(Q K). I actually think this is weird - instinctually I would have expected O(Q log(K)). Anyhow, just building a hash table would take O(Q).

Regarding build the respective data structures, flashtext should probably take something like Q(K * log(K). The first K is because you need to insert K words for which you need to query the tries data structure. However, since the trie is small at the beginning I put the second K into log. In opposite, the hash table only requires O(1), so we get O(K) in total for inserting the K words.

Assuming I am right: A way to make flashtext useful could be if you build two tries: One for the query words and one for the dictionary. Building the tries would take O(Q log(Q) + K log(K)), which would be slower than O(K) for the hash table. But querying would take about O(log(K) * log(Q)), which would be better the O(Q) if Q is large enough and and K is small enough.

Another application for the principles might be for finding longest common substrings.

These two ideas for improvement would of course need a bit more implementation.

thakur-nandan commented 4 years ago

Hi @Make42

Regarding the advantages of using the trie dictionary is over a hashing dictionary check this StackOverflow answer- https://stackoverflow.com/questions/245878/how-do-i-choose-between-a-hash-table-and-a-trie-prefix-tree.

Kind Regards, Nandan Thakur