vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License
5.57k stars 598 forks source link

Support Fuzzy Matching #31

Open Themandunord opened 6 years ago

Themandunord commented 6 years ago

Hi !

Thanks for this project :)

It can be cool and amazing if you support the same algorithm from IBM Watson conversation when we activated the Fuzzy Matching Option (https://console.bluemix.net/docs/services/conversation/entities.html#defining-entities).

If you have this feature, your tools can be the best tools for entity extraction !

Thanks :)

Themandunord commented 6 years ago

No update ? :)

sachinrawat1903 commented 5 years ago

yes. I am also looking for a way to implement fuzzy match with Flash text.

ssameerr commented 5 years ago

How about we embed FuzzyWuzzy into the code? its the good library that I have come across for Entity extraction.

remiadon commented 5 years ago

@ssameerr I don't think FuzzYWuzzy is compatible with flashtext, as flashtext implements its own Trie data structure, and FuzzyWuzzy is not aware of this kind of data structure

remiadon commented 5 years ago

Recently I have been working on fuzzy matching implemention (via levenshtein distance) in flashtext. The following is a description for bringing levenstein distance integration on the extract_keywords method, but IMO similar concepts can be applied to 'replace_keywords'

My general idea is to rely on the current algorithm the most we can, which means we do not change the way we iterate over the trie, we only "trigger" fuzzy mathing when we reach a non-matching character --> when there is an error while descending in self.keyword_trie_dict Then we try to find the best entry (minimum levenstein distance), with respect to a max_cost parameter (e.g max_cost=2 allows 2 edits)

While this guideline is common to all the approaches I have tried, I came up with two different solutions for "correcting" the input, each having its own pros and cons

First solution

I take the rest of the word until the next delimiter then I give this word_to_match to my levensthein function, which retrieves the node in self.keyword_trie_dict with the lowest levensthein dist Theoritically in terms of performance this is not that bad, because we only fed our distance function with a single word.

On the other hand, this may lead to low quality end results. As the keyword we want to retrieve may be composed on many words, we may have "corrected" single words locally, i.e we stacked local minima, so the keyword we retrieve is not a global minima but an addition of local (per word) minima ... pseudo code:

# idx in the index at which we have the first non-matching character between our input string and the nodes in Trie (~ none of the nodes contains the current char)
# current_dict* is current_dict or current_dict_continued, depending on where we branched
word_to_match = str.partition(sentence[idx:])[0]
closest_node, cost, depth = min(self._correct_word(word_to_match, max_cost=max_cost, start_node=current_dict_*))

Second solution

In the second solution, I pass the entire sequence left to the levenshtein function, and try to retrieve the node having self._keyword in its keys with the minimum leveshtein distance

# idx in the index at which we have the first non-matching character between our input string and the nodes in Trie (~ none of the nodes contains the current char)
# current_dict* is current_dict or current_dict_continued, depending on where we branched
seq_to_match = sentence[idx:]
closest_node, cost, depth = min(self._correct_word(seq_to_match, max_cost=curr_cost, start_node=current_dict_*))

To sumarize, the first solution is better in terms of performance (relying more on the current algorithm than the second), the second one is a proper implemention of levenshtein distance between keywords (and not just words), but will lead to a deacrease concerning performance

Which one you think would be the best for flashtext ? Has anyone a better solution ?

NB. The default value for the max_cost is 0, so if max_cost stays untouched, I do not event trigger fuzy matching, which makes it strictly equivalent to the current algorithm

remiadon commented 5 years ago

@Themandunord Fuzzy matching for extraction has been added in the PR mentioned above I am still working on replacement