Open Themandunord opened 6 years ago
No update ? :)
yes. I am also looking for a way to implement fuzzy match with Flash text.
How about we embed FuzzyWuzzy into the code? its the good library that I have come across for Entity extraction.
@ssameerr I don't think FuzzYWuzzy is compatible with flashtext, as flashtext implements its own Trie data structure, and FuzzyWuzzy is not aware of this kind of data structure
Recently I have been working on fuzzy matching implemention (via levenshtein distance
) in flashtext.
The following is a description for bringing levenstein distance integration on the extract_keywords
method, but IMO similar concepts can be applied to 'replace_keywords'
My general idea is to rely on the current algorithm the most we can
, which means we do not change the way we iterate over the trie, we only "trigger" fuzzy mathing when we reach a non-matching character --> when there is an error while descending in self.keyword_trie_dict
Then we try to find the best entry (minimum levenstein distance), with respect to a max_cost
parameter (e.g max_cost=2 allows 2 edits)
While this guideline is common to all the approaches I have tried, I came up with two different solutions for "correcting" the input, each having its own pros and cons
I take the rest of the word until the next delimiter
then I give this word_to_match
to my levensthein function, which retrieves the node in self.keyword_trie_dict
with the lowest levensthein dist
Theoritically in terms of performance this is not that bad, because we only fed our distance function with a single word.
On the other hand, this may lead to low quality end results. As the keyword we want to retrieve may be composed on many words, we may have "corrected" single words locally, i.e we stacked local minima, so the keyword we retrieve is not a global minima but an addition of local (per word) minima ... pseudo code:
# idx in the index at which we have the first non-matching character between our input string and the nodes in Trie (~ none of the nodes contains the current char)
# current_dict* is current_dict or current_dict_continued, depending on where we branched
word_to_match = str.partition(sentence[idx:])[0]
closest_node, cost, depth = min(self._correct_word(word_to_match, max_cost=max_cost, start_node=current_dict_*))
In the second solution, I pass the entire sequence left to the levenshtein function,
and try to retrieve the node having self._keyword
in its keys with the minimum leveshtein distance
# idx in the index at which we have the first non-matching character between our input string and the nodes in Trie (~ none of the nodes contains the current char)
# current_dict* is current_dict or current_dict_continued, depending on where we branched
seq_to_match = sentence[idx:]
closest_node, cost, depth = min(self._correct_word(seq_to_match, max_cost=curr_cost, start_node=current_dict_*))
To sumarize, the first solution is better in terms of performance (relying more on the current algorithm than the second), the second one is a proper implemention of levenshtein distance between keywords (and not just words), but will lead to a deacrease concerning performance
Which one you think would be the best for flashtext ? Has anyone a better solution ?
NB.
The default value for the max_cost
is 0, so if max_cost stays untouched, I do not event trigger fuzy matching, which makes it strictly equivalent to the current algorithm
@Themandunord Fuzzy matching for extraction has been added in the PR mentioned above I am still working on replacement
Hi !
Thanks for this project :)
It can be cool and amazing if you support the same algorithm from IBM Watson conversation when we activated the Fuzzy Matching Option (https://console.bluemix.net/docs/services/conversation/entities.html#defining-entities).
If you have this feature, your tools can be the best tools for entity extraction !
Thanks :)