Closed remiadon closed 4 years ago
Hello! I tested your code and it works very well for me. One problem: the script throws an UnboundLocalError: local variable 'cost' referenced before assignment
if the input text ends with a symbol such as a period or colon, or if the text contains such a symbol somewhere.
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 656, in replace_keywords
({}, 0, 0)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 786, in levensthein
yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
[Previous line repeated 6 more times]
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 802, in _levenshtein_rec
yield node, cost, depth
UnboundLocalError: local variable 'cost' referenced before assignment
@spooknik I will try to update it ASAP. For the moment I would say instantiate cost at the beginning of the function (lines 789-791), like
def _levenshtein_rec(self, char, node, word, rows, max_cost, depth=0):
    n_columns = len(word) + 1
    new_rows = [rows[0] + 1]
    cost = 0
This is a mistake on my side; it's usually bad practice to assign a variable inside a scope (a for loop in this case) and access it outside that scope ...
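For readers hitting the same error: the failure mode boils down to a variable that is only ever assigned inside a loop body. A standalone sketch of both the bug and the fix (this is not flashtext's actual code, just the pattern):

```python
def buggy(rows):
    for cost in rows:  # 'cost' only comes into existence if the loop runs
        pass
    return cost        # UnboundLocalError when 'rows' is empty


def fixed(rows):
    cost = 0           # initialize before the loop, as in the patch above
    for cost in rows:
        pass
    return cost


fixed([])  # returns 0 instead of raising
```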
Thanks for the reply and the update.
I've added cost = 0 at line 792, but now there is another error:
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 654, in replace_keywords
current_dict_continued, cost, _ = next(
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 786, in levensthein
yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 801, in _levenshtein_rec
stop_crit = node.keys() & (self._white_space_chars | {self._keyword})
AttributeError: 'str' object has no attribute 'keys'
@spooknik do you have a test case to provide?
Sure thing. Using your version of keyword.py:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')
keyword_processor.replace_keywords('No. of colours: 10', max_cost=60)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 654, in replace_keywords
current_dict_continued, cost, _ = next(
File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 786, in levensthein
yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 801, in _levenshtein_rec
stop_crit = node.keys() & (self._white_space_chars | {self._keyword})
AttributeError: 'str' object has no attribute 'keys'
It works if:
keyword_processor.replace_keywords('No. of colours', max_cost=60)
'Número de colores'
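For context on the AttributeError: flashtext stores keywords in a nested dict (a trie) whose interior nodes are dicts, but whose terminator key maps to a plain string, the replacement text. The traceback suggests the recursion descended into such a leaf. A minimal illustration with a hypothetical mini-trie (names here are illustrative, not flashtext's exact internals):

```python
# Interior nodes are dicts; the value under the keyword terminator
# key is a plain string -- the replacement text.
_keyword = "_keyword_"
trie = {"n": {"o": {_keyword: "Número de colores"}}}

node = trie["n"]["o"][_keyword]  # descended one level too far: a leaf
print(type(node).__name__)       # str
# node.keys() would now raise AttributeError, matching the traceback
```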
@spooknik first things first: in practice max_cost should not be that big; we usually allow around 1 to 3 edits per string if we want to stay robust. Are you sure it makes sense for you to keep a max_cost of 60? Anyway, I'll try to fix it this week.
I think I misunderstood the meaning of this variable. 1-3 seems like a saner value. I just want to catch some of the different spelling variations in English (between UK and US, for example).
Thanks for taking a look :+1:
Hello again, a small update to the bug report. Looking back into this, it seems the problem is related to the max_cost value. For example:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')
keyword_processor.replace_keywords('No. of colours: 10', max_cost=1)
Número de colores: 10
but if I change max_cost=2 (or higher) then it crashes.
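For reference, assuming case-insensitive matching, "colors" and "colours" are only one edit apart, which is why max_cost=1 already finds the match here. A minimal standalone edit-distance sketch (not flashtext code):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    rows = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        new = [i]
        for j, cb in enumerate(b, 1):
            new.append(min(rows[j] + 1,                  # deletion
                           new[j - 1] + 1,               # insertion
                           rows[j - 1] + (ca != cb)))    # substitution
        rows = new
    return rows[-1]

print(levenshtein("colors", "colours"))  # 1 (insert a 'u')
```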
@spooknik I added a test case for this, it should be OK now ;)
Did some quick tests with my data and it works brilliantly now. Thank you so much 👍
Hi, a very good implementation of fuzzy matching! A small proposal: would it be better if max_cost were a ratio rather than a number of characters? A max_cost smaller than 3 works fine for relatively long keywords (more than 8 characters), but when the keyword only contains 3 characters it can basically match anything. It would be great if max_cost could be a max_ratio!
I'll chime in with the solution I found.
I wrote a simple conditional that checks the length of the keyword and then uses the relevant max_cost setting.
if len(term) <= 3:
    keyword_processor.replace_keywords(term)
elif len(term) <= 8:
    keyword_processor.replace_keywords(term, max_cost=1)
else:
    keyword_processor.replace_keywords(term, max_cost=2)
I agree however max_ratio would be more direct perhaps :)
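One way a max_ratio could be mapped onto the existing max_cost parameter is to scale it by the keyword length, so short keywords get a small (often zero) edit budget. A hypothetical helper, not part of flashtext:

```python
import math

def max_cost_from_ratio(keyword: str, max_ratio: float) -> int:
    # Derive an absolute edit-cost budget from a ratio relative to the
    # keyword length; a 3-character keyword at max_ratio=0.3 gets 0,
    # i.e. exact match only.
    return math.floor(len(keyword) * max_ratio)

print(max_cost_from_ratio("fee", 0.3))             # 0
print(max_cost_from_ratio("management fee", 0.3))  # 4
```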
Good suggestion. I don't see exactly what you mean, though. Would the ratio be relative to the length of the current word, or to the length of the keyword? If I set max_ratio to 0.3 and try to extract "I do" from "I doo what I love", in the first case that would not match, because the distance between "do" and "doo" is 1, and divided by 3 (the length of "doo") that gives 33%. In the second case it would match, because a distance of 1 relative to a length of 5 (the length of "I doo") equals 20%, which is less than the max_ratio of 30%.
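The arithmetic in that example can be checked with a small memoized edit-distance function (a standalone sketch, not flashtext code):

```python
from functools import lru_cache

def edit_distance(a: str, b: str) -> int:
    # Recursive Levenshtein distance with memoization over (i, j).
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        return min(d(i - 1, j) + 1,                        # deletion
                   d(i, j - 1) + 1,                        # insertion
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))  # substitution
    return d(len(a), len(b))

dist = edit_distance("do", "doo")     # 1
print(round(dist / len("doo"), 2))    # 0.33 -> rejected at max_ratio=0.3
print(dist / len("I doo"))            # 0.2  -> accepted at max_ratio=0.3
```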
Didn't think of that... Nice example! Based on the example you mentioned, it would make more sense if the max_ratio is relative to the length of the current word, which allows more matches to be returned. If the result is not ideal, we can always adjust the ratio to achieve higher accuracy.
OK I see. @zheyaf do you have a few examples to provide, so I can make sure this leads to smarter fuzzy matching? Concerning the implementation, I should get something working without much effort.
Hi! Is it ok if I send some examples to your email?
@zheyaf my point was to use some of them as unittests potentially, can you use github instead ? so we get all the tracking of the PR in the same interface
@remiadon Hi, sorry, I'll put examples here.
s1 = 'XXX Beheer BV Omschrijving: Managemen t F ee april 2018 IBAN: NL12KNAB0222222222 Valutadatum: 24-04-2018'
s2 = 'XXX VAN DORT Omschrijving: VOORSCHOT SALARIS IBAN: NL34RABO888888888 Valutadatum: 04-02-2019'
keywords = ['sal', 'VPB', 'BTW', 'management fee']
The ideal outcome is an exact match for 'sal', 'VPB', 'BTW', and a fuzzy match for 'management fee' in s1. Thanks!
@zheyaf please provide a minimal example (e.g. strip the useless parts of the strings), including the function calls and parameters for the match to occur
was this PR merged?
@nkrot unfortunately no, I have had no answer from the maintainer, even though some people seem to be interested in this feature
Hi,
sorry for so much delay on this, was busy with other stuff going on in life right now. Will take out time this weekend and get this done before Monday IST.
Thanks for your patience.
merged to master.
Thanks everyone for all the help with this, sorry for all the delay. Was inactive on GitHub for a long time because of some personal reasons and life 😄
@spooknik just out of curiosity, what was your exact example?
Hello @vi3k6i5, would it be possible to build and push this commit to PyPI too? This is a very useful feature and I'm sure many developers would be more than excited to see it on PyPI, me included 😃
@remiadon @vi3k6i5 Can you provide an example of how to enable the fuzzy match feature in the API?
@lingvisa you can have a look at the "files changed" section of this PR, especially test/test_extract_fuzzy.py.
Aside from this, I think we still have to make some effort to make fuzzy matching really usable. The main part to work on is allowing a max_ratio arg to be passed instead of the current max_cost, which for small keywords matches almost anything.
@remiadon Have you also tested the performance of fuzzy matching compared to exact matching? Presumably it will be several times slower?
@lingvisa I believe so, yes
This PR tries to introduce fuzzy matching in flashtext, as mentioned in these issues:
Features included
1) KeywordProcessor.extract_keywords and KeywordProcessor.replace_keywords both have a new optional parameter: max_cost, which is the maximum Levenshtein distance accepted to perform fuzzy matching on a single keyword
2) KeywordProcessor implements a levensthein function, which tries to find a match with a given word, with respect to a provided max_cost, and returns a node in the trie, which will be used to continue the search
3) a new function has been included: get_next_word, which just retrieves the next word in the sequence. Tests are included in test/test_kp_next_word.py
Optimizations to keep focus on performance
We keep track of a current cost (initialized to max_cost) every time we trigger fuzzy matching (on a word), so if all the cost from max_cost has already been "consumed" by other words in the current keyword, we do not trigger fuzzy matching. E.g. when trying to extract the keyword "here you are" from "heere you are" with a max_cost of 1, the current cost gets entirely consumed after the first word ("heere"), so no fuzzy matching will be performed on the other words ("you" and "are").
Limitations