vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License

Fuzzy matching #84

Closed remiadon closed 4 years ago

remiadon commented 5 years ago

This PR tries to introduce fuzzy matching in flashtext, as mentioned in these issues:


Guidelines

Features included

1) KeywordProcessor.extract_keywords and KeywordProcessor.replace_keywords both have a new optional parameter, max_cost, which is the maximum Levenshtein distance accepted when fuzzy matching a single keyword.
2) KeywordProcessor implements a levensthein function, which tries to find a match for a given word with respect to the provided max_cost, and returns a node in the trie from which the search continues.
3) A new function, get_next_word, has been included, which simply retrieves the next word in the sequence. Tests are included in test/test_kp_next_word.py.
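The levensthein helper described above walks the trie while carrying one row of the classic dynamic-programming table per consumed character. A minimal, self-contained sketch of that row-by-row computation (illustrative names, not the PR's actual code):

```python
def levenshtein_row(prev_row, char, word):
    """Extend the DP row for the prefix consumed so far by one character."""
    new_row = [prev_row[0] + 1]
    for i, w in enumerate(word, start=1):
        insert_cost = new_row[i - 1] + 1
        delete_cost = prev_row[i] + 1
        replace_cost = prev_row[i - 1] + (0 if w == char else 1)
        new_row.append(min(insert_cost, delete_cost, replace_cost))
    return new_row

def levenshtein(a, b):
    """Full edit distance, obtained by folding levenshtein_row over `a`."""
    row = list(range(len(b) + 1))
    for ch in a:
        row = levenshtein_row(row, ch, b)
    return row[-1]
```

In the trie version, each node stores the row for its prefix, so candidate branches whose minimum row value already exceeds max_cost can be pruned early.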

Optimizations to keep focus on performance

Limitations

coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.1%) to 99.43% when pulling 9cc6d0b7363009006644cf9a36cd429c951c9d3e on Quantmetry:fuzzy_matching into 50c45f1f4a394572381249681046f57e2bf5a591 on vi3k6i5:master.

spooknik commented 4 years ago

Hello! I tested your code and it works very well for me. One problem: the script throws an UnboundLocalError: local variable 'cost' referenced before assignment if the input text ends with a symbol like a period, colon, etc., or if the text contains such a symbol somewhere.

File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 656, in replace_keywords
    ({}, 0, 0)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 786, in levensthein
    yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
    yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
    yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 806, in _levenshtein_rec
    yield from self._levenshtein_rec(new_char, new_node, word, new_rows, max_cost, depth=depth + 1)
  [Previous line repeated 6 more times]
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 802, in _levenshtein_rec
    yield node, cost, depth
UnboundLocalError: local variable 'cost' referenced before assignment
remiadon commented 4 years ago

@spooknik I will try to update it ASAP. For the moment I would suggest initialising cost at the beginning of the function (lines 789-791), like

def _levenshtein_rec(self, char, node, word, rows, max_cost, depth=0): 
    n_columns = len(word) + 1
    new_rows = [rows[0] + 1]
    cost = 0

This is a mistake on my side; it's bad practice to bind a variable inside a scope (a for loop in this case) and then access it outside that scope ...
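The pitfall can be reduced to a few lines: a name first bound inside a `for` loop stays unbound when the loop body never executes, so reading it afterwards raises UnboundLocalError. A hypothetical reduction of the bug, not the actual flashtext code:

```python
def buggy(items):
    for cost in items:  # if `items` is empty, `cost` is never bound
        pass
    return cost         # UnboundLocalError on empty input

def fixed(items):
    cost = 0            # bound before the loop, as suggested above
    for cost in items:
        pass
    return cost         # always defined, even for empty input
```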

spooknik commented 4 years ago

Thanks for the reply and the update. I've added cost = 0 at line 792, but now there is another error:

  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 654, in replace_keywords
    current_dict_continued, cost, _ = next(
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 786, in levensthein
    yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\flashtext\keyword.py", line 801, in _levenshtein_rec
    stop_crit = node.keys() & (self._white_space_chars | {self._keyword})
AttributeError: 'str' object has no attribute 'keys'
remiadon commented 4 years ago

@spooknik do you have a test case to provide?

spooknik commented 4 years ago

Sure thing. Using your version of keyword.py:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')
keyword_processor.replace_keywords('No. of colours: 10', max_cost=60)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 654, in replace_keywords
    current_dict_continued, cost, _ = next(
  File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 786, in levensthein
    yield from self._levenshtein_rec(char, node, word, rows, max_cost, depth=1)
  File "/usr/lib/python3.8/site-packages/flashtext/keyword.py", line 801, in _levenshtein_rec
    stop_crit = node.keys() & (self._white_space_chars | {self._keyword})
AttributeError: 'str' object has no attribute 'keys'

It works if:

keyword_processor.replace_keywords('No. of colours', max_cost=60)
'Número de colores'
remiadon commented 4 years ago

@spooknik first things first: in practice max_cost should not be that big; we usually allow around 1 to 3 edits on strings if we want to stay robust. Are you sure it makes sense for you to keep a max_cost of 60? Anyway, I'll try to fix it this week.

spooknik commented 4 years ago

I think I misunderstood the meaning of this variable. 1-3 seems like a more sane value. I just want to catch some of the different spelling variations in English (between UK and US, for example).

Thanks for taking a look :+1:

spooknik commented 4 years ago

Hello again, a small update to the bug report. Looking back into this, it seems the problem is related to the max_cost value. For example:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('No. of Colors', 'Número de colores')
keyword_processor.replace_keywords('No. of colours: 10', max_cost=1)
Número de colores: 10

but if I change max_cost=2 (or higher), then it crashes.

remiadon commented 4 years ago

@spooknik I added a test case for this, it should be OK now ;)

spooknik commented 4 years ago

Did some quick tests with my data and it works brilliantly now. Thank you so much 👍

zheyaf commented 4 years ago

Hi, a very good implementation of fuzzy matching! A small proposal: would it be better if max_cost were a ratio rather than a number of characters? A max_cost smaller than 3 works fine for relatively long keywords (more than 8 characters), but when the keyword contains only 3 characters it can basically match anything. It would be great if max_cost could be a max_ratio!

spooknik commented 4 years ago

I'll chime in with my solution that I found.

I wrote a simple conditional to check the length of the keyword and then use the relevant max_cost setting.

if len(term) <= 3:
    keyword_processor.replace_keywords(term)
elif len(term) <= 8:
    keyword_processor.replace_keywords(term, max_cost=1)
else:
    keyword_processor.replace_keywords(term, max_cost=2)

I agree however max_ratio would be more direct perhaps :)
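The conditional above can be folded into a small helper that picks a max_cost from the keyword length; the thresholds are the ones used above and purely illustrative:

```python
def max_cost_for(term):
    """Length-based heuristic: longer keywords tolerate more edits."""
    if len(term) <= 3:
        return 0  # short keywords: exact match only
    elif len(term) <= 8:
        return 1
    return 2

# usage (sketch): keyword_processor.replace_keywords(term, max_cost=max_cost_for(term))
```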

remiadon commented 4 years ago

Hi, a very good implementation of fuzzy matching! A small proposal: would it be better if max_cost were a ratio rather than a number of characters? A max_cost smaller than 3 works fine for relatively long keywords (more than 8 characters), but when the keyword contains only 3 characters it can basically match anything. It would be great if max_cost could be a max_ratio!

Good suggestion. I don't see exactly what you mean, though. Would the ratio be relative to the length of the current word, or to the length of the keyword? Say I set max_ratio to 0.3 and try to extract "I do" from "I doo what I love". In the first case it would not match, because the distance between "do" and "doo" is 1; divided by 3 (the length of "doo") we get 33%. In the second case it would match, because a distance of 1 relative to a length of 5 (the length of "I doo") is 20%, which is less than the max_ratio of 30%.
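The two conventions in the example above can be written down directly (max_ratio is the proposed parameter, not an existing flashtext argument):

```python
def within_ratio(distance, length, max_ratio):
    """Accept a match when the relative edit distance stays under the threshold."""
    return distance / length <= max_ratio

# distance("do", "doo") == 1
assert not within_ratio(1, len("doo"), 0.3)   # 1/3 ~ 33% > 30%: rejected
assert within_ratio(1, len("I doo"), 0.3)     # 1/5 = 20% <= 30%: accepted
```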

zheyaf commented 4 years ago

Didn't think of that... Nice example! Based on it, it would make more sense for max_ratio to be relative to the length of the current word, which allows more matches to be returned. If the results are not ideal, we can always adjust the ratio to achieve higher accuracy.

remiadon commented 4 years ago

OK, I see. @zheyaf do you have a few examples to provide, so I can make sure this leads to smarter fuzzy matching? Concerning the implementation, I should get something working without much effort.

zheyaf commented 4 years ago

Hi! Is it OK if I send some examples to your email?

remiadon commented 4 years ago

@zheyaf my point was to potentially use some of them as unit tests; can you use GitHub instead, so we keep all the tracking of the PR in the same interface?

zheyaf commented 4 years ago

@remiadon Hi, sorry, I'll put examples here.

s1 = 'XXX Beheer BV Omschrijving: Managemen t F ee april 2018 IBAN: NL12KNAB0222222222 Valutadatum: 24-04-2018'
s2 = 'XXX VAN DORT Omschrijving: VOORSCHOT SALARIS IBAN: NL34RABO888888888 Valutadatum: 04-02-2019'
keywords = ['sal', 'VPB', 'BTW', 'management fee']

The ideal match is exact match for 'sal', 'VPB', 'BTW', fuzzy match for 'management fee' in s1. Thanks!

remiadon commented 4 years ago

@zheyaf please provide a minimal example (e.g. strip the useless parts of the strings), including the function calls and parameters needed for the match to occur

nkrot commented 4 years ago

was this PR merged?

remiadon commented 4 years ago

@nkrot unfortunately not; I have had no answer from the maintainer, even though some people seem to be interested in this feature

vi3k6i5 commented 4 years ago

Hi,

sorry for so much delay on this; I was busy with other things going on in life. I will set aside time this weekend and get this done before Monday IST.

Thanks for your patience.

vi3k6i5 commented 4 years ago

merged to master.

vi3k6i5 commented 4 years ago

Thanks everyone for all the help with this, and sorry for all the delay. I was inactive on GitHub for a long time because of some personal reasons and life 😄

remiadon commented 4 years ago

I'll chime in with my solution that I found.

I wrote a simple conditional to check the length of the keyword and then use the relevant max_cost setting.

if len(term) <= 3:
    keyword_processor.replace_keywords(term)
elif len(term) <= 8:
    keyword_processor.replace_keywords(term, max_cost=1)
else:
    keyword_processor.replace_keywords(term, max_cost=2)

I agree however max_ratio would be more direct perhaps :)

@spooknik just out of curiosity, what was your exact example?

bennyhawk commented 3 years ago

Hello @vi3k6i5, would it be possible to build and push this commit to PyPI too? This is a very useful feature and I'm sure many developers would be more than excited to see it on PyPI, me included 😃

lingvisa commented 2 years ago

@remiadon @vi3k6i5 Can you provide an example of how to enable the fuzzy matching feature in the API?

remiadon commented 2 years ago

@lingvisa you can have a look at the "files changed" section of this PR, especially test/test_extract_fuzzy.py

Aside from this, I think we still have to make some effort to make fuzzy matching really usable. The main thing to work on is allowing a max_ratio argument to be passed instead of the current max_cost, which for small keywords matches almost everything.
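Until such an argument exists, a per-keyword ratio can be translated into the integer max_cost the current API accepts; cost_from_ratio is a hypothetical helper written here for illustration, not part of flashtext:

```python
import math

def cost_from_ratio(keyword, max_ratio):
    """Derive an integer edit budget from a relative threshold."""
    return math.floor(len(keyword) * max_ratio)
```

With max_ratio=0.2 this yields a budget of 2 edits for "management fee" but 0 edits for a 3-character keyword like "sal", avoiding the everything-matches problem for short keywords.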

lingvisa commented 2 years ago

@remiadon Have you also tested the performance of fuzzy matching compared to exact matching? Presumably it will be several times slower?

remiadon commented 2 years ago

@lingvisa I believe so, yes