seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0
9.21k stars 874 forks source link

The behavior of partial_ratio is unexpected #224

Open refraction-ray opened 5 years ago

refraction-ray commented 5 years ago

fuzzywuzzy == 0.17.0 with Python 3.6 See instances below.

summary = 'the rising field of spin caloritronics focuses on the interactions between spin and heat currents in a magnetic material; the observation of the spin seebeck effect opened the route to this branch of research. this paper reports the results of a round robin test performed by five partners on a single device highlighting the reproducibility problems related to the measurements of the spin seebeck coefficient, the quantity that describes the strength of the spin seebeck effect. this work stimulated the search for more reproducible measurement methods through the analysis of the systematic effects.'

from fuzzywuzzy import fuzz
# below is the group of matching with keywords exactly in the summary text

(
fuzz.partial_ratio("quantity", summary), 
fuzz.partial_ratio("measurements", summary), 
fuzz.partial_ratio("five partners on a single device", summary),
fuzz.partial_ratio("opened", summary), 
fuzz.partial_ratio("and", summary),  
fuzz.partial_ratio("reproducibility problems", summary)
)
# output (100, 33, 38, 17, 0, 33)

# below is the group of matching with keywords not in the summary text
(
fuzz.partial_ratio("noway", summary), 
fuzz.partial_ratio("no such way", summary), 
fuzz.partial_ratio("buy and sell", summary)
)
# output (20,27,33)

I can't fully understand the behavior of partial_ratio, based on the above experiments, it seems the returned matching scores are not so consistent.

refraction-ray commented 5 years ago

Some updates. Firstly, some thoughts on the implementation of partial_ratio in https://github.com/seatgeek/fuzzywuzzy/blob/778162c5a73256745eb6ae22f925bc2dbcf7c894/fuzzywuzzy/fuzz.py#L56-L59 There is the possibility that the only block in blocks is of the form (len(shorter), len(longer), 0), i.e. there is not even one char matched between the two strings. In my humble opinion, the score returned in this case should be directly zero, instead of further comparison as the above code. (Of course, if the blocks returned by get_matching_blocks is correct, then the further comparison above also returns zero which should be consistent.)

Secondly, the unexpected behavior of partial_ratio is from difflib.SequenceMatcher. See instance below.

from difflib import SequenceMatcher
# the same long string
summary = 'the rising field of spin caloritronics focuses on the interactions between spin and heat currents in a magnetic material; the observation of the spin seebeck effect opened the route to this branch of research. this paper reports the results of a round robin test performed by five partners on a single device highlighting the reproducibility problems related to the measurements of the spin seebeck coefficient, the quantity that describes the strength of the spin seebeck effect. this work stimulated the search for more reproducible measurement methods through the analysis of the systematic effects.'

m = SequenceMatcher(None, "and", summary)
m.get_matching_blocks()
# output
# [Match(a=3, b=602, size=0)]
# though "and" is in the long string of summary, the get_matching_blocks function fails to capture it
# and return only one Match with size=0 indicating no match for even one char.

By track this, I finally located the origin of the problem: the autojunk option in difflib.SequenceMatcher class. For long string (len>200), autojunk should set to be False, otherwise, quoted from the docstring

Optional arg autojunk should be set to False to disable the "automatic junk heuristic" that treats popular elements as junk.

By reading the source code of difflib, all chars in the string with occurrence more than 1% would be automatically deleted from the string if autojunk=True which is the default behavior!

In sum, to match keywords in long strings, we should perhaps provide the option of autojunk in the partial_ratio function. After locating the origin of the issue, I find it replicated with https://github.com/seatgeek/fuzzywuzzy/issues/214

rbrand21 commented 5 years ago

Curious if autoJunk will be added as an option to partial_ratio? I have same issue.