Open refraction-ray opened 5 years ago
Some updates.
Firstly, some thoughts on the implementation of partial_ratio
in https://github.com/seatgeek/fuzzywuzzy/blob/778162c5a73256745eb6ae22f925bc2dbcf7c894/fuzzywuzzy/fuzz.py#L56-L59
There is the possibility that the only block
in blocks
is of the form (len(shorter), len(longer), 0)
, i.e. there is not even one char matched between the two strings. In my humble opinion, the score returned in this case should be directly zero, instead of further comparison as the above code.
(Of course, if the blocks
returned by get_matching_blocks
is correct, then the further comparison above also returns zero which should be consistent.)
Secondly, the unexpected behavior of partial_ratio
is from difflib.SequenceMatcher
. See instance below.
from difflib import SequenceMatcher
# the same long string
summary = 'the rising field of spin caloritronics focuses on the interactions between spin and heat currents in a magnetic material; the observation of the spin seebeck effect opened the route to this branch of research. this paper reports the results of a round robin test performed by five partners on a single device highlighting the reproducibility problems related to the measurements of the spin seebeck coefficient, the quantity that describes the strength of the spin seebeck effect. this work stimulated the search for more reproducible measurement methods through the analysis of the systematic effects.'
m = SequenceMatcher(None, "and", summary)
m.get_matching_blocks()
# output
# [Match(a=3, b=602, size=0)]
# though "and" is in the long string of summary, the get_matching_blocks function fails to capture it
# and return only one Match with size=0 indicating no match for even one char.
By track this, I finally located the origin of the problem: the autojunk
option in difflib.SequenceMatcher
class. For long string (len>200), autojunk
should set to be False, otherwise, quoted from the docstring
Optional arg autojunk should be set to False to disable the "automatic junk heuristic" that treats popular elements as junk.
By reading the source code of difflib, all chars in the string with occurrence more than 1% would be automatically deleted from the string if autojunk=True
which is the default behavior!
In sum, to match keywords in long strings, we should perhaps provide the option of autojunk
in the partial_ratio
function. After locating the origin of the issue, I find it replicated with https://github.com/seatgeek/fuzzywuzzy/issues/214
Curious if autoJunk will be added as an option to partial_ratio
? I have same issue.
fuzzywuzzy == 0.17.0 with Python 3.6 See instances below.
I can't fully understand the behavior of
partial_ratio
, based on the above experiments, it seems the returned matching scores are not so consistent.