Open pagpires opened 4 years ago
I ran into this. It can cause two strings that have only the word "the" in common to get an 86-point match with WRatio / process.extract, which is not very useful.
In [6]: fuzz.partial_token_sort_ratio("physics 2 vid", "study physics physics 2")
...: fuzz.partial_token_sort_ratio("physics 2 vid", "study physics physics 2 video")
...: fuzz.partial_token_set_ratio("physics 2 vid", "study physics physics 2")
...: fuzz.partial_token_set_ratio("physics 2 vid", "study physics physics 2 video")
...: fuzz.token_set_ratio("physics 2 vid", "study physics physics 2")
...: fuzz.token_set_ratio("physics 2 vid", "study physics physics 2 video")
...: fuzz.partial_ratio("physics 2 vid", "study physics physics 2")
...: fuzz.partial_ratio("physics 2 vid", "study physics physics 2 video")
Out[6]: 77
Out[6]: 77
Out[6]: 100
Out[6]: 100
Out[6]: 82
Out[6]: 82
Out[6]: 69
Out[6]: 100
This behavior seems broken to me, as only partial_ratio sees "study physics physics 2 video" as the better alternative.
According to the algorithm, partial_token_set_ratio currently returns 100 for any two strings a and b as long as they share at least one token, right? Is this behavior expected?
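To make the question concrete, here is a minimal sketch of why a single shared token would be enough. It assumes the token-set step works the way I read it: both strings are split into token sets, and the sorted intersection is compared against "intersection + remainder" for each side. The helper names `token_set_parts` and `saturates` are hypothetical and not part of fuzzywuzzy, and a plain substring check stands in for the real partial_ratio.

```python
def token_set_parts(a, b):
    """Build the three strings a token-set comparison would work with.

    Assumption: tokens are lowercased and split on whitespace only;
    the real library does additional preprocessing.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    shared = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    # The shared tokens always come first, so `shared` is a prefix of both.
    combined_a = (shared + " " + only_a).strip()
    combined_b = (shared + " " + only_b).strip()
    return shared, combined_a, combined_b


def saturates(a, b):
    """True when the shared part is non-empty and is contained in both
    combined strings, i.e. when a substring-based score would hit its maximum."""
    shared, combined_a, combined_b = token_set_parts(a, b)
    return bool(shared) and shared in combined_a and shared in combined_b


print(token_set_parts("physics 2 vid", "study physics physics 2"))
print(saturates("the cat", "the dog"))
print(saturates("cat", "dog"))
```

If that reading is right, the shared part is always a perfect substring of each combined string, so any partial (best-substring) score between them is at its maximum regardless of how different the remaining tokens are.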