seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0

process.extractOne does not match fuzz.ratio #288

Open Pedro-Saad opened 3 years ago

Pedro-Saad commented 3 years ago

process.extract (or process.extractOne) and fuzz.ratio give different results in this case:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

stringToMatch = 'Florinia-SP'
possibleResults = ['São Bernado do Campo-SP', 'Florínea-SP']
# Score each candidate individually with fuzz.ratio
print(fuzz.ratio(stringToMatch, possibleResults[0]))
print(fuzz.ratio(stringToMatch, possibleResults[1]))
# Score all candidates at once with process.extract
print(process.extract(stringToMatch, possibleResults))

While the individual fuzz.ratio calls give correct results (41 for the lower score and 82 for the higher score), process.extract gives 86 for both of them.


maxbachmann commented 3 years ago

These are the docs of process.extract:

    Select the best match in a list or dictionary of choices.

    Find best matches in a list or dictionary of choices, return a
    list of tuples containing the match and its score. If a dictionary
    is used, also returns the key for each match.

    Arguments:
        query: An object representing the thing we want to find.
        choices: An iterable or dictionary-like object containing choices
            to be matched against the query. Dictionary arguments of
            {key: value} pairs will attempt to match the query against
            each value.
        processor: Optional function of the form f(a) -> b, where a is the
            query or individual choice and b is the choice to be used in
            matching.

            This can be used to match against, say, the first element of
            a list:

            lambda x: x[0]

            Defaults to fuzzywuzzy.utils.full_process().
        scorer: Optional function for scoring matches between the query and
            an individual processed choice. This should be a function
            of the form f(query, choice) -> int.
            By default, fuzz.WRatio() is used and expects both query and
            choice to be strings.
        limit: Optional maximum for the number of elements returned.
            Defaults to 5.

    Returns:
        List of tuples containing the match and its score.

        If a list is used for choices, then the result will be 2-tuples.
        If a dictionary is used, then the result will be 3-tuples containing
        the key for each match.

        For example, searching for 'bird' in the dictionary

        {'bard': 'train', 'dog': 'man'}

        may return

        [('train', 22, 'bard'), ('man', 0, 'dog')]

They state that the default scorer for process.extract is fuzz.WRatio, which gives different results than fuzz.ratio. If you want to use fuzz.ratio, you can specify it with the scorer argument. Besides this, fuzz.ratio does not preprocess strings before matching them, while process.extract preprocesses them by default using fuzzywuzzy.utils.full_process(). So if you want results similar to fuzz.ratio, this behaviour should be disabled using the processor argument:

process.extract(stringToMatch, possibleResults, scorer=fuzz.ratio, processor=None)

Other process functions like process.extractOne use similar defaults.
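
To illustrate why the preprocessing step alone can change a score, here is a rough standard-library sketch. It does not use fuzzywuzzy itself: rough_process is a hypothetical helper that only approximates what full_process does (replace non-word characters with spaces, lowercase, strip), and simple_ratio is a difflib-based stand-in for fuzz.ratio, so treat the numbers as illustrative rather than as the library's exact output.

```python
import re
from difflib import SequenceMatcher


def rough_process(s):
    """Hypothetical approximation of fuzzywuzzy.utils.full_process:
    replace non-word characters with spaces, lowercase, and strip."""
    return re.sub(r"\W", " ", s).lower().strip()


def simple_ratio(a, b):
    """Similarity on a 0-100 scale via difflib (stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


query = "Florinia-SP"
choice = "florinia sp"

# Without preprocessing, case and punctuation differences lower the score.
raw_score = simple_ratio(query, choice)

# After preprocessing, both strings become identical, so the score is 100.
processed_score = simple_ratio(rough_process(query), rough_process(choice))

print(rough_process(query))  # florinia sp
print(raw_score, processed_score)
```

The same pair of strings scores differently depending on whether it is preprocessed first, which is why passing processor=None matters when comparing process.extract against a direct fuzz.ratio call.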