rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.71k stars 119 forks source link

Use default processor when supplying a scorer to process.cdist #306

Closed andybrnr closed 1 year ago

andybrnr commented 1 year ago

This isn't a bug, per se - more default behavior inconsistencies that caused confusion. I generated a distance matrix using from two lists of strings by means of process.cdist(list1, list2, scorer=fuzz.WRatio). I was initially perplexed to see that several entries in the resulting matrix differed from directly invoking fuzz.WRatio, i.e. cdist_matrix[n, m] != fuzz.WRatio(list1[n], list2[m]). After reading the docs again, I realized that the default argument is processor=None for process.cdist, while processor=utils.default_process is the default when directly calling fuzz.WRatio(item1, item2). It would be a little nicer if cdist would automatically pickup the processor default for a given scorer (if it is not None), but otherwise perhaps having a universal default of processor=None would create less confusion. I guess "RTFM" is also a fair response, tho :)

maxbachmann commented 1 year ago

I agree that this inconsistency is confusing and you are not the first one surprised by the default behavior of the preprocessing function. This library is based upon fuzzywuzzy and I wanted the behavior to stay as close as possible to make it a drop in replacement. For this reason I selected the same default behavior for preprocessing when implementing fuzz.*/process.extract/process.extractOne/process.extract_iter. Later I added more algorithms, where I did not consider preprocessing a good default and so I started to disable it in any functions which do not exist in fuzzywuzzy.

I was already thinking about simply changing the default everywhere to processor=None in the future to make this more consistent. This would be another difference to the way fuzzywuzzy behaves, but I think it would make using the library easier in the future and would ultimately be worth it. In addition it is a breaking change and therefor can not occur before version 3.0. I should be able to add a warning for anyone currently relying on the default behavior, so they can change the way they call these functions.

andybrnr commented 1 year ago

Thanks for the response, @maxbachmann (and for implementing and maintaining this fantastic library)! processor=None default across the board seems like a good solution, and understand the desire for gradual migration away from fuzzywuzzy default behavior.

maxbachmann commented 1 year ago

This is implemented in https://github.com/maxbachmann/RapidFuzz/pull/317

andybrnr commented 1 year ago

Awesome! Thanks so much for continuing to develop this.