Closed andybrnr closed 1 year ago
I agree that this inconsistency is confusing and you are not the first one surprised by the default behavior of the preprocessing function. This library is based upon fuzzywuzzy
and I wanted the behavior to stay as close as possible to make it a drop in replacement. For this reason I selected the same default behavior for preprocessing when implementing fuzz.*
/process.extract
/process.extractOne
/process.extract_iter
. Later I added more algorithms, where I did not consider preprocessing a good default and so I started to disable it in any functions which do not exist in fuzzywuzzy
.
I was already thinking about simply changing the default everywhere to processor=None
in the future to make this more consistent. This would be another difference to the way fuzzywuzzy
behaves, but I think it would make using the library easier in the future and would ultimately be worth it. In addition it is a breaking change and therefor can not occur before version 3.0. I should be able to add a warning for anyone currently relying on the default behavior, so they can change the way they call these functions.
Thanks for the response, @maxbachmann (and for implementing and maintaining this fantastic library)! processor=None
default across the board seems like a good solution, and understand the desire for gradual migration away from fuzzywuzzy
default behavior.
This is implemented in https://github.com/maxbachmann/RapidFuzz/pull/317
Awesome! Thanks so much for continuing to develop this.
This isn't a bug, per se - more default behavior inconsistencies that caused confusion. I generated a distance matrix using from two lists of strings by means of
process.cdist(list1, list2, scorer=fuzz.WRatio)
. I was initially perplexed to see that several entries in the resulting matrix differed from directly invoking fuzz.WRatio, i.e. cdist_matrix[n, m] != fuzz.WRatio(list1[n], list2[m]). After reading the docs again, I realized that the default argument isprocessor=None
forprocess.cdist
, whileprocessor=utils.default_process
is the default when directly callingfuzz.WRatio(item1, item2)
. It would be a little nicer ifcdist
would automatically pickup theprocessor
default for a givenscorer
(if it is notNone
), but otherwise perhaps having a universal default ofprocessor=None
would create less confusion. I guess "RTFM" is also a fair response, tho :)