rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License

Updated Readme #324

Closed dheerajck closed 1 year ago

dheerajck commented 1 year ago

Updated readme

Added examples of WRatio and QRatio, and updated score values.
Added examples of string preprocessing.

dheerajck commented 1 year ago

Don't you think the parameter name processor can be confusing, and that something like string_preprocessor would be a better name?

maxbachmann commented 1 year ago

Don't you think the parameter name processor can be confusing, and that something like string_preprocessor would be a better name?

I agree it is not a perfect name. The naming stems from fuzzywuzzy using the named argument processor in their process.* APIs. I added the argument to every scorer, which in hindsight wasn't a great idea. It saves the user very little typing:

Levenshtein.distance(s1, s2, processor=utils.default_process)

vs

Levenshtein.distance(utils.default_process(s1), utils.default_process(s2))

In addition, the performance difference is pretty small. For short sequences (fewer than 16 characters) the second implementation appears a couple of percent faster, and for longer ones calling the processor internally appears to be around 10% faster. So it only makes a difference when working with very fast scorers like Prefix/Postfix/Hamming and long sequences. Even then, when comparing multiple sequences you're better off using the scorer with the process.* APIs.

For the process.* APIs it is a different story, since (1) it saves more typing and (2) I am able to call the preprocessing function in a more performant way.

For these reasons I was actually playing with the thought of deprecating the processor argument in scorers.