rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License

`process.extractBests` and usage of `__str__` #332

Closed banagale closed 1 year ago

banagale commented 1 year ago

I am trying to use rapidfuzz as a drop-in replacement in a project that depends on the last version of fuzzywuzzy released before the name change. This is needed after hitting this issue.

The project uses process.extractBests. I noticed that rapidfuzz does not include process.extractBests.

Is process.extract a drop-in replacement for that old function?

I tried using process.extract and realized that the project relies on fuzzywuzzy reading the __str__ of the objects passed in the choices argument. Later in the code, the returned match is used as an object. (This let the developer work with the object directly while its string form was used for comparison.)

rapidfuzz does not seem to look at a given __str__ for an object. Is this on purpose? Or perhaps FW should not have done this?

I mention the two above because I believe the goal is for the FW api to be fully available in RF. I do not know if the above use of the FW api was unusual or an anti-pattern though.

maxbachmann commented 1 year ago

fuzzywuzzy has both extract and extractBests; the difference is that extractBests takes an additional score_cutoff parameter. RapidFuzz only has the extract function, which does provide the score_cutoff argument and so is equivalent to extractBests.

RapidFuzz and fuzzywuzzy differ in a few ways. In your specific case I assume you are using a scorer like WRatio, which in fuzzywuzzy defaults to force_ascii=True. Your strings are therefore preprocessed by utils.full_process with force_ascii=True, which runs str(sequence). RapidFuzz does not support this behaviour, so you will need to perform the conversion yourself. This can be done e.g. like this:

process.extract(query, choices, processor=str)

or in case you want to use the preprocessing function:

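from rapidfuzz import process, utils

def preprocess(seq):
    # Convert the object to its string form, then apply the default preprocessing.
    return utils.default_process(str(seq))

process.extract(query, choices, processor=preprocess)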

banagale commented 1 year ago

Thank you for that feedback, Max!

I'll have another run at this and will re-open this issue if I run into difficulty. I saw #333 and appreciate that. I had seen #26 and presumed the function had previously existed but was obviated.