Open gamesguru opened 5 years ago
You should provide examples for people to reproduce your test. Some of the interesting points:
Hi, you could obtain data for example here, split by space into list: https://www.damienelliott.com/wp-content/uploads/2020/07/Lorem-ipsum-dolor-sit-amet.txt
As for the other points,
token_set_ratio
, and though it appears to be the slowest.. only by a factor of 1.5-2x, which means ratio
or simple_ratio
is still taking up to tens of seconds.python-Levenshtein
. I dispelled any such warning messages early on.I performed a quick test using the following test code:
setup="""
from {} import process, fuzz
with open("Lorem-ipsum-dolor-sit-amet.txt") as fw:
text = fw.read()
words = text.split()
query = words[0]
words = words[1:]
"""
print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("rapidfuzz"), number=1
))
print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None, score_cutoff=80)", setup=setup.format("rapidfuzz"), number=1
))
print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("fuzzywuzzy"), number=1
))
This compares the runtime for FuzzyWuzzy and an improved version of these algorithms from RapidFuzz The runtime I got was:
RapidFuzz: 1.22120649999124 sec
FuzzyWuzzy: 8.835493099992163 sec
So it might be enough for your requirements to use RapidFuzz. FuzzySort appears to use a completely different algorithm, that is not based on the Levenshtein distance. So it might be an option to add a similar algorithm.
takes 30 seconds to process 1.1 million product names
the npm library fuzzysort is much faster currently