seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0
9.21k stars 874 forks source link

process is slow #239

Open gamesguru opened 5 years ago

gamesguru commented 5 years ago

takes 30 seconds to process 1.1 million product names

the npm library fuzzysort is much faster currently

maxbachmann commented 3 years ago

You should provide examples for people to reproduce your test. Some of the interesting points:

gamesguru commented 3 years ago

Hi, you could obtain data for example here, split by space into list: https://www.damienelliott.com/wp-content/uploads/2020/07/Lorem-ipsum-dolor-sit-amet.txt

As for the other points,

  1. I used token_set_ratio, and though it appears to be the slowest.. only by a factor of 1.5-2x, which means ratio or simple_ratio is still taking up to tens of seconds.
  2. Yes, I have installed python-Levenshtein. I dispelled any such warning messages early on.
  3. Not exactly sure, afaik she has implemented a custom in-place algorithm: https://github.com/farzher/fuzzysort/blob/master/fuzzysort.js
maxbachmann commented 3 years ago

I performed a quick test using the following test code:

setup="""
from {} import process, fuzz

with open("Lorem-ipsum-dolor-sit-amet.txt") as fw:
  text = fw.read()

words = text.split()
query = words[0]
words = words[1:]
"""

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("rapidfuzz"), number=1
))

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None, score_cutoff=80)", setup=setup.format("rapidfuzz"), number=1
))

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("fuzzywuzzy"), number=1
))

This compares the runtime for FuzzyWuzzy and an improved version of these algorithms from RapidFuzz The runtime I got was:

RapidFuzz: 1.22120649999124 sec
FuzzyWuzzy: 8.835493099992163 sec

So it might be enough for your requirements to use RapidFuzz. FuzzySort appears to use a completely different algorithm, that is not based on the Levenshtein distance. So it might be an option to add a similar algorithm.