seatgeek / thefuzz

Fuzzy String Matching in Python
MIT License

Inaccurate search in a large pandas table #14

Open DimIsaev opened 2 years ago

DimIsaev commented 2 years ago

How can I optimize the search for a similar row in a large data table (~1 million records)?

Would pre-vectorizing each row help?

%%time
process.extractOne(text, df['title'])


maxbachmann commented 2 years ago

Probably the easiest way to improve the performance would be to use RapidFuzz instead of thefuzz, which uses a significantly faster implementation of the same algorithm. Since it is mostly API compatible, it only requires you to change the import line from "from thefuzz import process" to "from rapidfuzz import process".
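
A minimal sketch of that swap, in case it helps. The titles, the query, the best_match helper and the cutoff of 60 are made-up illustrations, not taken from the original data. One difference to be aware of: RapidFuzz's extractOne returns a (choice, score, index) tuple and applies no preprocessing by default, so the sketch passes utils.default_process explicitly. The difflib branch is only a slow stand-in with different scoring, so the snippet still runs where RapidFuzz is not installed.

```python
import difflib

try:
    # RapidFuzz mirrors most of thefuzz's `process` API.
    from rapidfuzz import fuzz, process, utils
    HAVE_RAPIDFUZZ = True
except ImportError:
    HAVE_RAPIDFUZZ = False


def best_match(query, choices, score_cutoff=60):
    """Return (choice, score, index) for the closest choice, or None.

    `best_match` and the default cutoff are hypothetical, for illustration.
    """
    if HAVE_RAPIDFUZZ:
        return process.extractOne(
            query,
            choices,
            scorer=fuzz.WRatio,
            processor=utils.default_process,  # lowercase, strip non-alphanumerics
            score_cutoff=score_cutoff,        # matches scoring below this are dropped
        )
    # Fallback only: standard-library difflib, much slower, different scores.
    scored = [
        (c, 100 * difflib.SequenceMatcher(None, query.lower(), c.lower()).ratio(), i)
        for i, c in enumerate(choices)
    ]
    best = max(scored, key=lambda item: item[1])
    return best if best[1] >= score_cutoff else None


# Made-up sample data standing in for df['title'].
titles = [
    "The Lord of the Rings",
    "The Hobbit",
    "Harry Potter and the Goblet of Fire",
]
match = best_match("lord of the rings", titles)
print(match)
```

With a DataFrame, passing df['title'].tolist() as the choices should behave the same way, and the returned index can then be used to look the row up again.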