rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License

MemoryError: bad allocation for rapidfuzz.process.cdist #302

Closed al-yakubovich closed 1 year ago

al-yakubovich commented 1 year ago

Hi, the following code raises a MemoryError:

from rapidfuzz import process, fuzz
import pandas as pd
import numpy as np

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish', 'Fish'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2, 2]
}
df_test = pd.DataFrame(d_test)
names = df_test["name"]

# Pairwise similarity matrix of every name against every other name
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)

# Keep only pairs scoring above 50 and collapse them into groups
x, y = np.where(scores > 50)
groups = (pd.DataFrame(scores.index[x], scores.index[y])
           .groupby(level=0)
           .agg(frozenset)
           .drop_duplicates()
           .reset_index(drop=True)
           .reset_index()
           .explode("name"))
groups.rename(columns={'index': 'restaurant_id'}, inplace=True)
groups.restaurant_id += 1
df_test = df_test.merge(groups, how="left")

The error is raised on the line scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names) when df_test is replaced with a dataframe of 1 million rows. My PC has 12 GB of free RAM. Any ideas how to avoid this error?

maxbachmann commented 1 year ago

cdist returns a matrix of len(queries) x len(choices) x size(dtype). By default this dtype is float or int32_t depending on the scorer (for the default scorer you are using it is float). So for 1 million names, the result matrix would require around 3.6 terabytes of memory.
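
As a rough sketch of that calculation (assuming 4 bytes per score, which is what a 32-bit float takes):

import numpy as np

n = 1_000_000                                    # number of names
bytes_per_score = np.dtype(np.float32).itemsize  # 4 bytes per float score
total_bytes = n * n * bytes_per_score
print(f"{total_bytes / 2**40:.1f} TiB")          # -> 3.6 TiB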

You will need to process your data in smaller chunks and store the results on disk in between.
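
A minimal sketch of that approach, assuming the score matrix is written chunk by chunk into a memory-mapped file on disk (the file name, chunk size, and sample data here are only illustrative):

import numpy as np
from rapidfuzz import process, fuzz

# Illustrative data; in practice this would be the full list of ~1 million names
names = ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat']
n = len(names)
chunk_size = 4  # choose a chunk size so that chunk_size x n scores fit in RAM

# Memory-mapped result matrix on disk instead of one giant in-memory array
scores = np.memmap("scores.f32", dtype=np.float32, mode="w+", shape=(n, n))

for start in range(0, n, chunk_size):
    end = min(start + chunk_size, n)
    # Compare one slice of the queries against all choices at a time
    scores[start:end, :] = process.cdist(names[start:end], names,
                                         scorer=fuzz.ratio, workers=-1)

scores.flush()

Note that for 1 million names the file itself would still be several terabytes. Since the downstream code only keeps pairs scoring above 50, it may be preferable to extract just those index pairs from each chunk (e.g. with np.where(chunk > 50)) and discard the rest, instead of persisting the full matrix.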