Closed al-yakubovich closed 1 year ago
`cdist` returns a matrix of `len(queries) x len(choices) x size(dtype)`. By default this dtype is `float` or `int32_t` depending on the scorer (for the default scorer you are using it is `float`). So for 1 million names, the result matrix would require around 3.6 terabytes of memory.
You will need to process your data in smaller chunks and store the results on disk in between.
Hi, the following code gives a `MemoryError` on line:

scores = pd.DataFrame(rapidfuzz.process.cdist(names, names, workers=-1), columns=names, index=names)

if `df_test` is replaced with a dataframe with 1 million rows. My PC has 12 GB of free RAM. Any ideas how to avoid this error?