rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Error Encoding Large Dataset - Unable to allocate X for an array with shape Y and data type bool #948

Closed migueldft closed 2 years ago

migueldft commented 2 years ago

I am trying to use the apriori algorithm on a large e-commerce dataset. It has around 300k products and 2M orders.

My first step was making a list of products for each order: user_items = df.groupby('sale_order_store_number')['sku_config'].apply(list)
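For reference, a toy version of that step (made-up order numbers and SKUs; column names as in my data):

import pandas as pd

df = pd.DataFrame({
    'sale_order_store_number': [1, 1, 2],
    'sku_config': ['A', 'B', 'C'],
})
user_items = df.groupby('sale_order_store_number')['sku_config'].apply(list)
# user_items is a Series of lists, one per order: [['A', 'B'], ['C']]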

After that I tried to use the TransactionEncoder:

from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

te = TransactionEncoder()
te_ary = te.fit(list(user_items)).transform(list(user_items))
encoded = pd.DataFrame(te_ary, columns=te.columns_)

which gives me the memory error Unable to allocate 899. GiB for an array with shape (2724244, 354208) and data type bool
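The size checks out, since a dense NumPy bool array takes one byte per element:

# 2,724,244 orders x 354,208 distinct products, 1 byte per bool cell
bytes_needed = 2724244 * 354208     # ~9.65e11 bytes
print(bytes_needed / 1024**3)       # ~898.7 GiB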

Is there anything I can do to avoid this kind of problem? Thanks in advance.

rasbt commented 2 years ago

Oh wow, that's a big dataset. Not sure if it will be possible to handle with the current implementation. However, you could try

fit_transform(X, sparse=True)

I.e.,

te = TransactionEncoder()
te_ary = te.fit_transform(list(user_items), sparse=True)

and then, if that worked, maybe use fpmax instead of apriori.
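With sparse=True, the encoder returns a SciPy CSR matrix instead of a dense array, so memory scales with the number of (order, product) pairs actually present rather than with the full 2724244 x 354208 grid. A quick sanity check on toy transactions:

from mlxtend.preprocessing import TransactionEncoder
from scipy import sparse

te = TransactionEncoder()
te_ary = te.fit_transform([['A', 'B'], ['B', 'C']], sparse=True)
print(sparse.issparse(te_ary))  # True: only nonzero entries are stored
print(te_ary.shape)             # (2, 3)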

migueldft commented 2 years ago

The transform part now works perfectly!

But how can I proceed after that? Do I still need to convert it to a DataFrame to apply the apriori or fpmax algorithms? If not, could you provide some example?

Converting it back to a regular (dense) DataFrame leads me to the same memory error.

rasbt commented 2 years ago

Ah, right. You can use a sparse DataFrame. E.g.,

from mlxtend.frequent_patterns import fpmax

df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
frequent_itemsets = fpmax(df, min_support=0.6, use_colnames=True)

I should probably document that.
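In the meantime, a minimal end-to-end sketch (toy transactions; on a dataset your size you'd likely need a much lower min_support than 0.6):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpmax

transactions = [['A', 'B'], ['A', 'B', 'C'], ['B', 'C'], ['A', 'B']]

te = TransactionEncoder()
te_ary = te.fit_transform(transactions, sparse=True)  # CSR matrix

# keep the data sparse all the way through
df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
frequent_itemsets = fpmax(df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)  # here: the maximal itemset {A, B} with support 0.75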