rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Is there a way to use the TransactionEncoder and FP Growth with a large CSV? #584

Open jnguyen32 opened 5 years ago

jnguyen32 commented 5 years ago

Per the example:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

What's the best way to use chunking to get a transformed DataFrame that can be used for rule mining?

rasbt commented 5 years ago

Currently, the implementations of the TransactionEncoder and frequent itemset mining algorithms don't support chunking.

What may help, though, is using a sparse representation for frequent itemset and rule mining. For example, if you call .transform(X, sparse=True) on the TransactionEncoder, it returns a SciPy sparse matrix, which you can then wrap in a sparse pandas DataFrame.
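With the toy dataset above, that could look roughly like this (the min_support value is arbitrary, and pd.DataFrame.sparse.from_spmatrix needs a reasonably recent pandas):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# dataset as defined in the example above
te = TransactionEncoder()
# sparse=True returns a SciPy CSR matrix instead of a dense NumPy array
te_ary = te.fit(dataset).transform(dataset, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)

# fpgrowth accepts the sparse one-hot DataFrame directly
frequent_itemsets = fpgrowth(sparse_df, min_support=0.6, use_colnames=True)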

rasbt commented 5 years ago

It just occurred to me that something like Dask DataFrames, which have out-of-core support, could also work, but I have not tested this -- currently, we only test against pandas DataFrames.
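In the meantime, here is a rough, untested sketch of encoding a large CSV in chunks and stacking the sparse pieces. The read_transactions helper and the one-transaction-per-line, comma-separated file layout are just assumptions for illustration, and note that fpgrowth itself still needs the full one-hot matrix in memory, just in sparse form:

import pandas as pd
from scipy.sparse import vstack
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

def read_transactions(path, chunksize=100_000):
    # Hypothetical layout: one transaction per line, items comma-separated.
    with open(path) as f:
        chunk = []
        for line in f:
            chunk.append(line.rstrip('\n').split(','))
            if len(chunk) == chunksize:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

# Pass 1: collect the full item vocabulary. TransactionEncoder.fit has no
# partial_fit, so fitting chunk by chunk would only keep the last chunk's items.
vocab = set()
for chunk in read_transactions('transactions.csv'):
    for transaction in chunk:
        vocab.update(transaction)

te = TransactionEncoder()
te.fit([sorted(vocab)])  # one synthetic transaction containing every item

# Pass 2: encode each chunk to a sparse matrix and stack the pieces.
parts = [te.transform(chunk, sparse=True)
         for chunk in read_transactions('transactions.csv')]
sparse_df = pd.DataFrame.sparse.from_spmatrix(vstack(parts), columns=te.columns_)

# min_support chosen arbitrarily for illustration
frequent_itemsets = fpgrowth(sparse_df, min_support=0.01, use_colnames=True)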