rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

How to apply apriori association rules on dask dataframe? #569

Closed ywzhang188 closed 4 years ago

ywzhang188 commented 5 years ago

I am trying Dask to work around the memory issue when mining association rules, which requires converting transaction data into a large one-hot encoded matrix. However, I could not get it to work. This is what I have tried.

First, I don't know how to convert a sparse matrix into a Dask DataFrame. Second, I am not sure that `map_partitions` is the right function to use.

```python
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import dask.array as da
import dask.dataframe as dd

te = TransactionEncoder()
te_ary = te.fit(df).transform(df_loader, sparse=True)
test = dd.from_dask_array(da.from_array(te_ary, chunks=10000), columns=te.columns_)
ddf_out = test.map_partitions(lambda df: df.assign(result=apriori(df, use_colnames=True)))
```

###############

I tried this today; it still gives me a memory error traceback:

```python
import pandas as pd

sdf = pd.SparseDataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(sdf, min_support=0.01, use_colnames=True)
```

```
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    frequent_itemsets = apriori(sdf, min_support=0.01, use_colnames=True)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/mlxtend/frequent_patterns/apriori.py", line 146, in apriori
    idxs = np.where((df.values != 1) & (df.values != 0))
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandas/core/generic.py", line 5444, in values
    return self._data.as_array(transpose=self._AXIS_REVERSED)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 821, in as_array
    arr = mgr._interleave()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 839, in _interleave
    result = np.empty(self.shape, dtype=dtype)
MemoryError
```
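One way to avoid the densifying step that triggers the `MemoryError` is to keep the one-hot matrix sparse end to end. The sketch below is untested and assumes pandas >= 1.0 (for the sparse accessor) and a recent mlxtend release in which `apriori` accepts DataFrames with sparse columns and exposes a `low_memory` flag; `transactions` stands in for the raw list of baskets:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

te = TransactionEncoder()
# `transactions` is a placeholder: a list of lists of items, one per basket
te_ary = te.fit(transactions).transform(transactions, sparse=True)  # scipy CSR matrix

# Wrap the CSR matrix in a DataFrame with sparse columns instead of the
# deprecated pd.SparseDataFrame, so nothing is densified up front
sparse_df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)

frequent_itemsets = apriori(sparse_df, min_support=0.01,
                            use_colnames=True, low_memory=True)
```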

rasbt commented 5 years ago

This is an interesting question. I haven't tried the apriori function with Dask DataFrames as a drop-in replacement for pandas DataFrames yet. However, maybe someone else has some useful tips!?

Also, how about trying it on a non-sparse DataFrame first? If that's not possible due to memory limitations, you could experiment with a subset until you get the code to work.
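For instance, a throwaway prototype on a random sample of the transactions might look like this (untested sketch; `transactions` and the sample size are placeholders):

```python
import random
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

random.seed(0)
# Validate the pipeline on a manageable subset before scaling up
sample = random.sample(transactions, k=min(10_000, len(transactions)))

te = TransactionEncoder()
te_ary = te.fit(sample).transform(sample)  # dense boolean array
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=0.01, use_colnames=True)
```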

ZhengTzer commented 5 years ago

I also have a question about this. You have large-scale data; I only have about 50 rules in a CSV file with 2 columns: [buying item], [recommend].

I want to use them for recommendations before shopping-cart payment, but I don't know how to apply them effectively. The system engineer's feedback is that the backend would need to scan the whole file to find an exact match, which would mean a high load time.

How do you apply apriori in a real case? Please share some experience. Did I approach the problem wrongly?

ywzhang188 commented 5 years ago

> I also have a question about this. You have large-scale data; I only have about 50 rules in a CSV file with 2 columns: [buying item], [recommend].
>
> I want to use them for recommendations before shopping-cart payment, but I don't know how to apply them effectively. The system engineer's feedback is that the backend would need to scan the whole file to find an exact match, which would mean a high load time.
>
> How do you apply apriori in a real case? Please share some experience. Did I approach the problem wrongly?

I think you just need to feed in transaction data. If you don't have it, you should convert your data into transaction arrays. I am also not sure how you want to handle the recommend field.
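To sketch that workflow end to end (untested; the toy baskets, thresholds, and the lookup-table idea are illustrative assumptions, not something prescribed by mlxtend): encode the transactions, mine frequent itemsets, derive rules with `association_rules`, and precompute a mapping from cart contents to recommendations so the backend does not have to scan a file on every request.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy baskets: one list of purchased items per transaction
transactions = [["milk", "bread"], ["milk", "eggs"], ["bread", "eggs", "milk"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent_itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

# Precompute an in-memory lookup keyed by antecedent itemset so serving a
# recommendation is a dict lookup rather than a scan over the whole rules file
lookup = {}
for rule in rules.itertuples():
    lookup.setdefault(rule.antecedents, []).append(rule.consequents)

cart = frozenset(["milk"])
print(lookup.get(cart, []))  # recommended consequents for this cart, if any
```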

ywzhang188 commented 5 years ago

> I am trying Dask to work around the memory issue when mining association rules, which requires converting transaction data into a large one-hot encoded matrix. However, I could not get it to work. This is what I have tried.
>
> First, I don't know how to convert a sparse matrix into a Dask DataFrame. Second, I am not sure that `map_partitions` is the right function to use.
>
> ```python
> from mlxtend.preprocessing import TransactionEncoder
> from mlxtend.frequent_patterns import apriori
> import dask.array as da
> import dask.dataframe as dd
>
> te = TransactionEncoder()
> te_ary = te.fit(df).transform(df_loader, sparse=True)
> test = dd.from_dask_array(da.from_array(te_ary, chunks=10000), columns=te.columns_)
> ddf_out = test.map_partitions(lambda df: df.assign(result=apriori(df, use_colnames=True)))
> ```

rasbt commented 5 years ago

Note that there is also an implementation of fpgrowth and fpmax, which are both faster and more efficient than apriori. Have you tried those to see if they help with your performance problems?
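For reference, `fpgrowth` (and `fpmax`) in `mlxtend.frequent_patterns` take the same one-hot encoded DataFrame and arguments as `apriori`, so swapping should be a one-line change; a minimal sketch, assuming `onehot_df` is the encoded frame produced by TransactionEncoder as above:

```python
from mlxtend.frequent_patterns import fpgrowth

# Drop-in replacement for the apriori call on the same one-hot DataFrame
frequent_itemsets = fpgrowth(onehot_df, min_support=0.01, use_colnames=True)
```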

ywzhang188 commented 5 years ago

> Note that there is also an implementation of fpgrowth and fpmax, which are both faster and more efficient than apriori. Have you tried those to see if they help with your performance problems?

Thank you for the pointer. The main problem for me is how to use a sparse matrix to build a Dask DataFrame, since memory is the biggest issue.
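One possible direction, offered as an untested sketch (the chunk size and helper function are made up for illustration): build the Dask DataFrame from row-chunks of the scipy sparse matrix with `dask.delayed`, densifying only one chunk at a time. Note that running apriori inside `map_partitions` would only mine each partition separately, so the per-partition results would still need to be combined to get globally frequent itemsets.

```python
import pandas as pd
import dask
import dask.dataframe as dd

def chunk_to_frame(sparse_rows, columns):
    # Densify only this slice of rows; peak memory stays bounded by chunk_size
    return pd.DataFrame(sparse_rows.toarray(), columns=columns)

chunk_size = 100_000  # illustrative value; tune to the available memory
parts = [
    dask.delayed(chunk_to_frame)(te_ary[i:i + chunk_size], te.columns_)
    for i in range(0, te_ary.shape[0], chunk_size)
]
ddf = dd.from_delayed(parts)
```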

rasbt commented 5 years ago

I am not sure how to do that in Dask, to be honest -- I have never tried it. Maybe you could try to get some answers and tips on Stack Overflow? If you find a solution, please let us know here, and we can post it as an example in the documentation.