This is an interesting question. I haven't tried the apriori function with Dask DataFrames as a drop-in replacement for pandas DataFrames yet. However, maybe someone else has some useful tips!?
Also, how about trying it on a non-sparse DataFrame first? If that's not possible due to memory limitations, you could experiment with a subset until you get the code to work.
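For example, a minimal dense baseline on a toy subset (the transactions list here is made up):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Made-up toy transactions, just to confirm the pipeline works.
transactions = [
    ['milk', 'bread'],
    ['bread', 'butter'],
    ['milk', 'bread', 'butter'],
]

# One-hot encode into a dense boolean DataFrame.
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
print(frequent_itemsets)
```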
I also have a question about this. You have large-scale data; I had only 50 rules, in a CSV file with 2 columns, [buying item] and [recommend], used for recommendations before shopping-cart payment. But I don't know how to apply it effectively: the system engineer's feedback is that the backend has to search the whole file to find an exact match, which leads to a high load time.
How do you apply apriori in a real case? Please share some experience. Did I approach the matter wrongly?
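With only 50 rules, one option is to load them into an in-memory dictionary once at startup, so each lookup at checkout is constant-time rather than a full file scan. A minimal sketch, with the file name and column names assumed:

```python
import csv

# Hypothetical sketch: 'rules.csv' and the column names are assumptions.
# Load the ~50 rules once at startup into a dict keyed by the purchased
# item, so each lookup is a hash lookup instead of a file scan.
rules = {}
with open('rules.csv', newline='') as f:
    for row in csv.DictReader(f):
        rules.setdefault(row['buying item'], []).append(row['recommend'])

def recommend(item):
    """Return the recommended items for `item`, or an empty list."""
    return rules.get(item, [])
```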
I think you just need to feed in transaction data. If you don't have it in that form, you should transform your data into transaction arrays. And I am not sure how you want to deal with the [recommend] field.
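A minimal sketch of that transformation, assuming the raw data comes as one row per (order, item) pair (all names here are invented):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# Hypothetical raw data: one row per (order_id, item) pair.
raw = pd.DataFrame({
    'order_id': [1, 1, 2, 2, 2, 3],
    'item':     ['milk', 'bread', 'bread', 'butter', 'milk', 'bread'],
})

# Group into one transaction (a list of items) per order.
transactions = raw.groupby('order_id')['item'].apply(list).tolist()

# One-hot encode the transactions into the format apriori expects.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
```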
I am trying Dask to solve the memory issue when computing association rules, which requires transforming the transaction data into a large-scale one-hot matrix. But somehow I did not manage to make it work. This is what I have tried.
First, I don't know how to turn a sparse matrix into a Dask dataframe. Second, I am not sure map_partitions is the right function to use:

```python
import dask.array as da
import dask.dataframe as dd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

te = TransactionEncoder()
te_ary = te.fit(df).transform(df, sparse=True)  # scipy sparse matrix

# Problem 1: da.from_array expects a dense/NumPy-like array, not a
# scipy sparse matrix.
# Problem 2: map_partitions would run apriori on each 10000-row chunk
# independently, so supports would be per-partition, not global.
test = dd.from_dask_array(da.from_array(te_ary, chunks=10000),
                          columns=te.columns_)
ddf_out = test.map_partitions(
    lambda pdf: pdf.assign(result=apriori(pdf, use_colnames=True)))
```
Note that there is also an implementation of fpgrowth and fpmax, which are both faster and more memory-efficient than apriori. Have you tried those to see if they help with your performance problems?
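They take the same one-hot DataFrame input and the same parameters as apriori; a quick sketch with made-up toy data:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, fpmax

# Toy transactions for illustration only.
transactions = [['milk', 'bread'], ['bread', 'butter'],
                ['milk', 'bread', 'butter']]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Both accept the same min_support/use_colnames parameters as apriori.
frequent_itemsets = fpgrowth(df, min_support=0.5, use_colnames=True)
maximal_itemsets = fpmax(df, min_support=0.5, use_colnames=True)
```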
Thank you for the notice. The main problem for me is how to use a sparse matrix to generate a Dask dataframe, since the biggest issue for me is memory.
I am not sure how to do that in Dask, to be honest -- I have never tried. Maybe you could try to get some answers and tips via Stack Overflow? If you find a solution, please let us know here; then we can post it as an example in the documentation.
I tried this today; it still gives me a MemoryError traceback:

```python
sdf = pd.SparseDataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(sdf, min_support=0.01, use_colnames=True)
```
```
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    frequent_itemsets = apriori(sdf, min_support=0.01, use_colnames=True)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/mlxtend/frequent_patterns/apriori.py", line 146, in apriori
    idxs = np.where((df.values != 1) & (df.values != 0))
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandas/core/generic.py", line 5444, in values
    return self._data.as_array(transpose=self._AXIS_REVERSED)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 821, in as_array
    arr = mgr._interleave()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 839, in _interleave
    result = np.empty(self.shape, dtype=dtype)
MemoryError
```
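The traceback shows where the memory goes: apriori's `df.values` call makes pandas interleave the whole SparseDataFrame into one dense array (`result = np.empty(self.shape, ...)`), so the sparse representation is lost exactly at that point. Note also that `pd.SparseDataFrame` is deprecated in newer pandas. A possibly lighter route, sketched under the assumption of pandas >= 0.25 and mlxtend >= 0.17 (which accepts sparse-dtype DataFrames):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Toy transactions for illustration only.
transactions = [['milk', 'bread'], ['bread', 'butter'],
                ['milk', 'bread', 'butter']]

# sparse=True returns a scipy CSR matrix instead of a dense array.
te = TransactionEncoder()
sparse_ary = te.fit(transactions).transform(transactions, sparse=True)

# Wrap the scipy matrix in a sparse-dtype DataFrame (pandas >= 0.25)
# instead of the deprecated pd.SparseDataFrame.
sdf = pd.DataFrame.sparse.from_spmatrix(sparse_ary, columns=te.columns_)

# low_memory=True trades speed for a smaller peak memory footprint.
frequent_itemsets = apriori(sdf, min_support=0.5, use_colnames=True,
                            low_memory=True)
```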