rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Support arbitrary column names if apriori is used with SparseDataFrame #501

Closed nshahHome closed 5 years ago

nshahHome commented 5 years ago

I am running Python 2.7 in Anaconda and have installed mlxtend. According to the latest version of mlxtend, the apriori function supports a sparse dataframe as its input. I have over 500k products that I want to run a market basket analysis on.

I have created a one-hot encoded sparse dataframe from a small dataset as a test, but I am running into a df.to_coo() issue on the sparse dataframe inside the mlxtend apriori function.

Please find the code, the input data file and the errors I get here - https://github.com/nshahHome/pycode

Click on the view code to see the files.

code = code2.py , input data file= mbatest.txt , errors = code2-error.html (pdf version) , condalist.txt

I expect the code to run without errors and to create frequent_itemsets. The set could be empty if no itemsets exceed min_support.

rasbt commented 5 years ago

Hi there,

it seems the current limitation for sparse dataframes is that the column indices have to be in consecutive order. I.e.,

df.columns = range(len(df.columns))
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

would probably fix the issue. However, it would be good to make it compatible with arbitrary column names in the future, so let's leave this issue open. Tagging @DanielMorales9, who initially added SparseDataFrame support, in the hope that he might have some good suggestions about this.
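For anyone who wants to use the renumbering workaround without losing the original product IDs, here is a minimal sketch. It assumes the column labels are saved before renumbering and mapped back afterwards. The example data, `original_names`, and the helper `to_original_names` are hypothetical and not part of mlxtend.

```python
import pandas as pd

# Hypothetical one-hot basket data whose columns are numeric product IDs.
df = pd.DataFrame(
    [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 1]],
    columns=[101, 205, 309],
)

# Remember the original labels, then renumber the columns 0..n-1.
original_names = list(df.columns)
df.columns = range(len(df.columns))


def to_original_names(itemset, names=original_names):
    """Translate positional column indices back to the original labels."""
    return frozenset(names[i] for i in itemset)


# An itemset expressed in the renumbered columns (hypothetical result).
restored = to_original_names(frozenset({0, 2}))
print(sorted(restored))
```

After renumbering, itemsets reported with use_colnames=True should contain the positional integers, so a lookup like this is one way to get back to the product IDs.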

PS: I highly recommend using Python 3, we dropped Python 2.7 support ~ 1 year ago. The code may still work fine, but it's not included in the continuous integration tests anymore and could produce unexpected results.

nshahHome commented 5 years ago

Hello Rasbt,

Your suggestion of using the range to order the column indices worked for my test case.

I have also tried it on cases where the column names (product names) are numeric, since most products in large implementations are typically represented by numeric IDs, and it works.

Thank you for your support.

Best Regards, Nitin.

DanielMorales9 commented 5 years ago

Hi All,

I reviewed the code. I don't think it's an mlxtend bug but rather a pandas limitation. It seems that pandas tries to infer the column types by looking up the first column (i.e. the one at position 0), but when the columns are named after integers it ends up looking up the column with the name zero.
Simply stringifying the columns and indexes as follows solves the problem:

df = pd.SparseDataFrame(coo1,
                        index=[str(i) for i in TRX_ID_c.categories],
                        columns=[str(c) for c in PRODUCT_c.categories],
                        default_fill_value=0)

Personally, I do not see why the apriori algorithm should handle this pandas dirty check (which can be easily solved through a bit of debugging). I would rather keep the implementation as it is and raise an issue with pandas.

rasbt commented 5 years ago

@DanielMorales9 Thanks for looking into it! Good point, it seems to be a pandas limitation. E.g.,

import numpy as np
import pandas as pd

ary = np.array([ [1, 0, 0, 3],
                 [1, 0, 2, 0],
                 [0, 4, 0, 0] ])

df = pd.DataFrame(ary)
df.columns = [1, 2, 3, 4]

dfs = pd.SparseDataFrame(df,
                         default_fill_value=0)

The following won't work:

dfs.to_coo().tocsc()

But either

dfs2 = dfs.copy()
dfs2.columns = [0, 1, 2, 3]
dfs2.to_coo().tocsc()

or

dfs3 = dfs.copy()
dfs3.columns = [str(i) for i in dfs3.columns]
dfs3.to_coo().tocsc()

seem to work fine. I will open an issue about that on the pandas issue tracker.

In mlxtend, maybe what we should do is check at the beginning of the function whether the dataframe is sparse and its column names are integer types, and then raise a warning/error saying that, due to a current limitation in pandas, the user should pass a dataframe with str column names. That would make it more obvious what's going on, because the current "keyerror 0" message otherwise looks like an apriori bug and may be confusing.
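A minimal sketch of what such an up-front check could look like. This is not mlxtend's actual implementation: the function name `check_sparse_column_names` and the `is_sparse` flag are hypothetical, and the sketch raises an error rather than a warning. It only inspects the first column label, on the assumption that the failing lookup is for the label 0, and it uses a plain DataFrame so it does not depend on SparseDataFrame being available.

```python
import pandas as pd


def check_sparse_column_names(df, is_sparse=True):
    """Raise a ValueError for column labels that trip up the sparse conversion.

    Hypothetical helper, not mlxtend's API. `is_sparse` is supplied by the
    caller so the sketch does not depend on a particular pandas version.
    """
    if not is_sparse or len(df.columns) == 0:
        return
    first = df.columns[0]
    # String labels and integer labels starting at 0 are the two cases
    # reported as working in this thread.
    if not isinstance(first, str) and first != 0:
        raise ValueError(
            "Due to a current limitation in pandas, sparse input with "
            "non-string column names must start at 0. Consider casting "
            "the names to strings: df.columns = [str(c) for c in df.columns]"
        )


df = pd.DataFrame([[1, 0, 0, 3], [1, 0, 2, 0], [0, 4, 0, 0]])
df.columns = [1, 2, 3, 4]

try:
    check_sparse_column_names(df)
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

With a check like this, the user would see a message pointing at the column names instead of a bare key error from inside apriori.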

@nshahHome

Glad that it solves your issue. Instead of renumbering the columns, you may also consider passing them as strings, as suggested by @DanielMorales9, so that the results stay interpretable if the column names have a special meaning in your use case.