rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Association rules blows up memory and fails #837

Closed vopani closed 2 years ago

vopani commented 3 years ago

Describe the bug

Running FP-Growth followed by association rule generation on two similar datasets: one completes in ~3 minutes using ~4 GB of RAM, while the other blows up to 16 GB of RAM and crashes.

Datasets

The two datasets are not meaningfully different; both are randomly sampled from this dataset.

data_1 has shape (17091, 3171) and data_2 has shape (17419, 3194).

datasets.zip

Steps/Code to Reproduce

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

data_1 = pd.read_csv('sample_1.csv', index_col=0)
data_2 = pd.read_csv('sample_2.csv', index_col=0)

# min_support = 1/n_transactions keeps every itemset that appears in at least
# one transaction, and min_threshold=0 (on the default confidence metric)
# keeps every generated rule, so nothing is pruned at either stage.
ar_1 = association_rules(fpgrowth(df=data_1, min_support=1/data_1.shape[0], use_colnames=True), min_threshold=0)
ar_2 = association_rules(fpgrowth(df=data_2, min_support=1/data_2.shape[0], use_colnames=True), min_threshold=0)

Expected Results

Both calls should run with similar time and memory usage.

Actual Results

ar_1 completes within 5 minutes, peaking at 4 GB of RAM. ar_2 blows up to 16 GB of RAM and crashes.

Versions

>>> import mlxtend; print("MLxtend", mlxtend.__version__)
MLxtend 0.18.0
>>> import platform; print(platform.platform())
macOS-10.16-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.8.2 (default, Mar 11 2020, 00:29:50) 
[Clang 11.0.0 (clang-1100.0.33.17)]
>>> import sklearn; print("Scikit-learn", sklearn.__version__)
Scikit-learn 0.24.2
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.21.0
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.6.3
>>> import pandas; print("Pandas", pandas.__version__)
Pandas 1.2.5

Comments

Thanks a lot for creating and maintaining this library.

rasbt commented 3 years ago

Thanks for reporting this and providing reproducible scripts. I ran into similar issues on my computer. I'm not sure whether it is necessarily a bug or simply a memory limitation, given that it's a relatively large dataset.

E.g., when I increased the support threshold by a factor of 10,

min_support=1/data_1.shape[0]*10

the code for 'sample_1.csv' runs in under 4 seconds, and the code for 'sample_2.csv' in under 5 seconds.
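
For reference, a minimal sketch of that adjustment (same file names as in the report above; the frequent_1 variable is just for illustration):

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

data_1 = pd.read_csv('sample_1.csv', index_col=0)

# Raising min_support from 1/n to 10/n prunes itemsets that occur in fewer
# than 10 transactions, which shrinks the search space dramatically.
frequent_1 = fpgrowth(df=data_1, min_support=10/data_1.shape[0], use_colnames=True)
ar_1 = association_rules(frequent_1, min_threshold=0)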

vopani commented 3 years ago

I'm keen to understand this issue exactly (or to be convinced of the limitations on large datasets). I know that raising the support threshold limits the number of itemsets (and the data in general), but I'm working on a project that requires running it as-is on the whole dataset.

I investigated this a bit further and narrowed it down to a much, much smaller dataset.

datasets.zip

Could you take a look at this one? (The same code as above can be used to test it.) The two datasets have shapes data_1: (20, 15) and data_2: (22, 17).

I'm wondering why data_1 works fine while data_2, even being this small, still climbs to 16 GB of memory and crashes.

rasbt commented 2 years ago

I was able to run both datasets. The first one finishes immediately, resulting in 2,006 association rules. The second one seems to use more memory and takes about 10 seconds, but the resulting data frame has 4,752,394 rows. The large number of association rules is likely what causes the memory blow-up. By the way, I get the same results with apriori, so I don't think there is a bug in fpgrowth.
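
One way to see the blow-up coming is to inspect the frequent-itemset table before calling association_rules. A minimal sketch, assuming the small sample files from the previous comment:

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

data_2 = pd.read_csv('sample_2.csv', index_col=0)
frequent_2 = fpgrowth(df=data_2, min_support=1/data_2.shape[0], use_colnames=True)

# The rule count grows with both the number and the size of the frequent
# itemsets, so large values here signal a huge rules table ahead.
print('frequent itemsets:', len(frequent_2))
print('largest itemset size:', frequent_2['itemsets'].apply(len).max())

rules_2 = association_rules(frequent_2, min_threshold=0)
print('rules:', len(rules_2))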

vopani commented 2 years ago

I managed to test my actual dataset on a massive 1 TB RAM machine and, sure enough, it did finish after a few days, using close to 400 GB of RAM 😄

Maybe there are just too many combinations to work through (a typical problem for market-basket datasets): each frequent itemset of size k can yield up to 2^k − 2 candidate rules, so the rule count explodes combinatorially.

The memory blow-up is likely due to the use of frozensets, which require 3x-10x more memory than lists of characters or integers (the frozenset columns themselves account for 90%+ of the memory). I do understand the need for them, though. I'll probably experiment with some changes to the source code to avoid frozensets in a way that suits my needs. A quick way to check where the memory goes is sketched below.
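
A minimal sketch of that check using pandas' memory_usage(deep=True), which counts the frozenset objects inside each column rather than just the 8-byte pointers (file name as in the original report):

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

data_1 = pd.read_csv('sample_1.csv', index_col=0)
rules = association_rules(
    fpgrowth(df=data_1, min_support=1/data_1.shape[0], use_colnames=True),
    min_threshold=0)

# 'antecedents' and 'consequents' are the frozenset-valued columns.
per_column = rules.memory_usage(deep=True)
print(per_column)
frozen = per_column['antecedents'] + per_column['consequents']
print(f'frozenset columns: {frozen / per_column.sum():.1%} of total')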

@rasbt Thank you so much for checking and for your time.

rasbt commented 2 years ago

Glad you got it to run, and wow, yeah, that's definitely a big memory hog. Regarding the frozensets, it's a bummer that they take up so much memory. It's been a long time, but I remember choosing frozensets because they improved efficiency compared to regular sets. If you compare the two again in terms of memory needs and find that regular sets perform better, please let me know; I'm happy to reconsider the implementation choices.
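
For anyone who wants to run that comparison, sys.getsizeof gives the shallow per-container footprint (a rough sketch; the member strings are shared across containers, so this isolates the container overhead):

import sys

items = ('bread', 'milk', 'eggs')

# Shallow sizes of the containers themselves; the element strings are the
# same objects in each case, so only the container overhead differs.
for container in (frozenset(items), set(items), tuple(items), list(items)):
    print(f'{type(container).__name__:>9}: {sys.getsizeof(container)} bytes')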

armenabnousi commented 1 year ago

This seems relevant here.