Closed: vopani closed this issue 2 years ago
Thanks for mentioning that and providing reproducible scripts. On my computer, I had similar issues. Not sure if it is necessarily a bug or just due to memory limitations given that it's a relatively large dataset.
E.g., when I increased the support threshold by a factor of 10, via min_support=1/data_1.shape[0]*10, the code for 'sample_1.csv' runs in under 4 seconds, and the code for 'sample_2.csv' in under 5 seconds.
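The effect of the support threshold can be illustrated with a small stdlib-only sketch: a naive itemset counter over toy transactions (not mlxtend's implementation; the transactions and names here are made up). A threshold of 1/n keeps every itemset that occurs even once, while raising it prunes the long tail of rare combinations.

```python
# Naive, stdlib-only sketch of how min_support prunes candidate itemsets.
# `transactions` is a toy stand-in for the one-hot CSV data in the report.
from itertools import combinations

transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]

def frequent_itemsets(transactions, min_support):
    """Return every itemset whose support (fraction of transactions
    containing it) is at least min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(set(cand) <= t for t in transactions) / n
            if support >= min_support:
                frequent[cand] = support
    return frequent

# min_support=1/n keeps every itemset that occurs at least once;
# doubling it already discards the rare combinations.
low = frequent_itemsets(transactions, 1 / len(transactions))
high = frequent_itemsets(transactions, 2 / len(transactions))
print(len(low), len(high))  # → 15 7
```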
I'm keen to understand this issue exactly (or to be convinced of the limitations on large datasets). I know raising the support threshold helps limit the number of generated itemsets (and the data in general), but I'm working on a particular project that requires running it as is on the whole dataset.
I investigated this a bit further and narrowed it down to a much smaller dataset.
Could you take a look at this one? (The same code as above can be used to test it.)
The two datasets have shapes data_1: (20, 15) and data_2: (22, 17). I'm wondering why data_1 works fine while data_2, even at this small size, still climbs to 16GB of memory and crashes.
I was able to run both datasets. The first one finishes immediately, resulting in 2,006 association rules. The second one uses more memory and takes about 10 seconds, but the resulting data frame has 4,752,394 rows. The large number of association rules is likely what causes the memory blow-up. Btw, I get the same results with apriori, so I don't think there is a bug in fpgrowth.
I managed to test my actual dataset on a massive 1TB RAM machine and, sure enough, it did finish after a few days, using close to 400GB of RAM 😄
Maybe there are just too many combinations, which takes time (a typical problem for market-basket datasets).
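The combinatorial explosion is easy to see directly: a single frequent itemset of size k can be split into antecedent/consequent pairs in 2^k − 2 ways (every non-empty proper subset as antecedent), so even a handful of dense columns can yield millions of candidate rules. A quick back-of-the-envelope check (plain Python, not the library's code):

```python
# Each frequent itemset of size k yields up to 2**k - 2 candidate rules
# (every non-empty proper subset as antecedent, the rest as consequent),
# which is why rule counts explode even for small, dense datasets.
for k in (5, 10, 15, 17):
    print(f"itemset size {k}: up to {2**k - 2} rules")
```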
The memory blow-up is likely due to the use of frozensets, which require 3x-10x more memory than lists of characters or integers (the frozenset columns themselves account for 90%+ of the memory). I do understand the need for them, though. I'll probably experiment with some changes to the source code that avoid frozensets and suit my needs.
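A quick, stdlib-only way to see the per-object overhead (exact byte counts vary by CPython version and platform, so the specific numbers here are not from the issue):

```python
import sys

items = (1, 2, 3)
# frozenset objects carry a hash table, so even tiny ones are much larger
# than the equivalent tuple or list holding the same elements.
for obj in (frozenset(items), set(items), list(items), items):
    print(type(obj).__name__, sys.getsizeof(obj))
```

Note that sys.getsizeof only measures the container itself, not the elements it references, so real per-row savings depend on the data.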
@rasbt Thank you so much for checking and for your time.
Glad you got it to run, and wow, yeah, that's definitely a big memory hog. Regarding the frozensets, that's a bummer that they take up so much memory. It's been a long time, but I remember having chosen frozensets because it improved efficiency compared to regular sets. In case you compare the two again in terms of memory needs and find that regular sets perform better, please let me know, I am happy to reconsider implementation choices.
this seems relevant here
Describe the bug
Running FP-Growth + Association Rules on two similar datasets: one runs successfully in 3 minutes using ~4GB RAM, while the other blows up to 16GB RAM and crashes.
Datasets
The two datasets are not really different; they are just randomly sampled from this dataset.
data_1 shape is (17091, 3171)
data_2 shape is (17419, 3194)
datasets.zip
Steps/Code to Reproduce
Expected Results
Both should run similarly.
Actual Results
ar_1 runs within 5 minutes, peaking at 4GB RAM usage.
ar_2 blows up memory to 16GB RAM and crashes.
Versions
Comments
Thanks a lot for creating and maintaining this library.