rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Presence of large baskets dramatically increases runtime (FPGrowth) #909

Open Ncalverley opened 2 years ago

Ncalverley commented 2 years ago

Hello all,

I've been using your wonderful library to derive product association rules with FPGrowth, in order to identify items that are frequently bought together. Typically, the algorithm takes no more than 2.5 minutes to derive a set of association rules. However, an ongoing problem I've been dealing with is that for some data sets, the FPGrowth function unpredictably takes 30 minutes or more to complete. Sometimes the memory usage becomes so extreme that the kernel simply crashes.

I recently discovered the source of the problem: the runtime spikes only occurred when the training data contained very large customer baskets. I can't provide the exact data set because it is proprietary, but I can give you these statistics about the data that were producing the dramatically increased runtimes:

After I put in some code to drop these extremely large customer baskets from the data prior to training FPGrowth, the issue was resolved, and the model completed in a normal amount of time, yielding an expected number of association rules.
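
The issue does not include the filtering code, so the following is only a minimal sketch of what such a step might look like. It assumes the transactions are held as a list of item lists; the function name `drop_large_baskets` and the cutoff `MAX_ITEMS = 50` are hypothetical and would need tuning for a real data set.

```python
def drop_large_baskets(transactions, max_items):
    """Keep only the baskets that contain at most `max_items` distinct items."""
    return [basket for basket in transactions if len(set(basket)) <= max_items]


# Toy data standing in for the proprietary transactions.
transactions = [
    ["milk", "bread"],
    ["milk", "eggs", "butter"],
    [f"item_{i}" for i in range(200)],  # one unusually large basket
]

MAX_ITEMS = 50  # hypothetical cutoff, not taken from the issue
filtered = drop_large_baskets(transactions, MAX_ITEMS)
print(len(transactions), "->", len(filtered))
```

The filtered list can then be one-hot encoded with mlxtend's `TransactionEncoder` and passed to `fpgrowth` as before.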

I do not know why FPGrowth would be susceptible to these very large baskets causing an explosion in runtime, but I thought I'd report it to you guys in case this was not something you were aware of.

Thanks!
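
One possible explanation, offered as an assumption since the issue does not confirm a root cause: a basket with n distinct items contains 2^n - 1 non-empty itemsets. If the items of a very large basket clear `min_support` together (for example, when the threshold is low), the number of frequent itemsets to enumerate can grow exponentially with basket size. The sketch below only illustrates that arithmetic; the helper name `itemsets_in_basket` is hypothetical and not part of mlxtend.

```python
from math import comb


def itemsets_in_basket(n_items, max_len=None):
    """Count the non-empty itemsets contained in a basket of `n_items` distinct items.

    If `max_len` is given, count only itemsets with at most `max_len` items.
    """
    longest = n_items if max_len is None else min(max_len, n_items)
    return sum(comb(n_items, k) for k in range(1, longest + 1))


# Compare the unrestricted count with the count when itemsets are capped at 3 items.
for n in (5, 10, 20, 40):
    print(n, itemsets_in_basket(n), itemsets_in_basket(n, max_len=3))
```

If this is indeed the cause, the `max_len` argument of mlxtend's `fpgrowth` (which caps the length of generated itemsets) might be an alternative to dropping the large baskets outright, though whether it helps on a given data set would need to be measured.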