Presence of large baskets dramatically increases runtime (FPGrowth)

Hello all,

I've been working with your wonderful library to derive product association rules with FPGrowth, to identify items that are frequently bought together. Typically, the algorithm takes no more than 2.5 minutes to derive a set of association rules. However, an ongoing problem I've been dealing with is that for some data sets, the FPGrowth function unpredictably takes 30 minutes or more to complete. Sometimes the memory usage becomes so extreme that the kernel simply crashes.

I recently discovered the source of the problem: the runtime spikes only occurred when the training data contained large customer baskets. I can't provide the exact data set because it is proprietary, but I can give you these statistics about the data that produced the dramatically increased runtimes:

The training data contained around 100k customer baskets.
The vast majority (~99%) of baskets contained between 2 and 10 purchased items.
A very small fraction of the baskets contained 50 or more items: ~30 baskets in this problematic data set.

After I put in some code to drop these extremely large customer baskets from the data prior to training FPGrowth, the issue was resolved, and the model completed in a normal amount of time, yielding an expected number of association rules.
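
For anyone hitting the same problem, the workaround described above can be sketched roughly as follows. This is a minimal illustration, not the exact code I used: the helper names `drop_large_baskets` and `mine_frequent_itemsets`, the `size_cutoff=50` default, and the list-of-lists input format are all assumptions made for the example. The mlxtend calls (`TransactionEncoder` and `fpgrowth`) are the library's documented ones.

```python
def drop_large_baskets(baskets, size_cutoff=50):
    """Keep only baskets with fewer than `size_cutoff` distinct items.

    `baskets` is assumed to be a list of item lists, one per order.
    The cutoff of 50 is an illustrative assumption, not a recommended value.
    """
    return [basket for basket in baskets if len(set(basket)) < size_cutoff]


def mine_frequent_itemsets(baskets, min_support, size_cutoff=50):
    """Filter out oversized baskets, then run FPGrowth on what remains."""
    # Imported here so the filtering helper above can be used without mlxtend.
    import pandas as pd
    from mlxtend.frequent_patterns import fpgrowth
    from mlxtend.preprocessing import TransactionEncoder

    kept = drop_large_baskets(baskets, size_cutoff)

    # One-hot encode the remaining baskets into the boolean frame fpgrowth expects.
    encoder = TransactionEncoder()
    onehot = encoder.fit(kept).transform(kept)
    frame = pd.DataFrame(onehot, columns=encoder.columns_)

    return fpgrowth(frame, min_support=min_support, use_colnames=True)
```

Note that when `min_support` is a fraction, dropping baskets also changes the number of transactions it is measured against, although removing ~30 baskets out of ~100k barely moves it.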

I do not know why FPGrowth would be susceptible to these very large baskets causing an explosion in runtime, but I thought I'd report it to you guys in case this was not something you were aware of.
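
One possible explanation, which I have not verified against the library internals: a basket with n distinct items contains 2^n - 1 non-empty itemsets, so if the minimum support is low enough that combinations shared by a handful of large baskets count as frequent, the number of itemsets to enumerate can explode. The sketch below only does that counting; the function name `itemsets_in_basket` is made up for the illustration. The `max_len` argument mirrors the `max_len` parameter of mlxtend's `fpgrowth`, which caps the length of the itemsets generated.

```python
from math import comb


def itemsets_in_basket(n_items, max_len=None):
    """Count the non-empty itemsets that are subsets of one basket.

    `n_items` is the number of distinct items in the basket. If `max_len`
    is given, only itemsets of at most that many items are counted.
    """
    longest = n_items if max_len is None else min(max_len, n_items)
    return sum(comb(n_items, k) for k in range(1, longest + 1))


if __name__ == "__main__":
    # Compare a typical basket with one of the very large ones.
    for n in (10, 50):
        print(n, itemsets_in_basket(n), itemsets_in_basket(n, max_len=3))
```

A 10-item basket has 1,023 possible itemsets, while a 50-item basket has about 1.1 quadrillion. Capping the length at 3 brings the 50-item case down to 20,875, so passing `max_len` to `fpgrowth` might be an alternative to dropping the large baskets, assuming short rules are all that is needed.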

Thanks!

rasbt/mlxtend, issue #909