rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

fpgrowth: runtime with dataset about 17 minutes, runtime with half-sized subset about 2.5 hours #676

Open · dale-wong opened this issue 4 years ago

dale-wong commented 4 years ago

Wondering if this is reasonable or not.

The full data set comprises about 100k users' purchases, with about 20k one-hot items. The subset covers about 50k male users' purchases, with about 10k one-hot items. There is some crossover between male and female users' purchases, but it's in the minority.

The local code that calls mlxtend is identical in both cases.

I realize this can't really be answered without the data; I'm just wondering whether this is within the realm of expectation or whether there might be a known problem in the fpgrowth implementation.
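For reference, a minimal sketch of the kind of pipeline involved (the actual code was not shared, so the transactions and the min_support value below are made up):

```python
# One-hot encode the purchase lists and mine frequent itemsets with
# fpgrowth; the same call is made for the full data and for the subset.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [                 # hypothetical purchase lists, one per user
    ["shirt", "socks"],
    ["shirt", "hat"],
    ["socks", "hat", "shirt"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# min_support is interpreted as a fraction of the transactions in `onehot`
itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(itemsets)
```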

harenbergsd commented 4 years ago

Are you using the same support value? Did your long run have more results?

dale-wong commented 4 years ago

Yes, same support value. It's hard to compare the results of the two runs since one is a superset of the other, and there is some crossover in items between the genders.

harenbergsd commented 4 years ago

I mean did the longer run have more frequent itemsets?

dale-wong commented 4 years ago

Yes, the longer run had ~6.4M itemsets, whereas the faster run had ~3.7M itemsets.

rasbt commented 4 years ago

Hm, the runtime complexity of fpgrowth is O(n^2), I think, where n is the number of unique items, so this isn't at all unreasonable. I would check your memory usage though, just in case the dataset is so large that your machine is swapping memory.
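One way to do that check (a sketch; psutil is a separate package, not an mlxtend dependency, and `onehot_df` stands for the one-hot encoded DataFrame):

```python
import psutil
from mlxtend.frequent_patterns import fpgrowth

def fpgrowth_with_memory_report(onehot_df, min_support):
    proc = psutil.Process()
    before = proc.memory_info().rss
    itemsets = fpgrowth(onehot_df, min_support=min_support, use_colnames=True)
    after = proc.memory_info().rss
    # if resident memory approaches total RAM, the OS is probably swapping
    print(f"resident memory grew by {(after - before) / 1e9:.2f} GB")
    print(f"system memory in use: {psutil.virtual_memory().percent}%")
    return itemsets
```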

dale-wong commented 4 years ago

Yes, but the key point is that the dataset with the longer runtime is a subset of the dataset with the faster runtime.

harenbergsd commented 4 years ago

Hmm, yeah, this seems a little weird to me. If the output is only about twice the size, I wouldn't expect such a large difference with fpgrowth. The memory issue rasbt mentioned could be the cause; that would be something to check. You could also try using smaller datasets while still doing the same subset comparison, as sketched below.

If you could give me a smaller sample of data where you see this effect (preferably one that runs in minutes or less), I could check it out too.
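A sketch of that scaled-down comparison, assuming an already one-hot encoded DataFrame `onehot` and a boolean Series `male_mask` aligned to its index (both hypothetical names):

```python
import time
from mlxtend.frequent_patterns import fpgrowth

def timed_fpgrowth(df, min_support):
    """Return (number of itemsets, seconds) for one fpgrowth run."""
    start = time.perf_counter()
    itemsets = fpgrowth(df, min_support=min_support, use_colnames=True)
    return len(itemsets), time.perf_counter() - start

# With `onehot` and `male_mask` defined, the comparison would look like:
#   sample = onehot.sample(n=5_000, random_state=0)   # smaller "full" set
#   subset = sample[male_mask.loc[sample.index]]      # same-style subset
#   print(timed_fpgrowth(sample, 0.001))
#   print(timed_fpgrowth(subset, 0.001))
```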

dale-wong commented 4 years ago

Yes, thank you both for your help and attention. I will need to anonymize the data before I can give you sample data sets. That might take a while; I've moved on and am using a different method instead. But I'd like to clear this up one way or another, and will get back to this when I have a spare moment. It's a good package, thanks for making it available.

smarie commented 2 months ago

Hello @dale-wong and @harenbergsd, I am very much interested in this discussion: did you eventually figure out, based on the provided data, why the much smaller dataset resulted in much slower processing? Was this an implementation edge case?

harenbergsd commented 2 months ago

> Hello @dale-wong and @harenbergsd, I am very much interested in this discussion: did you eventually figure out, based on the provided data, why the much smaller dataset resulted in much slower processing? Was this an implementation edge case?

I never got the data, so not sure. The fact that the smaller one took longer is not generally surprising though, because they used the same support value for both runs, which, if it represents a fraction of the total transactions, means the smaller run has a lower absolute count threshold and therefore generates more itemsets.
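To make that concrete: mlxtend's min_support is a fraction of the transactions in the DataFrame, so the same value implies a lower absolute count threshold on the smaller dataset (the sizes below are the rough numbers from this thread; the support value is made up):

```python
min_support = 0.001              # hypothetical relative threshold

full_run = 100_000               # ~100k users in the full dataset
subset_run = 50_000              # ~50k male users in the subset

# an itemset must occur at least this many times to be reported
print(min_support * full_run)    # 100.0 occurrences in the full run
print(min_support * subset_run)  # 50.0 occurrences in the subset run
```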

smarie commented 2 months ago

Indeed, this sounds like a reasonable explanation for the mysterious behaviour! Thanks @harenbergsd for the clarification.