dale-wong opened this issue 4 years ago
Are you using the same support value? Did your long run have more results?
Yes, same support value. It's hard to compare the results of the two runs since one is a superset of the other, and there is some crossover between the genders in items.
I mean did the longer run have more frequent itemsets?
Yes, the longer run had ~6.4M itemsets, whereas the faster run had ~3.7M itemsets.
Hm, the runtime complexity of fpgrowth is O(n^2) I think, where n is the number of unique items. So it's not at all unreasonable I think. I would check your memory usage though just in case the dataset is so large that your machine might be swapping memory.
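One quick way to sanity-check the memory angle is to measure peak Python heap usage around the call. This is only a sketch: `tracemalloc` is standard library and tracks Python-level allocations (which covers fpgrowth's dict-heavy tree building), but it will not show OS-level swapping; `build_big_list` below is just a stand-in for the real fpgrowth call, which is not reproduced here.

```python
import tracemalloc

def run_with_peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak Python heap usage in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

# Stand-in workload; replace with the actual fpgrowth call to profile it.
def build_big_list(n):
    return [i * i for i in range(n)]

result, peak = run_with_peak_memory(build_big_list, 100_000)
print(f"peak heap: {peak / 1e6:.1f} MB")
```

If the peak approaches physical RAM, swapping would easily explain a non-linear slowdown.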
Yes, but the key point is that the dataset with the longer runtime is a subset of the dataset with the faster runtime.
Hmm yeah this seems a little weird to me. If it's only twice the output size, I wouldn't expect so much of a difference with fpgrowth. The memory thing rabst mentioned could be right, that'd be something to check. You could also try using smaller datasets, but still doing the subset thing.
If you could give me a smaller sample of data where you see this effect (preferably with a runtime of minutes or less), I could check it out too.
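Since the item names need to be anonymized before sharing, one option is to hash them: co-occurrence structure is all fpgrowth cares about, and hashing preserves it while hiding the labels. This is a hypothetical helper, not something from mlxtend; the function name and the truncated 8-character digests are my own choices.

```python
import hashlib
import random

def anonymize(transactions, sample_size, seed=0):
    """Hash item names and take a reproducible random sample of transactions."""
    rng = random.Random(seed)
    sample = rng.sample(transactions, min(sample_size, len(transactions)))

    def h(item):
        # Truncated SHA-256 digest: stable, opaque, collision-unlikely at this scale.
        return hashlib.sha256(item.encode()).hexdigest()[:8]

    return [sorted(h(item) for item in tx) for tx in sample]

txs = [{"soap", "razor"}, {"soap", "lipstick"}, {"soap"}]
print(anonymize(txs, 2))
```

The fixed seed makes the sample reproducible, so the slow and fast cases can be re-extracted consistently.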
Yes, thank you both for your help and attention. I will need to anonymize the data before I can give you sample data sets. That might take a while; I've moved on and am using a different method instead. But I'd like to clear this up one way or another, and will get back to this when I have a spare moment. It's a good package, thanks for making it available.
Hello @dale-wong and @harenbergsd, I am very interested in this discussion: did you eventually work out, based on the provided data, why a much smaller dataset resulted in much slower processing? Was this an implementation edge case?
I never got the data, so I'm not sure. The fact that the smaller run took longer is not generally surprising, though: they used the same support value for both runs, and if it represents a fraction of the total transactions, the smaller dataset can end up generating more frequent itemsets.
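The mechanism is easy to show with a toy example: an item bought mostly by one gender can fail a relative support threshold on the full dataset but pass it on the gender subset, where it is proportionally more frequent. This is a sketch with single items only, not mlxtend's fpgrowth, and the data is invented to illustrate the point.

```python
from collections import Counter

def frequent_items(transactions, min_support):
    """Single items whose relative support meets min_support."""
    n = len(transactions)
    counts = Counter(item for tx in transactions for item in tx)
    return {item for item, c in counts.items() if c / n >= min_support}

male = [
    {"soap", "razor"}, {"soap", "razor"}, {"soap", "razor"},
    {"soap", "razor"}, {"soap"},
]
female = [
    {"soap", "lipstick"}, {"soap", "lipstick"}, {"soap", "lipstick"},
    {"soap", "lipstick"}, {"soap"},
]
full = male + female

# Same relative min_support for both runs, as in the issue.
print(frequent_items(full, 0.5))  # razor is 4/10 -> pruned
print(frequent_items(male, 0.5))  # razor is 4/5  -> kept
```

Here the male-only subset, despite having half the transactions, yields more frequent items than the full run, and every extra frequent item feeds combinatorially into larger itemsets.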
Indeed, this sounds like a reasonable explanation of the mysterious behaviour! Thanks @harenbergsd for the clarification.
Wondering if this is reasonable or not.
The full data set comprises about 100k users' purchases over about 20k one-hot items. The subset covers about 50k male users' purchases over about 10k one-hot items. There is some crossover between male and female users' purchases, but it is in the minority.
The local code that calls mlxtend is identical for both runs.
I realize this can't really be answered without the data; I'm just wondering whether this is within the realm of expectation or whether there might be a known problem in the fpgrowth implementation.