Yeah, they are currently not the fastest. It could just be that
a) apriori is not efficient on the datasets you are using. In this regard, someone was interested in adding FP-growth, which could make it run faster on your dataset: #509
b) it could also be that multiprocessing, which is in the works (#510), may speed up apriori a little bit
c) and maybe the implementation is just bad and could be improved in terms of speed
In any case, I would really welcome improvements. However, they should be under the hood if possible and not change the interface too much.
Because of a) and b), I would suggest maybe starting by looking at the speed of frequent_patterns.
@rasbt summed it up well.
Here are some numbers I just ran on 50k transactions of kosarak.dat from http://fimi.uantwerpen.be/data/.
| min support | apriori time (s) | fpgrowth time (s) | frequent itemsets found (both) |
|-------------|------------------|-------------------|--------------------------------|
| 0.05        | 1.283            | 1.834             | 33                             |
| 0.02        | 2.628            | 1.995             | 125                            |
| 0.01        | 13.741           | 2.453             | 397                            |
| 0.005       | 119.713          | 3.076             | 1660                           |
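For anyone who wants to reproduce numbers in this ballpark, here is a minimal sketch; the file path, the 50k cutoff, and the one-hot encoding via `TransactionEncoder` are assumptions on my part, not the exact script behind the table above.

```python
import time

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

# kosarak.dat has one space-separated transaction per line
# (file path and the 50k cutoff are assumed here).
with open("kosarak.dat") as f:
    transactions = [line.split() for line in f][:50000]

# One-hot encode into the True/False DataFrame both functions expect.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

for min_support in (0.05, 0.02, 0.01, 0.005):
    for func in (apriori, fpgrowth):
        tic = time.time()
        res = func(df, min_support=min_support)
        print("%-8s support=%-5s time=%.3fs itemsets=%d"
              % (func.__name__, min_support, time.time() - tic, len(res)))
```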
One thing I have noticed is that the error handling at the beginning of the function, where it checks the values against `allowed_val`, can take a while. The numbers above don't include that time, which was around 16 seconds here. As you can see, at the higher support thresholds, that check can account for the majority of the total time.
Specifically, `np.unique(df.values.ravel())` seems to be the culprit, so figuring out a better way to do that would be worthwhile.
Wow, these numbers look great, really! That's quite an impressive improvement.
Regarding `np.unique(df.values.ravel())`, that sounds bad indeed. I think the whole block,

```python
allowed_val = {0, 1, True, False}
unique_val = np.unique(df.values.ravel())
for val in unique_val:
    if val not in allowed_val:
        s = ('The allowed values for a DataFrame'
             ' are True, False, 0, 1. Found value %s' % (val))
        raise ValueError(s)
```
is not ideal, in fact. There is probably a way to vectorize that for-loop. I guess when I wrote this, I meant to get the basic functionality down first, and I added this to avoid unexpected issues. However, this probably needs to be revised at some point.
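One possible vectorization of that loop, as a rough sketch (since `True == 1` and `False == 0` in NumPy, checking against `[0, 1]` covers all four allowed values; this is just an idea, not necessarily what should go into the library):

```python
import numpy as np

# Vectorized version of the check above (sketch): compare all unique
# values against the allowed set in one np.isin call instead of a loop.
unique_val = np.unique(df.values.ravel())
bad = unique_val[~np.isin(unique_val, [0, 1])]
if bad.size > 0:
    raise ValueError('The allowed values for a DataFrame'
                     ' are True, False, 0, 1. Found value %s' % bad[0])
```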
Yeah, that makes sense. Interestingly, at least in my test, `np.unique(df.values.ravel())` is the main thing that is slow. The loop itself was not bad, I guess because there are not a ton of unique items.
I installed mlxtend with pip and found that there is no fpgrowth in it. Then I downloaded the frequent_patterns files from this GitHub repo and copied them into the frequent_patterns folder of my local mlxtend installation, but that didn't work. When I run `frequent_itemsets = fpgrowth(df, min_support=0.02, use_colnames=True, max_len=5)`, I get the error `TypeError: 'module' object is not callable`.
@1JasonZhang the function was just added a few days ago, so it's not in the most recent release, only in the development version. To install the latest development version, you can use
```
pip install git+git://github.com/rasbt/mlxtend.git
```
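The `TypeError` above is what you would see if `mlxtend.frequent_patterns.fpgrowth` resolves to the submodule rather than the function, which can happen when files are copied in by hand without the package's updated `__init__.py`. After installing the dev version, a quick sanity check:

```python
from mlxtend.frequent_patterns import fpgrowth

# Should print <class 'function'>; <class 'module'> would reproduce
# the "'module' object is not callable" error above.
print(type(fpgrowth))
```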
@rasbt I briefly looked at the value-check block we discussed above.
There is a `df.isin` function that I thought would be perfect, but it ran into a memory error on the kosarak.dat data (your method did not).
Then I tried something like `((df == 1) | (df == 0)).all()`; however, this was way slower than your method.
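For concreteness, the `isin`-based check would look something like this (a sketch of what was tried; note that `.all()` alone returns a per-column Series, so it needs to be reduced once more):

```python
# DataFrame.isin approach - this is the variant that ran into the
# MemoryError on kosarak.dat mentioned above.
valid = df.isin([0, 1, True, False]).all().all()
```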
So, now I am thinking that what is there works quite nicely :)
Oh, interesting. Thanks for looking into that! Yeah, based on my general perception/experience, NumPy is usually a tad faster/more efficient than anything in pandas. Pandas is nice for convenience but usually comes with additional overhead. Not saying there is no better way to implement that check, but it's probably not via pandas.
Heh, you are right. NumPy is faster... way faster.
Now, I think I have an improved solution. I believe we can do:

```python
np.all((df.values == 1) | (df.values == 0))
```

`df.values` is a NumPy array, so these operations now come from NumPy rather than pandas. Massive difference.
If I use the above on 100k rows from kosarak.dat, I get 13s, compared to 65s for the current implementation.
The pandas version, which looks almost identical, `((df == 1) | (df == 0)).all()`, takes 275s.
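A sketch of the kind of side-by-side timing behind these numbers, with `df` being the one-hot DataFrame as before:

```python
import time
import numpy as np

tic = time.time()
numpy_ok = np.all((df.values == 1) | (df.values == 0))
print("numpy check:  %.1fs" % (time.time() - tic))

tic = time.time()
pandas_ok = bool(((df == 1) | (df == 0)).all().all())
print("pandas check: %.1fs" % (time.time() - tic))
```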
I can include this in the PR I will submit soon.
Awesome. That sounds like a huge improvement indeed, with such a tiny change of code :).
When you modify this in the PR, could you also add a note about it as a comment in the code (maybe just above that line) so that we/someone else don't forget and change it back at some point in the future?
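Something along these lines, as a sketch (the exact wording and placement would be up to the PR):

```python
# NOTE: keep this as a NumPy operation on df.values. The seemingly
# equivalent pandas expression ((df == 1) | (df == 0)).all() is
# dramatically slower (275s vs. 13s on 100k rows of kosarak.dat).
if not np.all((df.values == 1) | (df.values == 0)):
    raise ValueError('The allowed values for a DataFrame'
                     ' are True, False, 0, 1.')
```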
I guess we can close this now after merging #553.
@mizoma to install the current dev version from GitHub until a stable release comes out, you can use
```
pip uninstall mlxtend
pip install git+git://github.com/rasbt/mlxtend.git
```
Also, let us know if there are any questions.
The apriori function in the frequent_patterns module takes a long time to run. I would like to improve its performance.