rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Performance Improvement of apriori function in frequent_patterns #549

Closed mizoma closed 5 years ago

mizoma commented 5 years ago

The apriori function in the frequent_patterns module takes a long time to run. I would like to improve its performance.

rasbt commented 5 years ago

Yeah, they are currently not the fastest. It could be just that

a) apriori is not efficient on the datasets you are using. In this regard, someone was interested in adding FP-growth which could make it run faster on your dataset: #509

b) it could also be that multiprocessing, which is in the works (#510), may speed up apriori a little bit

c) and maybe the implementation is just bad and could be improved regarding speed

In any case, I would really welcome improvements. However, they should be under the hood if possible and not change the interface too much.

Because of a) and b), I would suggest starting by looking at the speed of frequent_patterns.

harenbergsd commented 5 years ago

@rasbt summed it up well.

509 (fpgrowth) will likely help with performance, and I am nearly done. I will probably do a PR this weekend. Also, I believe multiprocessing fpgrowth would probably be fairly easy. I may work on that next.

Here are some numbers I just ran on 50k transactions of kosarak.dat from http://fimi.uantwerpen.be/data/.

    support = 0.05
    apriori  time =   1.283
    fpgrowth time =   1.834
    freq patterns reported: apriori = 33, fpgrowth = 33

    support = 0.02
    apriori  time =   2.628
    fpgrowth time =   1.995
    freq patterns reported: apriori = 125, fpgrowth = 125

    support = 0.01
    apriori  time =  13.741
    fpgrowth time =   2.453
    freq patterns reported: apriori = 397, fpgrowth = 397

    support = 0.005
    apriori  time = 119.713
    fpgrowth time =   3.076
    freq patterns reported: apriori = 1660, fpgrowth = 1660

One thing I have noticed is that the error handling at the beginning of the function, where it checks allowed_vals, can take a while. The numbers above don't include that time, which took around 16 seconds. As you can see, on the higher support thresholds, that can be the majority of the time.

Specifically, np.unique(df.values.ravel()) seems to be the culprit, so figuring out a better way to do that would be worthwhile.

rasbt commented 5 years ago

Wow, these numbers look great, really! That's quite an impressive improvement.

Regarding np.unique(df.values.ravel()): that sounds bad indeed. I think the whole block,

    allowed_val = {0, 1, True, False}
    unique_val = np.unique(df.values.ravel())
    for val in unique_val:
        if val not in allowed_val:
            s = ('The allowed values for a DataFrame'
                 ' are True, False, 0, 1. Found value %s' % (val))
            raise ValueError(s)

is not ideal, in fact. There is probably a way to vectorize that for-loop. I guess when I wrote this, I meant to get the basic functionality down first, and I added this to avoid unexpected issues. However, this probably needs to be revised at some point.

harenbergsd commented 5 years ago

Yeah that makes sense. Interestingly, at least on my test, np.unique(df.values.ravel()) is the main thing that is slow. The loop was not bad, I guess because there are not a ton of unique items.

1JasonZhang commented 5 years ago

I installed mlxtend with pip and found that there is no fpgrowth inside. Then I downloaded the frequent_patterns files from this GitHub repo and copied them into the frequent_patterns folder of my local mlxtend install, but it didn't work. When I run

    frequent_itemsets = fpgrowth(df, min_support=0.02, use_colnames=True, max_len=5)

I get: TypeError: 'module' object is not callable

rasbt commented 5 years ago

@1JasonZhang the function was just added a few days ago, so it's not in the most recent release, only in the development version. To install the latest development version, you can use

    pip install git+git://github.com/rasbt/mlxtend.git
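As an aside, that TypeError usually means a module object is being called in place of the function it contains (for example, when an import path shadows the function name). A minimal, mlxtend-free reproduction:

```python
import types

# A module object named like the function it contains
mod = types.ModuleType("fpgrowth")

message = ""
try:
    mod()  # calling the module, not the function inside it
except TypeError as exc:
    message = str(exc)

print(message)
```

The fix is to import the function itself, e.g. from mlxtend.frequent_patterns import fpgrowth.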

harenbergsd commented 5 years ago

@rasbt I briefly looked at the value check block we discussed above.

There is a df.isin function that I thought would be perfect, but it ran into a memory error on kosarak.dat data (your method did not).

Then I tried something like ((df==1) | (df==0)).all(); however, this was way slower than your method.

So, now I am thinking that what is there works quite nicely :)

rasbt commented 5 years ago

Oh interesting. Thanks for looking into that! Yeah, based on my general perception/experience, NumPy is usually a tad faster/more efficient than anything in pandas. Pandas is nice for convenience but usually comes with additional overhead. Not saying there is no better way to implement that check, but I think it's probably not pandas.

harenbergsd commented 5 years ago

Heh, you are right. NumPy is faster... way faster.

Now, I think I have an improved solution. I believe we can do: np.all((df.values==1) | (df.values==0))

df.values is a numpy array so now these operations are from numpy rather than pandas. Massive difference.

If I use the above on 100k rows from kosarak.dat, I get 13s, compared to the current implementation, which takes 65s.

The pandas version, which looks almost identical, ((df==1) | (df==0)).all(), takes 275s.

I can include this in the PR I will submit soon.
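The three variants discussed in this thread can be put side by side; this sketch only checks that they agree on the same inputs (the relative timings quoted above came from kosarak.dat):

```python
import numpy as np
import pandas as pd

def check_loop(df):
    # Original approach: np.unique plus a Python loop
    allowed_val = {0, 1, True, False}
    for val in np.unique(df.values.ravel()):
        if val not in allowed_val:
            return False
    return True

def check_numpy(df):
    # Proposed approach: one vectorized NumPy expression
    return bool(np.all((df.values == 1) | (df.values == 0)))

def check_pandas(df):
    # The pandas variant; note .all().all() to reduce to a scalar
    return bool(((df == 1) | (df == 0)).all().all())

df_ok = pd.DataFrame({"a": [0, 1, 1], "b": [1, 0, 1]})
df_bad = pd.DataFrame({"a": [0, 2, 1], "b": [1, 0, 1]})
```

All three return the same answer; the NumPy expression simply avoids both the Python-level loop and pandas' per-column overhead.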

rasbt commented 5 years ago

Awesome. That sounds like a huge improvement indeed, with such a tiny change of code :).

When you modify this in the PR, could you also add a note about it as a comment in the code (maybe just above that line) so that we or someone else don't forget and change it back some time in the future?

rasbt commented 5 years ago

I guess we can close this now after merging #553.

@mizoma to install the current dev version from GitHub until a stable release comes out, you can use

    pip uninstall mlxtend
    pip install git+git://github.com/rasbt/mlxtend.git

Also, let us know if there are any questions.