rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Multiprocessing for apriori algorithm doesn't work as expected #510

Open littlewine opened 5 years ago

littlewine commented 5 years ago

I am trying to run the apriori algorithm (with the final goal of creating association rules) and I want to use the multiprocessing feature this library seems to offer. I only care about itemsets of max_length=2, in other words only frequent pairs of items. To enable multiprocessing, I set the n_jobs parameter to different values, benchmarked the processing time, and found that n_jobs doesn't really do anything. On a sparse DataFrame of 10000 transactions and 8545 products I benchmarked n_jobs=1, n_jobs=-1, and n_jobs=6, but in all cases the running time was between 7841 and 8041 seconds. CPU utilization doesn't increase much either; even with 6 jobs it stayed below 15% most of the time, so I think only one core was being used.
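
For reference, a minimal reconstruction of the kind of benchmark described above (the min_support value and the random stand-in data are assumptions, not from the report; n_jobs is passed as described and, per the discussion below, currently has no effect):

```python
import time
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori

# Random stand-in for the one-hot basket data described above
# (the real benchmark used 10000 transactions x 8545 products).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10000, 500)) < 0.02)

for n_jobs in (1, -1, 6):
    start = time.time()
    frequent_pairs = apriori(df, min_support=0.01, max_len=2,
                             use_colnames=True, n_jobs=n_jobs)  # n_jobs currently has no effect
    print(f"n_jobs={n_jobs}: {time.time() - start:.1f}s, {len(frequent_pairs)} itemsets")
```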

My intuition tells me that the apriori algorithm can be parallelized very efficiently, especially since the most expensive part is computing the support of the large number of item combinations; each worker could take a predefined set of pairs and start counting, as in the sketch below.
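
A minimal sketch of that intuition (illustrative names and data; this is not the mlxtend implementation): the candidate pairs are split into chunks, and each worker counts the support of its chunk against a shared one-hot matrix X.

```python
from itertools import combinations
from multiprocessing import Pool

import numpy as np

# Stand-in for the one-hot transaction matrix (rows = transactions, cols = items).
rng = np.random.default_rng(0)
X = rng.random((10000, 500)) < 0.05
MIN_SUPPORT = 0.01

def count_pairs(pairs):
    """Return (item_i, item_j, support) for every candidate pair in this chunk."""
    return [(i, j, float(np.mean(X[:, i] & X[:, j]))) for i, j in pairs]

def chunked(seq, size):
    for k in range(0, len(seq), size):
        yield seq[k:k + size]

if __name__ == "__main__":
    # Apriori pruning: only items that are frequent on their own can form frequent pairs.
    frequent_items = np.flatnonzero(X.mean(axis=0) >= MIN_SUPPORT)
    candidates = list(combinations(frequent_items, 2))

    with Pool(processes=6) as pool:
        results = pool.map(count_pairs, chunked(candidates, 5000))

    frequent_pairs = [r for chunk in results for r in chunk if r[2] >= MIN_SUPPORT]
    print(len(frequent_pairs), "frequent pairs")
```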

rasbt commented 5 years ago

I remember working on n_jobs but it seems that the online version on GitHub doesn't have multiprocessing support currently. Weird. Looks like n_jobs is not doing anything right now.

littlewine commented 5 years ago

Is there a chance that it is readily available somewhere so you could push it? Otherwise, would it be hard to re-implement? I could probably help if you guide me a bit!


rasbt commented 5 years ago

Sorry, I somehow missed following up on this. For some reason, I can't find the code for this and wonder whether I really wrote multiprocessing code for apriori at some point and didn't commit it, or whether I just remember incorrectly.

janismdhanbad commented 5 years ago

Hi @rasbt, I can take this up. I think there are several ways to implement multiprocessing. I was looking at the code, and one possible place to use it could be this loop: https://github.com/rasbt/mlxtend/blob/92753e0944f9b64bbaefebfcfe67ae03a0855fe4/mlxtend/frequent_patterns/apriori.py#L150. I was thinking that we could take c and pass it to different processes as an iterator. Is that the right approach?

rasbt commented 5 years ago

Thanks for considering this! I think we need to be a bit careful to make the parallelization efficient because of the "bottom-up subset exploration" approach that apriori takes. I.e., in apriori, we generate the itemsets by extending them one item at a time -- a given itemset can only be frequent if its subsets are frequent, which is illustrated in the figure below:

[Figure: 7-Lattice -- itemset lattice illustrating the bottom-up subset exploration]

However, this is taken care of by the generate_new_combinations(old_combinations) generator. Hence, like you said, it makes sense to parallelize the processing of the combinations for a given step (i.e., itemset size) within the

for c in combin:

for loop.

I remember that one danger with Python's multiprocessing map utilities is that they evaluate, e.g., a generator as a list, which could potentially eat up a lot of memory if the number of combinations is large.

Some time ago, I wrote a small utility for that. Maybe have a look at the lazy_map function there -- I think that could be useful here!? https://github.com/rasbt/mputil/blob/master/examples/lazy_map-lazy_imap.ipynb

I.e., the lazy_map/lazy_imap functions would be exactly for cases such as the generate_new_combinations(old_combinations) generator, where we don't want to evaluate the whole list upfront.
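
A rough sketch of that lazy-chunking idea (illustrative only, not mputil's actual lazy_map and not the mlxtend code): pull the candidates from the generator in fixed-size chunks so the full list is never materialized, and let a Pool score each chunk.

```python
from itertools import islice
from multiprocessing import Pool

def lazy_chunked_map(func, generator, pool, chunksize=10000):
    """Consume `generator` in fixed-size chunks (so the whole candidate list is
    never held in memory at once) and map `func` over each chunk with the pool."""
    results = []
    while True:
        chunk = list(islice(generator, chunksize))
        if not chunk:
            break
        results.extend(pool.map(func, chunk))
    return results

# Illustrative usage inside the apriori loop (support_of is a hypothetical worker
# that scores one candidate itemset against the one-hot matrix X):
#
#     combin = generate_new_combinations(old_combinations)
#     with Pool(processes=n_jobs) as pool:
#         supports = lazy_chunked_map(support_of, combin, pool)
```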

janismdhanbad commented 5 years ago

The generator-as-a-list issue was exactly what I was concerned about. Thanks a lot for pointing out the lazy_map function. There is one more thing: as per my understanding, I need to write a function that determines which candidates c have support greater than the minimum support. That function can then be run with Python's multiprocessing module, taking its input from lazy_map. My concern is that the variable X, which is the complete DataFrame, needs to be passed as an input to that function in all the processes. Should I make an Apriori Python class and keep X as a class variable? That way I wouldn't have to pass X to multiprocessing.Pool(). Or, if passing X doesn't hurt performance, I could pass it as a parameter as well.

rasbt commented 5 years ago

Hm, I don't think that passing X should be a big concern. I am honestly a bit fuzzy on the details, but I think it would be passed as a reference rather than copied in memory, so it wouldn't take up additional memory. But I am only 30% sure about that -- multiprocessing may make clones of everything.

In that case, it's maybe better if the function that is passed to lazy_map reads X from the outer function scope. (I've done something like that here: https://github.com/rasbt/screenlamp/blob/c0da04fa00472890880b0817bd1f61d013638c9b/tools/funcgroup_distance_to_id.py)
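
As a concrete sketch of that concern (illustrative only, not the mlxtend implementation): with the fork start method on Linux, worker processes inherit the parent's memory copy-on-write, so a module-level X is not re-sent per task; a portable alternative pattern is to hand X to each worker exactly once via the Pool initializer and let the worker function read it from a module-level global.

```python
from multiprocessing import Pool

import numpy as np

_X = None  # set once per worker process by init_worker

def init_worker(X):
    global _X
    _X = X

def itemset_support(columns):
    """Support of one candidate itemset = fraction of rows where all its columns are 1."""
    return float(_X[:, list(columns)].all(axis=1).mean())

if __name__ == "__main__":
    X = np.random.default_rng(0).random((10000, 200)) < 0.05
    candidates = [(0, 1), (0, 2), (1, 2, 3)]

    # X is shipped to each worker once (via the initializer), not with every task.
    with Pool(processes=4, initializer=init_worker, initargs=(X,)) as pool:
        supports = pool.map(itemset_support, candidates)
    print(supports)
```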

Btw, we could also put the lazy_map function into https://github.com/rasbt/mlxtend/tree/master/mlxtend/externals, as it is relatively small. This way, we also wouldn't have to add yet another installation requirement/dependency.

janismdhanbad commented 5 years ago

I applied the discussed approach, but somehow multiprocessing.Pool is running slower than the single-core code. Any ideas on that?

rasbt commented 5 years ago

In some cases that can happen when the individual work items are very quick to compute and the overhead of spawning subprocesses and shipping data to them dominates. For longer-running code, though, I've never found that multiprocessing.Pool runs slower. Maybe you are applying it to a small dataframe?
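
One quick way to check whether per-task overhead is the culprit (an illustrative micro-benchmark, not mlxtend code): time the same cheap function serially and via a Pool with different chunksize values; with tiny work items the pickling/IPC cost usually dominates, and batching narrows the gap.

```python
import time
from multiprocessing import Pool

def cheap(x):
    # Deliberately trivial work item: per-task overhead dwarfs the computation.
    return x * x

if __name__ == "__main__":
    data = list(range(100_000))

    t0 = time.perf_counter()
    [cheap(x) for x in data]
    t1 = time.perf_counter()

    with Pool(processes=4) as pool:
        t2 = time.perf_counter()
        pool.map(cheap, data, chunksize=1)      # one task per item: overhead dominates
        t3 = time.perf_counter()
        pool.map(cheap, data, chunksize=5_000)  # batched: per-task overhead amortized
        t4 = time.perf_counter()

    print(f"serial: {t1 - t0:.2f}s | pool chunksize=1: {t3 - t2:.2f}s | pool chunksize=5000: {t4 - t3:.2f}s")
```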