rasbt / pyprind

PyPrind - Python Progress Indicator Utility
BSD 3-Clause "New" or "Revised" License

pyprind with joblib #21

Open ajkl opened 8 years ago

ajkl commented 8 years ago

Is it possible to use the callback from joblib's Parallel to make it work with pyprind for parallel processing tasks?

rasbt commented 8 years ago

Hi, thanks for the suggestion/request; supporting joblib sounds like a useful feature. Personally, I haven't experimented with this combination yet.

So, I could think of two possible scenarios here:

1) Having an outer for loop that runs multiple joblib jobs iteratively and updates the bar in between, like so

pbar = ProgBar(len(x))
for _ in x:
    # do something w. joblib in parallel
    pbar.update()

which would already work I guess.

2) Tracking the progress inside joblib. Here, you are running multiple processes via joblib, each with its own for loop. The goal is to

def some_func():
    for _ in x:
        # do something
pbar = ProgBar(n)
# run multiple instances of some_func in parallel
# let all processes update the pbar

Is this what you have in mind? In theory this should be easily possible; all the processes would have to do is call the update method, I guess!? It would be nice if you had some example code that we could use to experiment a bit.

ajkl commented 8 years ago

The second option is what I was looking for, but it doesn't seem to work with your suggestion of letting all processes update the pbar:

from joblib import Parallel, delayed
import time
import pyprind
timesleep = 0.05
n=1000
bar = pyprind.ProgBar(n)
def foo(x):
    time.sleep(timesleep)
    bar.update()
    return x
Parallel(n_jobs=4, verbose=0)(delayed(foo)(i) for i in range(n))

rasbt commented 8 years ago

Hm, I think the problem is that the standard output is blocked during the computation, which is why the progress bar appears only after everything has finished. I think this is something to investigate further after "double progress bar" support has been added (see #18).

In any case, another problem is that multiprocessing creates copies of the objects that are sent to the different processes (in contrast to threading). So basically, there are four progress bars, each running from 0% to 25%, if you use four processes.

Honest question: what's the advantage of joblib over multiprocessing? I have seen it in certain libraries (e.g., scikit-learn) but never really understood why they use joblib instead of multiprocessing. E.g.,

from joblib import Parallel, delayed
import time
import pyprind

timesleep = 0.05
n = 1000
n_jobs = 4

bar = pyprind.ProgBar(n, stream=1)
def foo(x):
    time.sleep(timesleep)
    bar.update()
    return x

results = Parallel(n_jobs=n_jobs, 
                   verbose=0, 
                   backend="multiprocessing")(delayed(foo)(i) for i in range(n))

vs.

import multiprocessing as mp

pool = mp.Pool(processes=4)
# pool.map distributes the work across the pool; note that the original
# [pool.apply(foo, args=(x,)) for x in range(n)] would block on each call
# and effectively run everything serially
results = pool.map(foo, range(n))

ajkl commented 8 years ago

Well, I am kinda new to the Python ecosystem, and I recently came across joblib. I noticed scikit-learn is using it, so I kinda assumed it must solve some issues that multiprocessing might have. Honestly, I haven't evaluated the two yet. I understand that multiprocessing creates different objects, hence you always see 25% in the above example. Not sure if there is an easy solution around it. I don't want to waste your time since it is not that critical. Thanks for this great package!

rasbt commented 8 years ago

I understand that multiprocessing creates different objects, hence you always see 25% in the above example. Not sure if there is an easy solution around it.

I think there could be a way around that ... but it'll require some tweaks. By the way, if you use the "threading" backend, it should give you the 100% correctly, but the problem is still how to print to stdout while the processes are still running...
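To illustrate why the threading backend reaches 100%: threads live in one process and share memory, so every worker updates the same object instead of a private copy. The sketch below uses only the standard library, with ThreadPoolExecutor standing in for joblib's threading backend and a dict counter standing in for the ProgBar (both are my substitutions for illustration):

```python
# Sketch: with threads, all workers see the same counter object,
# unlike the per-process copies created by the multiprocessing backend.
from concurrent.futures import ThreadPoolExecutor
import threading

n = 200
lock = threading.Lock()
progress = {"count": 0}  # stand-in for a single shared ProgBar

def foo(x):
    with lock:
        progress["count"] += 1  # every thread updates the same object
    return x

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(foo, range(n)))

print(progress["count"])
```

With pyprind, calling `bar.update()` inside `foo` under a threading backend should behave the same way; what remains open is the stdout question above, i.e. whether the terminal redraws cleanly while the workers are running.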

Not sure if there is an easy solution around it. I don't want to waste your time since it is not that critical. Thanks for this great package!

Unfortunately, there are too many things on my to-do list currently. But I will leave this issue open; maybe someone has a good idea how to implement it, or maybe there will be a boring weekend for me some day ... ;)