ajkl opened this issue 8 years ago
Hi, thanks for the suggestion/request; supporting joblib sounds like a useful feature. Personally, I haven't experimented with this combo yet.
So, I can think of two possible scenarios here:
1) Having an outer for loop that runs multiple joblib processes iteratively and updates the bar, like so:

    pbar = ProgBar(len(x))
    for _ in x:
        # do something w. joblib in parallel
        pbar.update()

which would already work, I guess.
2) Tracking the progress inside joblib. Here, you are running multiple processes via joblib, where each of them has a for loop. The goal is to do something like:

    def some_func():
        for _ in x:
            # do something
            pbar.update()

    pbar = ProgBar(n)
    # run multiple instances of some_func in parallel
    # let all processes update the pbar
Is this what you have in mind? I think, in theory, this should be easily possible; all the processes would have to do is call the update
method, I guess!? It would be nice if you had some example code that we could use to experiment a bit.
The second option is what I was looking for, but it doesn't seem to work with your suggestion of letting all processes update the pbar:
    from joblib import Parallel, delayed
    import time
    import pyprind

    timesleep = 0.05
    n = 1000
    bar = pyprind.ProgBar(n)

    def foo(x):
        time.sleep(timesleep)
        bar.update()
        return x

    Parallel(n_jobs=4, verbose=0)(delayed(foo)(i) for i in range(n))
Hm, I think the problem is that the standard output is blocked during the computation, which is why the progress bar appears only after everything has finished. I think this is something to investigate further after the "double progress bar" support has been added (see #18).
In any case, another problem is that multiprocessing creates copies of the objects that are sent to the different processes (in contrast to threading). So basically, there are then four progress bars, each running from 0% to 25%, if you use 4 processes.
Honest question: what's the advantage of joblib over multiprocessing? I have seen it in certain libraries (e.g., scikit-learn) but never really understood why joblib is used instead of multiprocessing. E.g.,
    from joblib import Parallel, delayed
    import time
    import pyprind

    timesleep = 0.05
    n = 1000
    n_jobs = 4
    bar = pyprind.ProgBar(n, stream=1)

    def foo(x):
        time.sleep(timesleep)
        bar.update()
        return x

    results = Parallel(n_jobs=n_jobs,
                       verbose=0,
                       backend="multiprocessing")(delayed(foo)(i) for i in range(n))
vs.
    import multiprocessing as mp

    pool = mp.Pool(processes=2)
    results = [pool.apply(foo, args=(x,)) for x in range(n)]
Well, I am kinda new to the Python ecosystem, and I recently came across joblib. I noticed scikit-learn is using it, so I kinda assumed it must be solving some issues that multiprocessing might have. Honestly, I didn't evaluate the two yet. I understand that multiprocessing creates separate objects, hence you always see 25% in the above example. Not sure if there is an easy solution around it. I don't want to waste your time since it is not that critical. Thanks for this great package!
I understand that multiprocessing creates separate objects, hence you always see 25% in the above example. Not sure if there is an easy solution around it.
I think there could be a way around that ... but it'll require some tweaks. Btw., if you use the "threading" backend, it should give you the 100% correctly, but the problem remains how to print to stdout while the processes are still running ...
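The reason the "threading" backend reaches 100% can be sketched with the standard library: threads share the parent's objects, so every worker updates the same counter, and a lock keeps the increments safe. A plain integer stands in for `pyprind.ProgBar` here so the example runs without joblib or pyprind installed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

count = 0                  # shared by all threads, unlike with processes
lock = threading.Lock()

def work(item):
    global count
    with lock:
        count += 1  # where bar.update() would go with pyprind
    return item

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, range(100)))

print(count)
```

Because all threads see the same `count`, it ends at 100 rather than 25 per worker; the remaining issue from the thread (stdout being redrawn while workers run) is orthogonal to the counting.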
Not sure if there is an easy solution around it. I don't want to waste your time since it is not that critical. Thanks for this great package!
Unfortunately, there are too many things on my to-do list currently. But I will leave this issue open; maybe someone has a good idea how to implement it, or maybe there will be a boring weekend for me some day ... ;)
Is it possible to use the callback from joblib's Parallel to make it work with pyprind for parallel processing tasks?
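A callback is indeed one way to sidestep the copy problem, because the callback runs in the parent process, where the one true progress bar lives. joblib does not expose a public per-task callback, so the sketch below uses the standard library's `Pool.apply_async(..., callback=...)` as an analogy for what such a hook would look like; `done` stands in for `pyprind.ProgBar(n)`.

```python
from multiprocessing import Pool

done = 0  # stands in for pbar = pyprind.ProgBar(n), parent-side only

def work(item):
    return item * 2  # placeholder for the real task

def on_done(result):
    global done
    done += 1  # parent-side pbar.update(), fired once per finished task

if __name__ == "__main__":
    n = 12
    with Pool(processes=3) as pool:
        handles = [pool.apply_async(work, (i,), callback=on_done)
                   for i in range(n)]
        results = [h.get() for h in handles]
    print(done)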