Open tsvikas opened 4 years ago
Thank you for your proposal.
I believe that we want to keep joblib without additional dependencies. Unfortunately, this implies that we cannot add a dependency on tqdm.
Thank you for the reply!
Let me say that tqdm is a lean package, with no other dependencies. I believe that using packages that focus on certain aspects can improve our code readability and maintainability.
However, I can try to implement a tqdm-like progress bar without adding tqdm to the dependencies, if you find this an interesting PR.
Alternatively, I will be happy to hear if there is any advice about how to release my tqdm-joblib code, not through joblib directly.
We could make it an optional dependency, no?
Maybe we could make it an optional dependency, though I am not too excited about this.
Also, one thing to be careful about is not consuming the iterator for the progress bar. I do not know how tqdm works, but it probably needs the number of operations, which is the length of the list when the input is a list. If it expands the iterator of operations to get this number, it can blow up the memory.
tqdm and my code are well behaved in that respect and do not consume the iterator.
My code does call len(iterator), which, as far as I know, is safe for the iterator, and it handles the exception raised when len() is missing.
Apart from that, I've only used Parallel().n_dispatched_tasks and Parallel().n_completed_tasks.
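The length check described above can be sketched as follows. This is a minimal illustration rather than the code from the proposal: `infer_total` is a hypothetical helper name, and the snippet only shows that `len()` either answers without iterating or raises `TypeError`, leaving a generator untouched.

```python
def infer_total(iterable):
    """Return len(iterable) if it is sized, else None (hypothetical helper)."""
    try:
        # len() consults __len__ and never advances the iterator.
        return len(iterable)
    except TypeError:
        # Generators and other unsized iterables have no length.
        return None


tasks_list = [i * i for i in range(10)]
tasks_gen = (i * i for i in range(10))

total_from_list = infer_total(tasks_list)
total_from_gen = infer_total(tasks_gen)

# The failed len() call did not consume anything from the generator.
first_item = next(tasks_gen)
```

With a list the total is known up front; with a generator it stays unknown until dispatching finishes, which is why a separate `total_tasks` argument is useful.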
I'll try to upload a GIF to illustrate (sorry for the 80 characters width). This is the same example from your docs.
total_tasks=10 parameter, which allows a progressbar to show from the start.
total_tasks, since the code just uses the list length.
A clearer GIF, I hope.
@tsvikas: OK, it is clear that the code is not expanding the iterator when it shouldn't. Good job.
I am hesitating on that one: if the code to add tqdm as an optional dependency is light, I definitely see the value.
That said, I also worry that behavior that depends on which packages are installed may confuse users.
@joblib/committers does anybody have an opinion?
I think that if it is possible to add support with an optional dependency/extension, this would be a good idea. The output in verbose joblib would be improved with it. But I agree with Gael that this should be turned on explicitly by the user, either by installing a dedicated joblib-tqdm extension or with an argument, to avoid having different behavior based only on whether tqdm is installed or not.
From an API perspective, it is not so clear where we should do it. Having a separate class is probably not a good idea. Maybe passing some non-integer argument to verbose? The best would be something with an extension, but this requires adding some customizable callbacks, which is probably much more complicated.
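To make the "customizable callbacks" idea concrete, here is a rough sketch built only on the standard library. It is not a joblib API: `run_with_progress` and `on_progress` are hypothetical names, and a thread pool stands in for joblib's backends. The point is only that the runner reports `(done, total)` and the caller decides how to display it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_progress(func, items, on_progress=None, max_workers=2):
    """Run func over items in a thread pool, reporting progress via a callback.

    Hypothetical helper built on the standard library; not a joblib API.
    """
    # Materialise the items for simplicity; a real implementation would
    # dispatch lazily to avoid expanding a large iterator.
    items = list(items)
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(func, item): idx for idx, item in enumerate(items)}
        for done, future in enumerate(as_completed(futures), start=1):
            # Store each result at the position of its input.
            results[futures[future]] = future.result()
            if on_progress is not None:
                # The caller chooses the display: print, a tqdm bar, a log line...
                on_progress(done, len(items))
    return results


seen = []
squares = run_with_progress(
    lambda x: x * x,
    range(5),
    on_progress=lambda done, total: seen.append((done, total)),
)
```

A tqdm-based extension could then be a small callback that updates a bar, without the core library importing tqdm at all.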
I asked myself whether tqdm is better than Parallel(verbose=10)?
In both variants you get the elapsed and remaining time.
So I took a step back and thought about alternative solutions. Is it possible to use tqdm as a decorator around an iterator? isinstance(Parallel, Callable) returns True. There is an open issue related to a similar requirement: https://github.com/tqdm/tqdm/issues/583.
@tqdm(n=3)
Parallel(n_jobs=2)(delayed(wait)(sec) for sec in [4, 5, 6])
In general I like tqdm
and use it daily to track long running processes.
If the code snippet above can be made to work, it would be amazing! We would be glad to add an example to the joblib gallery to showcase it.
I think the new joblib 1.3 feature that allows returning a generator makes this integration much easier. It should be possible to just use tqdm natively.
Haven't tested it yet though...
Ok just tested it and it seems to work as expected.
Wrapping the iterator returned by joblib parallel in tqdm works without any further modification.
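import time

import joblib
from tqdm import tqdm


def workload(i):
    time.sleep(5)
    return i


n_iter = 4
# Wrap the generator returned by Parallel in tqdm; total is passed explicitly
# because the generator has no length.
parallel = joblib.Parallel(return_as="generator", n_jobs=2)
result = [
    r
    for r in tqdm(
        parallel(joblib.delayed(workload)(i) for i in range(n_iter)),
        total=n_iter,
    )
]
print(result)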
Is this something that should be added to the docs?
Yes, I think it would be nice to document this, as this feature has been requested several times. Maybe as an example?
I think that as an example is a good idea
@AKuederle: the example above, which wraps the generator returned by joblib Parallel in tqdm, wouldn't work if you have backend='multiprocessing', because return_as='generator' wouldn't be available. And sometimes you have to resort to a non-loky backend.
I think having a tqdm example would help a lot, as then I can also send the updates to my Telegram bot for big runs on the cluster. In @AKuederle's example, one simply needs to do the following:
import time

import joblib
from tqdm.contrib.telegram import tqdm

TOKEN = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
CHAT_ID = 'xxxxxxxxxxxxxxxxxxx'


def workload(i):
    time.sleep(5)
    return i


n_iter = 4
parallel = joblib.Parallel(return_as="generator", n_jobs=2)
result = [
    r
    for r in tqdm(
        parallel(joblib.delayed(workload)(i) for i in range(n_iter)),
        total=n_iter,
        token=TOKEN,
        chat_id=CHAT_ID,
    )
]
print(result)
I tested this, and it works as expected.
I don't know if there is already an implementation in joblib for sending updates to Telegram.
In case it helps someone, here is a subclass of Parallel that uses tqdm (also in gist):
import tqdm
from joblib import Parallel


class ParallelTqdm(Parallel):
    """joblib.Parallel, but with a tqdm progressbar

    Additional parameters:
    ----------------------
    total_tasks: int, default: None
        the number of expected jobs. Used in the tqdm progressbar.
        If None, try to infer from the length of the called iterator, and
        fallback to use the number of remaining items as soon as we finish
        dispatching.
        Note: use a list instead of an iterator if you want the total_tasks
        to be inferred from its length.

    desc: str, default: None
        the description used in the tqdm progressbar.

    disable_progressbar: bool, default: False
        If True, a tqdm progressbar is not used.

    show_joblib_header: bool, default: False
        If True, show joblib header before the progressbar.

    Removed parameters:
    -------------------
    verbose: not supported; passing it raises a ValueError

    Usage:
    ------
    >>> from joblib import delayed
    >>> from time import sleep
    >>> ParallelTqdm(n_jobs=-1)([delayed(sleep)(.1) for _ in range(10)])
    80%|████████ | 8/10 [00:02<00:00, 3.12tasks/s]
    """

    def __init__(
        self,
        *,
        total_tasks: int | None = None,
        desc: str | None = None,
        disable_progressbar: bool = False,
        show_joblib_header: bool = False,
        **kwargs
    ):
        if "verbose" in kwargs:
            raise ValueError(
                "verbose is not supported. "
                "Use disable_progressbar and show_joblib_header instead."
            )
        super().__init__(verbose=(1 if show_joblib_header else 0), **kwargs)
        self.total_tasks = total_tasks
        self.desc = desc
        self.disable_progressbar = disable_progressbar
        self.progress_bar: tqdm.tqdm | None = None

    def __call__(self, iterable):
        try:
            if self.total_tasks is None:
                # try to infer total_tasks from the length of the called iterator
                try:
                    self.total_tasks = len(iterable)
                except (TypeError, AttributeError):
                    pass
            # call parent function
            return super().__call__(iterable)
        finally:
            # close tqdm progress bar
            if self.progress_bar is not None:
                self.progress_bar.close()

    __call__.__doc__ = Parallel.__call__.__doc__

    def dispatch_one_batch(self, iterator):
        # start progress_bar, if not started yet.
        if self.progress_bar is None:
            self.progress_bar = tqdm.tqdm(
                desc=self.desc,
                total=self.total_tasks,
                disable=self.disable_progressbar,
                unit="tasks",
            )
        # call parent function
        return super().dispatch_one_batch(iterator)

    dispatch_one_batch.__doc__ = Parallel.dispatch_one_batch.__doc__

    def print_progress(self):
        """Display the process of the parallel execution using tqdm"""
        # if we finish dispatching, find total_tasks from the number of remaining items
        if self.total_tasks is None and self._original_iterator is None:
            self.total_tasks = self.n_dispatched_tasks
            self.progress_bar.total = self.total_tasks
            self.progress_bar.refresh()
        # update progressbar
        self.progress_bar.update(self.n_completed_tasks - self.progress_bar.n)
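The last line of print_progress converts an absolute count into an increment, since a tqdm bar's update() takes a delta while n_completed_tasks is a running total. A minimal sketch of just that step, using a hypothetical `FakeBar` stand-in instead of tqdm so it runs without any dependency:

```python
class FakeBar:
    """Minimal stand-in for a progress bar with tqdm-like `n` and `update`."""

    def __init__(self, total=None):
        self.total = total
        self.n = 0

    def update(self, increment):
        # Like tqdm's update(), this takes an increment, not an absolute position.
        self.n += increment


def sync_bar(bar, n_completed):
    """Advance the bar so that bar.n matches an absolute completed count."""
    bar.update(n_completed - bar.n)


bar = FakeBar(total=10)
# Absolute counts as a progress hook might report them, including a repeat.
for completed in (2, 2, 5, 10):
    sync_bar(bar, completed)
```

Because the delta is computed against the bar's own counter, calling the hook twice with the same count is harmless.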
I have created a subclass of joblib.Parallel that uses the 'tqdm' package for the progress bar. I find this method more comfortable: the updates are rapid, but it uses \r to avoid cluttering the screen. The code I used also finds the number of tasks at the start (if it's an iterable with len) and uses it.
I want to contribute this code somehow. Should I open a PR, or do it in any other way?
Thanks!
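For readers unfamiliar with the \r trick mentioned above, here is a small standard-library sketch of a progress line that rewrites itself in place. It is only an illustration of the mechanism, not how tqdm is implemented; `render_bar` and `show_progress` are hypothetical names.

```python
import sys
import time


def render_bar(done, total, width=20):
    """Return a one-line text bar, e.g. '[#####-----] 5/10' for width=10."""
    filled = width * done // total
    return "[" + "#" * filled + "-" * (width - filled) + f"] {done}/{total}"


def show_progress(total, delay=0.0, stream=sys.stderr):
    """Redraw a single progress line `total + 1` times, then end the line."""
    for done in range(total + 1):
        # "\r" returns the cursor to the start of the line, so each write
        # overwrites the previous bar instead of adding a new line.
        stream.write("\r" + render_bar(done, total))
        stream.flush()
        time.sleep(delay)
    stream.write("\n")


show_progress(10)
```

Frequent updates therefore cost one terminal line in total, whereas a line-per-update log grows with the number of tasks.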