joblib / joblib

Computing with Python functions.
http://joblib.readthedocs.org
BSD 3-Clause "New" or "Revised" License
3.88k stars 416 forks source link

use tqdm for progressbar #972

Open tsvikas opened 4 years ago

tsvikas commented 4 years ago

I have created a subclass of joblib.Parallel that uses the 'tqdm' package for the progress bar. I find this method more comfortable - the updates are rapid, but uses \r to avoid cluttering the screen. The code I used also finds the number of tasks at the start (if it's an iterable with len) and uses it.

I want to contribute this code somehow - should I open a PR, or do it in any other way?

Thanks!

GaelVaroquaux commented 4 years ago

Thank you for your proposal.

I believe that we want to keep joblib without additional dependencies. Unfortunately, this implies that we cannot add a dependency on tqdm.

tsvikas commented 4 years ago

Thank you for the reply!

Let me say that tqdm is a lean package, with no other dependencies. I believe that using packages that focus on certain aspects can improve our code readability and maintainability.

However, i can try to implement a tqdm-like progressbar, without adding tqdm to the dependencies, if you find this an interesting PR.

Alternatively, I will be happy to hear if there is any advice about how to release my tqdm-joblib code, not through joblib directly.

ogrisel commented 4 years ago

We could make it an optional dependency, no?

GaelVaroquaux commented 4 years ago

Maybe we could make it an optional dependency, though I am not too excited about this.

Also, one thing to be careful would be not to consume the iterator for the scrollbar. I do not know how tqdm works, but it probably needs the number of operations, seen as the length of the list when it's a list. If, to get this, it expands the iterator of operations, it can blow up the memory.

tsvikas commented 4 years ago

tqdm and my code are well written in that manner, and does not consume the iterator. my code does call len(iterator), which, as far as I know, is safe for the iterator, and my code handles the exceptions from a missing len(). Apart from that, I've only used Parallel().n_dispatched_tasks and Parallel().n_completed_tasks.

I'll try to upload a GIF to illustrate

tsvikas commented 4 years ago

render1576430473023 (sorry for the 80 characters width)

this is the same example from your docs.

tsvikas commented 4 years ago

render1576492216829 A clearer GIF, I hope.

GaelVaroquaux commented 4 years ago

@tsvikas : ok, it is clear that the code is not expending the iterator when it shouldn't. Good job.

I am hesitating on that one: if the code to add tqdm as an optional dependency is light, I definitely see the value.

That said, I also worry that such a behavior that depends on packages installed may confuse users.

@joblib/committers anybody has an opinion?

tomMoral commented 4 years ago

I think that if it is possible to add support with optional dependency/extension, this would be a good idea. The output in verbose joblib would be improved with it. But I agree with Gael that this should be turned on explicitly by the user, either by installing a dedicated joblib-tqdm extension or with an argument, to avoid having different behavior only based on wether tqdm is installed or not.

From an API perspective, this is not so clear where we should do it. Having a separate class is probably not a good idea. Maybe passing some non integer argument to verbose? The best would be something with an extension but this requires adding some customizable callbacks which is probably much more complicated.

nok commented 4 years ago

I asked myself whether tqdm is better than Parallel(verbose=10)? In both variants you get the elapsed and remaining time.

So I made a step back and thought about alternative solutions. Is it possible that we use tdqm as an decorator to an iterator? isinstance(Parallel, Callable) returns True. There is an open issue related to a similar requirement: https://github.com/tqdm/tqdm/issues/583.

@tdqm(n=3)
Parallel(n_jobs=2)(delayed(wait)(sec) for sec in [4, 5, 6])

In general I like tqdm and use it daily to track long running processes.

GaelVaroquaux commented 4 years ago

If the code snippet above can be made to work, it would be amazing! We would be glad to add an example to the joblib gallery to showcase it.

AKuederle commented 1 year ago

I think the new joblib 1.3 feature that allows to return a generator, makes this integration much easier. I should be possible to just use tqdm natively.

Haven't tested it yet though...

AKuederle commented 1 year ago

Ok just tested it and it seems to work as expected.

Wrapping the iterator returned by joblib parallel in tqdm works without any further modification.

import joblib
from tqdm import tqdm
import time

def workload(i):
    time.sleep(5)
    return i

n_iter = 4

result = [r for r in tqdm(joblib.Parallel(return_as="generator", n_jobs=2)(joblib.delayed(workload)(i) for i in range(n_iter)), total=n_iter)]
print(result)

Is this something that should be added to the docu?

tomMoral commented 1 year ago

Yes, I think it would be nice to document this as this feature as been requested several times. Maybe as an example?

GaelVaroquaux commented 1 year ago

Maybe as an example?

I think that as an example is a good idea

aldanor commented 1 year ago

@AKuederle

Ok just tested it and it seems to work as expected.

Wrapping the iterator returned by joblib parallel in tqdm works without any further modification.

This example wouldn't work if you have backend='multiprocessing' because return_as='generator' wouldn't be available. And sometimes, you have to resort to non-loky backend.

Bhavesh012 commented 10 months ago

I think having a tqdm example would help a lot, as then I can also send the updates to my Telegram bot for big runs on the cluster. In @AKuederle's example, one just simply needs to do the following:

import joblib
from tqdm.contrib.telegram import tqdm
import time

TOKEN = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
CHAT_ID = 'xxxxxxxxxxxxxxxxxxx'

def workload(i):
    time.sleep(5)
    return i

n_iter = 4

result = [r for r in tqdm(joblib.Parallel(return_as="generator", n_jobs=2)(joblib.delayed(workload)(i) for i in range(n_iter)), total=n_iter, token=TOKEN, chat_id=CHAT_ID)]
print(result)

I tested this, and it works as expected.

I don't know if there is already an implementation in joblib for sending updates to Telegram.

tsvikas commented 10 months ago

in case it will help someone - here is a subclass of Parallel that uses tqdm: (also in gist)

import tqdm
from joblib import Parallel

class ParallelTqdm(Parallel):
    """joblib.Parallel, but with a tqdm progressbar

    Additional parameters:
    ----------------------
    total_tasks: int, default: None
        the number of expected jobs. Used in the tqdm progressbar.
        If None, try to infer from the length of the called iterator, and
        fallback to use the number of remaining items as soon as we finish
        dispatching.
        Note: use a list instead of an iterator if you want the total_tasks
        to be inferred from its length.

    desc: str, default: None
        the description used in the tqdm progressbar.

    disable_progressbar: bool, default: False
        If True, a tqdm progressbar is not used.

    show_joblib_header: bool, default: False
        If True, show joblib header before the progressbar.

    Removed parameters:
    -------------------
    verbose: will be ignored

    Usage:
    ------
    >>> from joblib import delayed
    >>> from time import sleep
    >>> ParallelTqdm(n_jobs=-1)([delayed(sleep)(.1) for _ in range(10)])
    80%|████████  | 8/10 [00:02<00:00,  3.12tasks/s]

    """

    def __init__(
        self,
        *,
        total_tasks: int | None = None,
        desc: str | None = None,
        disable_progressbar: bool = False,
        show_joblib_header: bool = False,
        **kwargs
    ):
        if "verbose" in kwargs:
            raise ValueError(
                "verbose is not supported. "
                "Use show_progressbar and show_joblib_header instead."
            )
        super().__init__(verbose=(1 if show_joblib_header else 0), **kwargs)
        self.total_tasks = total_tasks
        self.desc = desc
        self.disable_progressbar = disable_progressbar
        self.progress_bar: tqdm.tqdm | None = None

    def __call__(self, iterable):
        try:
            if self.total_tasks is None:
                # try to infer total_tasks from the length of the called iterator
                try:
                    self.total_tasks = len(iterable)
                except (TypeError, AttributeError):
                    pass
            # call parent function
            return super().__call__(iterable)
        finally:
            # close tqdm progress bar
            if self.progress_bar is not None:
                self.progress_bar.close()

    __call__.__doc__ = Parallel.__call__.__doc__

    def dispatch_one_batch(self, iterator):
        # start progress_bar, if not started yet.
        if self.progress_bar is None:
            self.progress_bar = tqdm.tqdm(
                desc=self.desc,
                total=self.total_tasks,
                disable=self.disable_progressbar,
                unit="tasks",
            )
        # call parent function
        return super().dispatch_one_batch(iterator)

    dispatch_one_batch.__doc__ = Parallel.dispatch_one_batch.__doc__

    def print_progress(self):
        """Display the process of the parallel execution using tqdm"""
        # if we finish dispatching, find total_tasks from the number of remaining items
        if self.total_tasks is None and self._original_iterator is None:
            self.total_tasks = self.n_dispatched_tasks
            self.progress_bar.total = self.total_tasks
            self.progress_bar.refresh()
        # update progressbar
        self.progress_bar.update(self.n_completed_tasks - self.progress_bar.n)