ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Data] Too many progress bars when using Data with Train #47735

Open vladjohnson opened 2 months ago

vladjohnson commented 2 months ago

What happened + What you expected to happen

[Screenshot: notebook output flooded with duplicated tqdm progress bars]

Hey guys, looking for a way to fix this mess... tqdm is creating a bunch of progress bars, and the logs keep growing my notebook to a massive size. I've tried setting RAY_DATA_DISABLE_PROGRESS_BARS=1, but that did not help. How do I either turn off the progress bars or, ideally, make them work as they are supposed to (one single progress bar)?

Thanks

Versions / Dependencies

Ray version: 2.35.0
Python version: 3.11.9
OS: Ubuntu 20.04

Reproduction script

import ray
import ray.train
from ray.train.torch import TorchTrainer

# demo_train_loop_per_worker and train_ds are defined elsewhere in the notebook
trainer = TorchTrainer(
    demo_train_loop_per_worker,
    train_loop_config={
        "experiment_name": "demo_experiment",
        "tracking_uri": "file:~/.cache/mlruns",
        "train_batch_size": 1000,
        "num_epochs": 100,
    },
    datasets={
        "train": train_ds,
    },
    scaling_config=ray.train.ScalingConfig(
        num_workers=1,
        use_gpu=True,
    ),
)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

scottjlee commented 2 months ago

Thanks for reporting the issue; this behavior is definitely not expected.

Setting RAY_DATA_DISABLE_PROGRESS_BARS=1 should definitely disable the progress bars. My first thought is that you should pass this env var into the Ray runtime environment: for example, if you are using ray.init(), you can pass it via env_vars (see the docs). This ensures all workers receive the env var, and progress bars are disabled properly.
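Roughly, something like the following (a minimal sketch; adapt it to however you currently call ray.init()):

import ray

ray.init(
    runtime_env={
        # Propagates the env var to all Ray workers, not just the driver process
        "env_vars": {"RAY_DATA_DISABLE_PROGRESS_BARS": "1"},
    }
)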

You can also explicitly set the variable in DataContext:

ctx = ray.data.DataContext.get_current()
ctx.enable_progress_bars = False

If the above doesn't work, I also have a few other temporary fixes to suggest:

(1)

ctx = ray.data.DataContext.get_current()
ctx.use_ray_tqdm = False

This disables the special tqdm implementation for distributed settings, which Ray Data uses to manage progress bars across multiple workers.

(2) Another temporary workaround that might work:

ctx.enable_operator_progress_bars = False

This disables operator-level progress bars, so that only the top-level global progress bar is shown. Although it won't resolve the issue completely, it should at least reduce the output spam.
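Putting the DataContext options together, here is a minimal sketch (assuming you set them on the driver before building the dataset and trainer; you likely only need a subset of these, depending on which one resolves the issue for you):

import ray.data

ctx = ray.data.DataContext.get_current()
ctx.enable_progress_bars = False           # turn off Ray Data progress bars entirely
ctx.use_ray_tqdm = False                   # fall back to plain tqdm instead of the distributed implementation
ctx.enable_operator_progress_bars = False  # hide per-operator bars, keep only the global one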

vladjohnson commented 2 months ago

Thank you so much, @scottjlee! Highly appreciated