Ray Tune: No console output is logged to Wandb.

ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

https://ray.io

Apache License 2.0

32.32k stars 5.5k forks source link

Ray Tune: No console output is logged to Wandb. #25011

Open dirkweissenborn opened 2 years ago

dirkweissenborn commented 2 years ago

What happened + What you expected to happen

I am trying to run ray.tune jobs with wandb for experiment logging and was expecting to find the console output logs (logged using python's standard logging module) of the individual jobs in wandb. Curiously these console outputs appear on wandb when running a job through ray.remote but not when using ray.tune.

Note, that we initialize wandb directly in the remote function. Metrics are written as normal.

Is the console output somehow redirected when running through tune such that it is not caught by wandb? Any advice on how to fix that?

Versions / Dependencies

1.11

Reproduction script

None.

Issue Severity

No response

krfricke commented 2 years ago

Generally Ray Tune also just runs remote actor tasks, so the behavior should be the same. Can you share a quick reproduction script so we can look into this more?

nikitavemuri commented 1 year ago

It looks like output printed to W&B from Ray tasks does correctly show up in the W&B log tab, but not from Ray actors.

Example Ray task with output that does show up in W&B log tab:

@ray.remote
def wandb_init(): 
    import wandb 
    import sys 

    wandb.init() 

    print("Inside print")

    sys.stdout.write("Inside stdout") 
    sys.stdout.flush()

ray.init()
ray.get(wandb_init.remote())

Example Ray Actor task with output that does not show up in W&B log tab:

@ray.remote
class MyActor:
    def __init__(self):
        import wandb 
        wandb.init()

    def do_stuff(self):
        print("Inside print")
        import sys 
        sys.stdout.write("Inside stdout") 
        sys.stdout.flush()

actor = MyActor.remote()

ray.get(actor.do_stuff.remote())

I think we are seeing this issue because Ray Tune trials are run as Actor tasks

rickyyx commented 1 year ago

Met with @nikitavemuri offline - and had a minimal example with 2.0.0rc0 where printing from actor gets picked up in the logs. Seems that there might be something in the tune library that configures the wandb differently than the simple example above.

nikitavemuri commented 1 year ago

This may actually be due to the trainable (or other user code) being run in a different actor from the _WandbLoggingProcess, which calls wandb.init() (@krfricke can you confirm this?). If a print statement is added after wandb.init() is called in https://github.com/ray-project/ray/blob/e5c3f1cf3a7d582eb0671f194667f94b282d30d8/python/ray/air/callbacks/wandb.py#L173, it does show up in the logs tab of the W&B run when using WandbLoggerCallback.

nikitavemuri commented 1 year ago

Hi @dirkweissenborn, looks like the console output from the Tune Trial currently doesn't show up in the W&B logs tab because wandb.init() is called in a separate logging process if using the WandbLoggerCallback. An alternative is to use the wandb_mixin, which initializes W&B before running the training function.

bllchmbrs commented 1 year ago

Log Level might be a good thing to control log volume, since this could incur network cost. Also, having a toggle / configuration in the WandbLoggerCallback to configure whether logs should be exported to W&B.

dirkweissenborn commented 1 year ago

I wasn't using either the mixin nor the wandbloggercallback. Instead we just used wandb directly, initializing it directly in the remote function.

krfricke commented 1 year ago

@nikitavemuri can you share the reproduction example that does report the logs when used in actors? In your example above the remote task logs to wandb for me but the actor does not.

If the mixin is not used, Tune does not set any wandb-specific settings.

Tune does optionally allow to redirect output to a file, but this does not happen per default, and it will still also output to sys.stdout / sys.stderr.

krfricke commented 1 year ago

I've managed to confirm that actors do indeed log to wandb.

I think the main thing that might be happening here is that the actors are destroyed before the wandb background task synchronizes everything up. Wandb collects logs, metrics, and artifacts in local files and uploads them asynchronously.

For me the difference was the wandb.finish() call. Without it, I wouldn't see logs in tune. With it, I see them consistently.

This script shows logs in all four runs:

def train_fn(config):
    import wandb
    wandb.init()

    print("Inside print")
    import sys
    sys.stdout.write("Inside stdout")
    sys.stdout.flush()

    wandb.finish()  # important!

    return 4

from ray import tune

tune.run(
    train_fn,
    num_samples=4
)

@dirkweissenborn can you add wandb.finish() and let me know if this solves your problem?

dirkweissenborn commented 1 year ago

This shouldn't be the issue in our case, as the jobs are running for quite a long time. Anyway, we actually do call wandb.finish. We are using python logging though, could that make a difference?

scottsun94 commented 1 year ago

@dirkweissenborn It seems that this issue is not fixed yet?

rickyyx commented 1 year ago

I think we have it working? @nikitavemuri

dirkweissenborn commented 1 year ago

I still don't see logger output when running with tune in ray 2.0.

On Wed, Oct 26, 2022, 00:15 Ricky Xu @.***> wrote:

I think we have it working? @nikitavemuri https://github.com/nikitavemuri

— Reply to this email directly, view it on GitHub https://github.com/ray-project/ray/issues/25011#issuecomment-1291199831, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKTUG4X5R2YPMZ7IWHU4ADWFBLYRANCNFSM5WOQLEFQ . You are receiving this because you were mentioned.Message ID: @.***>

scottsun94 commented 1 year ago

@krfricke Are you able to see the logs in W&B when using python logger?

This shouldn't be the issue in our case, as the jobs are running for quite a long time. Anyway, we actually do call wandb.finish. We are using python logging though, could that make a difference?