ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] Tensorboard iterations do not start at zero #30471

Closed · simonsays1980 closed this 1 year ago

simonsays1980 commented 1 year ago

What happened + What you expected to happen

What happened

When running RLlib experiments with Tune, the iterations shown in TensorBoard start at the timestep at which the first iteration finished instead of at zero.

What I expected to happen

I expected TensorBoard to start at timestep 0 with the rewards from the initial random policy.

Versions / Dependencies

Fedora 36 Python 3.9.0 Ray 2.0.1

Reproduction script

import ray
from ray import air, tune

ray.init()
tune.Tuner(
    "PPO",
    run_config=air.RunConfig(stop={"episode_reward_mean": 10}),
    param_space={
        "env": "CartPole-v0",
        "num_gpus": 0,
        "num_workers": 1,
        "lr": tune.grid_search([0.01, 0.001, 0.0001]),
    },
).fit()

Opening TensorBoard then shows that the plots start at the first training iteration (at around 4000 timesteps), not at zero.

Issue Severity

Low: It annoys or frustrates me.

sven1977 commented 1 year ago

Hey @simonsays1980 , thanks for filing this issue. I get your way of thinking about this problem. The x-axis (timesteps) should start at 0. The answer is quite simple, though, I think: there are no results to be "tensorboard'd" at ts=0, neither training results (no model updates yet) nor sampling results (no train batch collected yet). So the first data is only available AFTER the first parallel sampling call on all rollout workers, but at that point we have already set our timestep counter to 4000 (or whatever train_batch_size is). In other words:

- ts=4000 -> rewards are reported from the randomly initialized, not-yet-updated policy; this is the policy used to collect the first 4000 timesteps from the environment to form the 1st train batch.
- ts=8000 -> rewards are reported from the policy that has received one update so far; this is the policy used to collect timesteps 4000-8000 from the environment to form the 2nd train batch.
- etc.
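
A minimal sketch of this counting (assuming PPO's default train_batch_size=4000 and the Ray 2.0-style PPOConfig API): the very first reported result already carries timesteps_total == 4000, even though its rewards come from the still-random policy.

from ray.rllib.algorithms.ppo import PPOConfig

algo = (
    PPOConfig()
    .environment("CartPole-v0")
    .rollouts(num_rollout_workers=1)
    .build()
)

for num_updates in range(3):
    result = algo.train()
    # This result was sampled with a policy that had received `num_updates`
    # updates so far, but its x-axis value is already
    # (num_updates + 1) * train_batch_size.
    print(result["timesteps_total"], result["episode_reward_mean"])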

simonsays1980 commented 1 year ago

> Hey @simonsays1980 , thanks for filing this issue. I get your way of thinking about this problem. The x-axis (timesteps) should start at 0. [...]

Hi @sven1977 , thanks for coming back to this and for the elaborate answer. It is logical and makes full sense when considering how RLlib works internally. It is also clear now that the results at ts=4000 actually come from the "random" initial policy, before any training update has happened. Since we are talking about TensorBoard here, I cannot resist asking :): is there a callback that tracks gradients and/or gradient clipping? That would be quite helpful for debugging. Are gradients always available at the same place for all algos (i.e., ready to be grabbed at a certain point by a stats_fn or a callback)?

sven1977 commented 1 year ago

Hey @simonsays1980 , glad to help here :) In order to actually fix this (reporting random performance at t=0), we would have to separate the learning stats from the sampling stats, which we might actually do at some point.

On the callbacks question: Yes, there should be the on_train_result callback, which looks like this:

    def on_train_result(
        self,
        *,
        algorithm: "Algorithm",
        result: dict,
        **kwargs,
    ) -> None:
        """Called at the end of Algorithm.train().

        Args:
            algorithm: Current Algorithm instance.
            result: Dict of results returned from Algorithm.train() call.
                You can mutate this object to add additional metrics.
            kwargs: Forward compatibility placeholder.
        """
        pass

In the result arg, you should be able to find the return values of your algo's stats_fn. You are right that each algo publishes somewhat different things from its stats_fn; we should polish this a little to get at least some common structure into these results.
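
For illustration, a minimal sketch of such a callback that copies one learner stat into a top-level key of result so it shows up in TensorBoard; the nesting under info/learner/default_policy/learner_stats is an assumption based on the usual single-agent result layout.

from ray.rllib.algorithms.callbacks import DefaultCallbacks


class MyMetricsCallbacks(DefaultCallbacks):
    def on_train_result(self, *, algorithm, result, **kwargs):
        # Dig out the per-policy learner stats (path assumed, see above).
        stats = (
            result.get("info", {})
            .get("learner", {})
            .get("default_policy", {})
            .get("learner_stats", {})
        )
        # Any key added to `result` gets logged by Tune like the built-in metrics.
        if "policy_loss" in stats:
            result["abs_policy_loss"] = abs(stats["policy_loss"])

It could then be plugged in via the algorithm config, e.g. "callbacks": MyMetricsCallbacks in the param_space of the reproduction script above.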

Hope this helps. At some point, I would like to clean up the entire metrics handling:

simonsays1980 commented 1 year ago

Hi @sven1977 , thanks for the detailed answer. I appreciate the ideas for the future of the metrics. Unifying metrics across algorithms makes a lot of sense (at least in the form of objects, since different algorithms of course need different metrics).

I think there was already an idea of separating sampling and training metrics, but it was not implemented to the full extent needed (this is how I understood issue #24731).

I have looked into the on_train_result() callback and the result dictionary passed to it. These are basically the metrics shown in TensorBoard. If we want to enrich this dictionary with, for example, gradient values, we somehow need access to them, but grads_and_vars is not in the result. Instead, it lives in the policies, and in EagerTFPolicyV2 it is not a class attribute, so this information is not accessible from the callback.
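
One workaround I could imagine (a sketch only, under the assumptions that on_learn_on_batch() receives the postprocessed train batch and that compute_gradients() returns a (grads, info) tuple) is to recompute the gradients in a callback purely for debugging:

import numpy as np

from ray.rllib.algorithms.callbacks import DefaultCallbacks


class GradientDebugCallbacks(DefaultCallbacks):
    def on_learn_on_batch(self, *, policy, train_batch, result, **kwargs):
        # Recomputing gradients roughly doubles the cost of each update,
        # so this is only meant for short debugging runs.
        grads, _ = policy.compute_gradients(train_batch)
        # Depending on the policy, entries may be plain tensors or (grad, var) pairs.
        flat = [g[0] if isinstance(g, tuple) else g for g in grads]
        norms = [float(np.linalg.norm(np.asarray(g))) for g in flat if g is not None]
        if norms and max(norms) > 1000.0:
            # Dump the norms of the offending update for offline inspection.
            np.save("bad_grad_norms.npy", np.array(norms))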

I am thinking about good debug methods for cases like the following:

[TensorBoard graph: ray_tune_info_learner_default_policy_learner_stats_policy_loss]

If such situations happen, we want to investigate the cause further. In RLlib this is still quite a hassle. Part of it is due to Ray running distributed, but we still have the local_mode=True parameter we can use for this (even though it changes the trajectory of events a little). Another part is caused by TensorFlow making code harder to debug, but eager TF2 can be used (again, this changes events and execution a little); see the sketch below.
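
For reference, a small sketch of my own combining both switches, using the same Tuner API as in the reproduction script above:

import ray
from ray import air, tune

# Single-process mode: breakpoints and stack traces behave like plain Python
# (timing/ordering of events will differ from a distributed run).
ray.init(local_mode=True)

tune.Tuner(
    "PPO",
    run_config=air.RunConfig(stop={"training_iteration": 2}),
    param_space={
        "env": "CartPole-v0",
        "framework": "tf2",      # eager TF2 instead of static graphs
        "eager_tracing": False,  # no tf.function tracing, so step-through debugging works
        "num_workers": 0,        # sample in the trainer process itself
    },
).fit()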

There remains a part that is fully in our hands, though: how to expose certain variables so that the user can investigate what exactly happened at the point of error (this is actually a large part of a user's work). To access such variables at a specific point (in the example above, after 4 million timesteps!), we need callbacks that can fire as soon as some values exceed a limit and then write out values or break into a debugger; a sketch of such a callback follows below.
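
A hypothetical sketch of such a callback, reusing the (assumed) learner_stats lookup from the earlier sketch: dump the stats and/or break into the debugger as soon as the policy loss crosses a threshold.

import json

from ray.rllib.algorithms.callbacks import DefaultCallbacks


class LossWatchdogCallbacks(DefaultCallbacks):
    def on_train_result(self, *, algorithm, result, **kwargs):
        stats = (
            result.get("info", {})
            .get("learner", {})
            .get("default_policy", {})
            .get("learner_stats", {})
        )
        policy_loss = stats.get("policy_loss")
        if policy_loss is not None and abs(policy_loss) > 1000.0:
            # Persist the learner stats of the offending iteration ...
            fname = f"bad_result_iter_{result['training_iteration']}.json"
            with open(fname, "w") as f:
                json.dump(stats, f, default=str)
            # ... and/or break here for interactive inspection.
            breakpoint()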

I do not have a detailed enough overview of the code to see where the general access points for such callbacks would be, i.e. places with access to the most needed variables (intuitively somewhere near the loss calculation that is used for the update). Maybe good practices for debugging such situations already exist inside the RLlib team. My own method is to write code directly into the RLlib files (like eager_tf_policy_v2.py) with conditions like

import numpy as np

# Inside eager_tf_policy_v2.py, right after the loss tensor is computed:
if np.any(loss.numpy() > 1000.0):
    np.save("loss.npy", loss.numpy())  # dump for offline inspection
    ...

to get some insights. Using

import ray
breakpoint()  # inside a Ray worker, this drops into the Ray debugger (`ray debug`)

can help at some points, but it does not execute inside the if-conditions shown above. Furthermore, the Ray debugger can be tricky, as some of you might have found out the hard way :) My VSCode debugger errors out when setting conditional breakpoints (also because numpy needs to be loaded for the conditional breakpoint to evaluate loss.numpy()).

I am still not at a point where I am content with the available options for debugging situations like the one shown in the TensorBoard graph above. Getting to the cause of such an error can still take several days, which in many business settings is quite expensive.

simonsays1980 commented 1 year ago

@sven1977 I will close this now, as the original question was answered in the first comment and the rest of the discussion is followed up on in the linked issues/PRs.