ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Introduce dynamic time-limits into training #12490

Closed roireshef closed 3 years ago

roireshef commented 3 years ago

Describe your feature request

Some environments can run for very long periods, but letting them do so produces overly long episodes that bias training. Instead, one would often like to artificially limit episode length by marking the episode as Done while still bootstrapping from the value function to approximate the remaining return. This can't happen inside the environment implementation, since the environment has no access to the policy and cannot sample its value approximation.

See https://arxiv.org/pdf/1712.00378.pdf for a more detailed motivation.

At the moment, the only flexibility that exists is the returned done signal, which is binary. In the time-out case, done should definitely be True, but that means the episode terminates with only the reward signal returned by the environment, which is insufficient (value bootstrapping is advised).

I've previously implemented and tested this, and it trains successfully with A3C (without GAE): the environment adds a timeout signal to the info dictionary it returns, and the policy's utility functions capture it to compute the appropriate advantage.
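
For illustration, here is a minimal sketch of the environment side (the wrapper name, the step limit, and the "timeout" info key are made up here; any equivalent signal would do):

    import gym

    class TimeoutSignalingEnv(gym.Wrapper):
        """Illustrative wrapper: cuts episodes off artificially, but flags the
        cut-off as a timeout in the info dict so the policy can bootstrap from
        its value function instead of treating the last reward as terminal."""

        def __init__(self, env, max_episode_steps=500):
            super().__init__(env)
            self.max_episode_steps = max_episode_steps
            self._steps = 0

        def reset(self, **kwargs):
            self._steps = 0
            return self.env.reset(**kwargs)

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self._steps += 1
            if not done and self._steps >= self.max_episode_steps:
                # Artificial termination: mark the episode as done, but tell
                # the policy (via the info dict) that this is only a timeout.
                done = True
                info["timeout"] = True
            return obs, reward, done, info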

Finally, I'm wondering whether this is already better implemented in some other way. If not, I suggest adding it as a feature to RLlib. I could open a PR for that, but I'll wait for some feedback first to understand whether this is the suggested practice or not.

ericl commented 3 years ago

Have you tried using the env settings?

    # === Environment Settings ===
    # Number of steps after which the episode is forced to terminate. Defaults
    # to `env.spec.max_episode_steps` (if present) for Gym envs.
    "horizon": None,
    # Calculate rewards but don't reset the environment when the horizon is
    # hit. This allows value estimation and RNN state to span across logical
    # episodes denoted by horizon. This only has an effect if horizon != inf.
    "soft_horizon": False,
    # Don't set 'done' at the end of the episode. Note that you still need to
    # set this if soft_horizon=True, unless your env is actually running
    # forever without returning done=True.
    "no_done_at_end": False,

It sounds like setting horizon=N and soft_horizon=True might do the trick. I think you don't need no_done_at_end, unless you really never want to see a done.
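
For example, something like this (a sketch; the PPOTrainer import path matches RLlib ~1.0, and the env name and horizon value are just placeholders):

    import ray
    from ray.rllib.agents.ppo import PPOTrainer

    ray.init()

    config = {
        "env": "CartPole-v0",
        # Force a logical episode boundary every 200 steps...
        "horizon": 200,
        # ...but don't reset the env there, so value estimation and RNN state
        # span across the logical episodes.
        "soft_horizon": True,
    }
    trainer = PPOTrainer(config=config)
    result = trainer.train()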

roireshef commented 3 years ago

@ericl actually I think I'm describing a different use case than the one your proposal addresses. In my use case the episode doesn't only terminate after N steps; more importantly, it terminates dynamically when certain conditions are met. Just to illustrate, consider an autonomous-vehicle (AV) agent whose episodes can end in the following ways:

1. The agent reaches a true terminal state of the underlying task.
2. The agent reaches another true terminal state (e.g., some failure condition).
3. The episode is cut off because a fixed step horizon is reached.
4. The episode is cut off because a dynamic, state-dependent condition is met (e.g., the agent has been at a full stop for a long time).

In 1-2 the simple API works, and the intermediate reward returned by the Env implementation is the correct one to use for loss calculations (this is strictly the end of the episode). In 3-4, however, the termination is due to "artificial/technical" reasons, and you might want to bootstrap the value function when computing your loss, because the original problem is (let's assume for a second) an infinite-horizon one. The motivation: in reality the AV agent is allowed to come to a full stop, but during training you wouldn't want to bias your data so that a high percentage of the experiences are of a full stop (same state, same action); rather, you'd like a balance across the cases in your training data.

Case 3 - the fixed horizon being reached - is basically already handled by the current API out of the box, but there is no easy option to implement case 4. In previous Ray versions I used the info dictionary to carry an extra binary signal indicating whether to bootstrap the value function or not. That worked great and allowed teaching the agent's policy to brake for long periods when that was the optimal decision, without creating the bias in the training data. But this was prior to the new trajectory view API, which saves time by removing communication boilerplate, specifically by not sending the info dict to the loss functions. Now it looks more involved to implement, requiring changes to Ray code rather than just the user's custom code.
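
Roughly, the postprocessing side of that old approach looked like the sketch below (the "timeout" info key and the policy._value helper are just placeholders here, the value helper's arguments are algorithm-specific, and the exact compute_advantages signature can differ between RLlib versions):

    from ray.rllib.evaluation.postprocessing import compute_advantages
    from ray.rllib.policy.sample_batch import SampleBatch


    def postprocess_with_timeout_bootstrap(policy, sample_batch,
                                           other_agent_batches=None,
                                           episode=None):
        """Bootstrap the value function when the episode was cut off
        artificially (info["timeout"] == True); otherwise treat the last
        reward as truly terminal."""
        last_info = sample_batch[SampleBatch.INFOS][-1]
        last_done = sample_batch[SampleBatch.DONES][-1]

        if last_done and not last_info.get("timeout", False):
            # True terminal state: no bootstrapping needed.
            last_r = 0.0
        else:
            # Timeout (or truncated rollout): bootstrap from the value
            # estimate of the last observation. The exact arguments of the
            # policy's value helper depend on the algorithm being used.
            last_r = policy._value(sample_batch[SampleBatch.NEXT_OBS][-1])

        return compute_advantages(
            sample_batch,
            last_r,
            gamma=policy.config["gamma"],
            lambda_=policy.config["lambda"],
            use_gae=policy.config.get("use_gae", False))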

I hope it makes more sense now. Anyway, I was wondering if this is something you ever thought of, and whether I'm missing an already existing solution...?

ericl commented 3 years ago

Hmm if this is about a regression in the traj view API, I think that would be a bug. @sven1977 do you know why the info dict is no longer sent automatically?

sven1977 commented 3 years ago

Yes, there is a known bug related to the trajectory view API in 1.0.1, described here: https://github.com/ray-project/ray/issues/12509. The workaround is to set "_use_trajectory_view_api" to False in your Trainer's config (and you should see the info dict again). This is fixed in all versions >1.0.1.
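
I.e., something like this in your config (the rest of the config stays as it is):

    config = {
        # ... your existing Trainer config ...
        # Workaround for the 1.0.1 bug: fall back to the old sample-collection
        # path so the info dict reaches postprocessing and the loss again.
        "_use_trajectory_view_api": False,
    }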

I'll add this observation (missing info dict) to the issue.

roireshef commented 3 years ago

@sven1977 @ericl - Is there any documentation on how to use/customize the trajectory view API? I'd like to pass this flag from my environment via the info dict it returns, so that _addadvantages has access to it. It's actually independent of the algorithm being used.

I understand the easiest way would be to disable the trajectory view API, but AFAIK it's better to keep using it to achieve better runtimes.

I'd appreciate your guidance on how to approach this.

sven1977 commented 3 years ago

Hey @roireshef, sorry for the delays here. The documentation PR will go into another round of review today: https://github.com/ray-project/ray/pull/12718

Again, not seeing the info dict in your loss function/postprocessing fn is simply a bug, which can be worked around by running RLlib without the trajectory view API. In >=1.1.x, this should be fixed.