vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

PPO timeout proper handling #198

Open Howuhh opened 2 years ago

Howuhh commented 2 years ago

Hi! I'm a bit puzzled as to how a timeout could be handled correctly in your implementation of PPO (well, this is relevant for all variants). I am especially surprised by envpool, because it seems like they do not return the last real state at all (the way gym vec envs do in info).

Were there any ideas on how to do it right? I'd like to make a PR with that fix, but I haven't figured out how yet. The buffer in PPO is a bit more confusing than in off-policy algos, where there is a clear (s, s', d) structure.

vwxyzjn commented 2 years ago

Proper timeout handling is definitely an interesting issue I want to look into further. I think the best resource is https://github.com/DLR-RM/stable-baselines3/pull/658. Happy to look into this together :)

If you are submitting a fix PR, I highly suggest doing it just on ppo.py to show a proof-of-concept before applying the changes to other PPO variants.

Howuhh commented 2 years ago

Thanks for the link! I'll take a look and see what I can come up with.

Howuhh commented 2 years ago

After some thinking and sketching on a piece of paper, it seems to me that it could be solved this way (just a proposal for now):

# define buffer's here: states, actions, rewards, dones, values, logprobs
# Initial obs and info
state, prev_info = torch.Tensor(envs.reset()).to(device), {}

for step in range(0, args.num_steps):
    with torch.no_grad():
        action, logprob, _, value = agent.get_action_and_value(state)
    next_state, reward, done, info = envs.step(action.cpu().numpy())

    # full transition on step T
    states[step] = state
    actions[step] = action
    rewards[step] = torch.tensor(reward).to(device).view(-1)
    dones[step] = done
    # also PPO stuff for step T
    logprobs[step] = logprob
    values[step] = value.flatten()

    # Here we should check for a timeout on the previous step
    if step >= 1 and torch.any(dones[step - 1]):
        # if the previous step ended in a timeout, then we should
        # 1. set dones[step - 1][env_id] to False, as it was not a real done
        # 2. set values[step][env_id] to V(prev_info['terminal_observation']) (as it is the real next_state for the previous step)
        for env_id in range(args.num_envs):
            timeout = prev_info[env_id].get("TimeLimit.truncated", False)
            if timeout:
                terminal_state = torch.Tensor(prev_info[env_id]["terminal_observation"]).to(device)
                # Set done to False as it was a timeout, not a real terminal state
                dones[step - 1, env_id] = 0
                # Set value to V of terminal_state (not state, which is already the first state after reset)
                with torch.no_grad():
                    values[step, env_id] = agent.get_value(terminal_state.unsqueeze(0)).item()

    state, prev_info = torch.Tensor(next_state).to(device), info

This small rearrangement would, of course, require separately handling the last state after num_steps, since we need it to bootstrap the last transition, but that could be done in a similar way. The GAE computation would then not need to change at all!

The main point here is that we never use value[step] to compute anything for state[step], only for state[step - 1]:

r[step - 1] + gamma * not_done[step - 1] * value[step]

So it seems to me that we can safely swap it in place here. Does that sound reasonable to you?
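
To make this concrete, here is the bootstrap target written out as a tiny helper under the buffer layout above (just a sketch, the function name is illustrative):

import torch

def bootstrap_target(rewards, dones, values, t, gamma=0.99):
    # Target for the transition collected at step t, with the buffer layout of the
    # sketch above (dones[t] is the done returned at step t). values[t + 1] appears
    # only here, as the bootstrap for step t, so overwriting it in place with
    # V(terminal_observation) changes nothing about how step t + 1 itself is treated.
    return rewards[t] + gamma * (1.0 - dones[t]) * values[t + 1]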

vwxyzjn commented 2 years ago

Thanks for the sketch. I think it's a little tricky though, so I wrote down what the variables look like below (link):

image

I think the correct implementation is consistent with https://github.com/DLR-RM/stable-baselines3/pull/658/files#diff-384b5f21f2bed58d1d6e64da04a42fee52f353fcec38bf410338524336657bd8R205


            # Handle timeout by bootstraping with value function
            # see GitHub issue #633
            for idx, done_ in enumerate(dones):
                if (
                    done_
                    and infos[idx].get("terminal_observation") is not None
                    and infos[idx].get("TimeLimit.truncated", False)
                ):
                    terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
                    with th.no_grad():
                        terminal_value = self.policy.predict_values(terminal_obs)[0]
                    rewards[idx] += self.gamma * terminal_value
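
Dropped into ppo.py's rollout loop, this would look roughly like the following (just a sketch, untested, assuming the old gym vector API where info is a list of per-env dicts):

next_obs, reward, done, info = envs.step(action.cpu().numpy())
rewards[step] = torch.tensor(reward).to(device).view(-1)
for idx, d in enumerate(done):
    # only bootstrap when the episode ended because of the time limit, not a real terminal
    if d and info[idx].get("TimeLimit.truncated", False):
        terminal_obs = torch.Tensor(info[idx]["terminal_observation"]).to(device).unsqueeze(0)
        with torch.no_grad():
            terminal_value = agent.get_value(terminal_obs).item()
        rewards[step, idx] += args.gamma * terminal_value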

Is this approach equivalent to your implementation? I didn't quite get the reason behind step - 1 in the sketch.

Next week will be pretty busy but will try to respond.

CC @araffin

Howuhh commented 2 years ago

After thinking a bit more, it seems that my implementation is not correct: it will not correctly account for the done flag in GAE (the last_gae_lam recursion in example 2 from @Miffyli in https://github.com/DLR-RM/stable-baselines3/issues/633; the delta computation would be fine), so your version is more correct.

I don't like the fact that we're implicitly changing the rewards, though, since in other PPO variants they could be used for something else :(

I will still try to explain the thinking behind step - 1. Suppose we took 3 steps: $s_0, r_0, d_0, v(s_0), s_1, r_1, d_1, v(s_1), s_2, r_2, d_2, v(s_2)$, and $d_1$ is a done caused by a timeout.

We have two variants of the value computation for step 1:

  1. $r_1$, since $d_1$ is True (this is wrong).
  2. $r_1 + \gamma v(s_2)$ (correct, but $s_2$ here will be the first state after the reset of the next episode, so we need to correct it too).

Because we will add $v(s_2)$ to the buffer on step 2, we should check whether $d_1$ (which is where step - 1 comes from) was a timeout, and if it was, change:

  1. $d_1$ -> False
  2. $v(s_2)$ -> $v(s_2^{\text{terminal}})$

Then the value estimation will be ok: $r_1 + \gamma v(s_2^{\text{terminal}})$

Actually you do the same thing, since done in our code is added to the buffer on the next iteration of the loop (so it is step - 1). To be honest, it confuses me every time, since it is not intuitive that we add something from the previous step (done, to be specific) to the buffer (and why?). In my example (and in the code in my projects) I add done on the same step. https://github.com/vwxyzjn/cleanrl/blob/ee262daaafb0b9b5baf0f7b89190b28edfe4760c/cleanrl/ppo_continuous_action.py#L203

For now I think your proposal is better; I will try to make a PR and do some runs on gym MuJoCo.

vwxyzjn commented 2 years ago

I don't like the fact that we're implicitly changing the rewards, though, since in other PPO variants they could be used for something else :(

That's a good point. One way we could address it is to create another storage variable, terminal_values, with the same shape as values, and use terminal_values during the return and advantage calculation.
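
Something like this, just as a sketch (untested; the variable names are made up, not actual cleanrl code):

# extra storage next to the existing buffers
terminal_values = torch.zeros((args.num_steps, args.num_envs)).to(device)
truncateds = torch.zeros((args.num_steps, args.num_envs)).to(device)

# in the rollout loop, instead of modifying rewards (old gym API assumed here):
for idx, d in enumerate(done):
    if d and info[idx].get("TimeLimit.truncated", False):
        terminal_obs = torch.Tensor(info[idx]["terminal_observation"]).to(device).unsqueeze(0)
        with torch.no_grad():
            terminal_values[step, idx] = agent.get_value(terminal_obs).item()
        truncateds[step, idx] = 1.0

# in the GAE loop, the stored terminal value replaces the bootstrap that the done
# flag zeroes out for truncated steps, and rewards stay untouched:
delta = (
    rewards[t]
    + truncateds[t] * args.gamma * terminal_values[t]
    + args.gamma * nextvalues * nextnonterminal
    - values[t]
)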

Also a quick note: the new gym API for truncation is coming soon https://github.com/openai/gym/pull/2752. Ideally we should be prototyping using the new API.

I will still try to explain the thinking behind ...

One issue with the example you provided is storage - $v(s_2)$ is also useful and needs to be stored somewhere.

Actually you do the same thing, since done in our code is added to the buffer on the next iteration of the loop (so it is step - 1). To be honest, it confuses me every time, since it is not intuitive...

next_obs and next_done serve a very intricate role. Please see the "vectorized architecture" section in https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ for a detailed explanation.
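
For reference, the return and advantage computation in ppo.py is roughly the following (paraphrased here, so treat it as a sketch): next_value bootstraps the very last stored step via next_done, while dones[t + 1], which was stored one iteration "late", masks the bootstrap of step t.

with torch.no_grad():
    next_value = agent.get_value(next_obs).reshape(1, -1)
    advantages = torch.zeros_like(rewards).to(device)
    lastgaelam = 0
    for t in reversed(range(args.num_steps)):
        if t == args.num_steps - 1:
            # the last stored step is bootstrapped with next_obs / next_done
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            # dones[t + 1] is the done observed right after step t
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        delta = rewards[t] + args.gamma * nextvalues * nextnonterminal - values[t]
        advantages[t] = lastgaelam = delta + args.gamma * args.gae_lambda * nextnonterminal * lastgaelam
    returns = advantages + values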

image

vwxyzjn commented 2 years ago

Here is a more complete walkthrough

image

FelipeMartins96 commented 1 year ago

Found this issue as I was wondering if the isaacgym ppo was handling timeouts; it is not, right? I will check the code more thoroughly tomorrow and try to find a solution for my use case.

Additionally, I remember having issues with how isaacgymenvs handles timeouts and resets; I think my env diverges a bit from theirs, so I'm not sure whether it handles their case but not mine.

vwxyzjn commented 1 year ago

Yeah, the isaacgym ppo variant does not deal with truncation properly… I sort of lost interest in it because properly handling it did not seem to result in a significant performance difference. See https://github.com/sail-sg/envpool/issues/194#issuecomment-1317171243

FelipeMartins96 commented 1 year ago

@vwxyzjn did you notice any significant overhead by handling truncations when benchmarking?

vwxyzjn commented 1 year ago

Not really, no significant overhead.

gmarkkula commented 3 months ago

Just chiming in here to agree with this comment in a related thread: I have found that proper timeout handling can make quite a difference. Thought I'd share my experience and solution here in case it's useful to others, and to encourage getting it implemented in CleanRL. (In relation to the discussions above, I am not sure about the best way to do this, though.)

Anyway: I have a vectorised environment that I was getting good results with in SB3, but to speed things up I wanted to move both the environment and the RL to the GPU, building from CleanRL's ppo_continuous_action_isaacgym.py. However, the CleanRL PPO did not replicate the good SB3 PPO results. After much head-scratching, I discovered that disabling the timeout handling in SB3 (by no longer reporting truncations and not passing terminal observations) got SB3 stuck in the same suboptimum as CleanRL, and after adding bootstrapping at truncation to the CleanRL implementation I now get very similar good results with CleanRL to what I was getting with SB3 out of the box.

For reference, in my environment the agent can incur a lot of effort penalties early on while it's trying to learn how to get positive rewards, so early on the rewards are mostly negative. My understanding of what was happening is that without truncation bootstrapping, just giving up and doing nothing becomes a very strong suboptimum: every now and then the agent still gets a random (from its Markov perspective) treat of reaching the episode truncation, where without bootstrapping the reached state appears to have about zero value, which is much better than the negative values incurred while trying to learn. The value function learning also becomes noisy and slow, because this weird zero value keeps happening unpredictably. (Apologies if this is obvious to others; it took me a while to suss it out.)
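
Put in value-target terms: without bootstrapping, the target at a truncated step is just $r_T$ (roughly zero here), whereas with bootstrapping it becomes $r_T + \gamma V(s_{\text{final}})$, which early in training is strongly negative, so the spurious "free zero" at truncation disappears.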

I based my bootstrap implementation on the SB3 one, as referenced in Costa's comment above, changing this line:

            next_obs, rewards[step], next_done, info = envs.step(action)

to:

            next_obs, rewards[step], terminations, truncations, info = envs.step(action)
            next_done = (terminations | truncations).long()
            with torch.no_grad():
                terminal_values = agent.critic(info['final_observation']).squeeze()
            rewards[step] = torch.where(
                truncations, rewards[step] + args.gamma * terminal_values, rewards[step])

This approach directly modifies the rewards, as discussed in the thread above, but that doesn't matter to me in this case.

Note that I am using my own environment here, not an IsaacGym environment, so info['final_observation'] is created by me in my environment class. Based on my (possibly imperfect?) interpretation of gymnasium.vector.VecEnv, I am passing it as the observation tensor (same shape as next_obs), but with the information from before terminated/truncated environments were reset.
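
In case it helps, the step function of my environment does roughly the following (heavily simplified; the class and helper names are just illustrative, not from any real library):

import torch

class MyTorchVecEnv:
    def step(self, actions):
        obs, rewards, terminations, truncations = self._simulate(actions)  # hypothetical physics/reward helper
        # snapshot the observations *before* any auto-reset, same shape as next_obs
        info = {"final_observation": obs.clone()}
        dones = terminations | truncations
        if dones.any():
            # hypothetical helper that resets only the done sub-environments
            obs = self._reset_done_envs(obs, dones)
        return obs, rewards, terminations, truncations, info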