openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Gamma in VecNormalize for rms updates. #538

Closed: pisiiki closed this issue 6 years ago

pisiiki commented 6 years ago

I have observed that the VecNormalize class updates the reward running mean statistics with what looks to me like a discounted reward:

def step_wait(self):
    obs, rews, news, infos = self.venv.step_wait()
    self.ret = self.ret * self.gamma + rews # here: a running discounted return, not the raw reward
    obs = self._obfilt(obs)
    if self.ret_rms:
        self.ret_rms.update(self.ret)  # running mean/var updated with self.ret ...
        rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)  # ... and its std used to scale rews
    return obs, rews, news, infos

I can't see why this helps; I would use rews directly in the rms update. Am I missing something here?

Thanks.

pzhokhov commented 6 years ago

It looks like the statistics are collected for episode returns (not for the rewards), but I don't really know what the benefit of using those for reward normalization is versus using statistics of the rewards themselves. Also, in that case there may be a bug: self.ret would need to be reset to 0 at the beginning of each episode, here: https://github.com/openai/baselines/blob/92b9a3725710a20ec62120e7b1533afc84a3df27/baselines/common/vec_env/vec_normalize.py#L39. @joschu @unixpickle could you comment please?

unixpickle commented 6 years ago

Normalization is important because neural networks trained with Adam (and most other algorithms, as well) don't learn well with very large or very small targets. Returns are what we really want to normalize, because they are ultimately what drive the value function and policy.

To see why normalizing returns is different than normalizing rewards, take the simple case where the rewards are large but cancel each other out. For example, the rewards might be [100, -101, 102, -99, ...]. In this case, the returns will be fairly small, even though the rewards are large. Thus, the advantages we feed into the RL algorithm will already be fairly well-behaved. In a different environment, the rewards might be [10, 10, 10, 10, ...]. In this case, the rewards are not that large, but the returns are likely to be on the order of 1000 if we use gamma=0.99.
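A quick numerical sketch of those two cases with gamma=0.99 (the discounted_returns helper below is just for illustration, not baselines code):

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, accumulated backwards over the episode
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Large rewards that roughly cancel: the return stays around the size of one reward.
alternating = np.resize([100.0, -101.0, 102.0, -99.0], 1000)
print(discounted_returns(alternating)[0])  # roughly 100

# Modest constant rewards: the return approaches 10 / (1 - 0.99) = 1000.
constant = np.full(1000, 10.0)
print(discounted_returns(constant)[0])     # roughly 1000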

pzhokhov commented 6 years ago

thanks @unixpickle! Do you agree with resetting returns to 0 in reset method, or is there some reason why we should not?

andytwigg commented 6 years ago

I agree it makes sense to normalize returns. The current method effectively applies a sliding window. I do have two concerns:

1) with gamma=0.99 this window is effectively of length ~100, which may be shorter than a single episode, so the effect of normalizing across episodes is lost (see the sketch below).

2) we are dividing by the std without subtracting the mean, yet the result is clipped symmetrically around 0.

My intuition is that we want to average across a much larger window and also subtract the mean. Have I missed something?
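As a rough check on point 1), the running return ret_t = gamma * ret_{t-1} + r_t weights a reward received k steps ago by gamma^k, so most of its mass comes from the last ~1/(1 - gamma) = 100 steps (a small illustrative sketch, not baselines code):

import numpy as np

gamma = 0.99
weights = gamma ** np.arange(2000)  # weight of a reward received k steps ago

# Fraction of the total weight carried by the most recent 100 rewards:
print(weights[:100].sum() / weights.sum())  # ~0.63, so the effective window is on the order of 100 steps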


unixpickle commented 6 years ago

Regarding 1), the gamma corresponds to the discount factor used in the RL algorithm. We want to normalize rewards so that the advantages in the RL algorithm have a nice magnitude, which means we should use the same gamma that's used to compute the advantages. The returns are averaged over the entire course of training by ret_rms, so the gamma doesn't directly affect how fast our normalization coefficient changes (see the sketch at the end of this comment).

Regarding 2), subtracting the mean from the rewards would change the dynamics of the environment. For example, rewards that are all 1 are very different from rewards that are all 0: in the former case the agent wants to live; in the latter, it doesn't care whether it lives or dies. The reward clipping looks like a bit of a hack, and there's probably no way to do it truly "correctly". Centering the clip range at the mean reward wouldn't quite make sense either, and I can't think of a perfect way to do it.
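To make point 1) concrete, a minimal usage sketch keeping the normalization gamma in sync with the learner's discount factor (the environment choice and wrapper setup here are assumptions for illustration, not taken from the thread):

import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.vec_normalize import VecNormalize

GAMMA = 0.99  # one discount factor, shared by normalization and the RL algorithm

venv = DummyVecEnv([lambda: gym.make("CartPole-v1")])  # illustrative env choice
venv = VecNormalize(venv, gamma=GAMMA)  # discount the running return with the learner's gamma

# Then pass the same GAMMA as the discount factor to the learner (e.g. ppo2.learn(..., gamma=GAMMA)),
# so the normalized returns match the advantages the algorithm actually computes.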

andytwigg commented 6 years ago

Regarding 1), the gamma corresponds to the discount factor used in the RL algorithm.

I see. So we want the discounted returns computed in the RL algorithm to have roughly unit variance. I guess there is a small concern that the RL algorithm discounts 'backwards' (future rewards discounted into the present) while this method accumulates past rewards in the opposite direction, but perhaps in practice this isn't much of an issue? In that case, I assume we need to set ret=0 in reset() as per @pzhokhov's comment?

pisiiki commented 6 years ago

Thanks @pzhokhov, @unixpickle, @andytwigg. Let's see if I've got this right:

If I'm correct, the code just needs to reset self.ret to 0 accordingly, not only in reset but also in step_wait (for terminal/absorbing states); see the sketch below. This won't reset the running mean/variance of the returns, only the per-env return accumulators, so returns from a previous episode won't leak into the next one and throw the first steps off.
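Roughly, the change would look like this (a sketch of the suggestion above, not the actual patch):

def step_wait(self):
    obs, rews, news, infos = self.venv.step_wait()
    self.ret = self.ret * self.gamma + rews
    obs = self._obfilt(obs)
    if self.ret_rms:
        self.ret_rms.update(self.ret)
        rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
    self.ret[news] = 0.  # zero the return accumulator for envs whose episode just ended
    return obs, rews, news, infos

def reset(self):
    self.ret = np.zeros(self.num_envs)  # clear all per-env accumulators on a full reset
    obs = self.venv.reset()
    return self._obfilt(obs)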

Regards.

andytwigg commented 6 years ago

If I'm correct, the code just needs to reset self.ret to 0 accordingly, not only in reset but also in step_wait (for terminal/absorbing states).

Yes, that's a good point.

andytwigg commented 6 years ago

@pzhokhov this looks good now, can the PR be merged?