openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/

VPG: standardize advantages + standardize returns? #9

Closed zuoxingdong closed 6 years ago

zuoxingdong commented 6 years ago

In the VPG code, the advantage estimate is standardized; I guess the reason is that it uses GAE? I am wondering whether there is also a benefit from standardizing the returns, which is often done in VPG implementations (when not using any bootstrapping). Standardizing the returns could be viewed as helping the value function learn faster, since it then fits a standardized 'dataset'.

  1. Is it correct that, when using bootstrapping, it is better not to standardize returns anymore? In the next iteration, when calculating bootstrapped returns, the reward at each time step is at its raw magnitude, but the bootstrapped last-state value is on the standardized scale because of the training in the previous iteration.

  2. VecNormalize turns out to be very helpful for accelerating training by standardizing observations with a running average. It also has an option to standardize rewards using a running average of episodic returns. When that option is used, does it have essentially the same effect as standardizing returns?
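For reference, this is roughly the kind of running-return reward normalization I have in mind (my own sketch, not the actual VecNormalize code):

```python
import numpy as np

class RunningMeanStd:
    """Tracks a running mean and variance with a parallel (Welford-style) update."""
    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def update(self, x):
        batch_mean, batch_var, batch_count = np.mean(x), np.var(x), len(x)
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta ** 2 * self.count * batch_count / total) / total
        self.mean, self.count = new_mean, total


class RewardNormalizer:
    """Scales rewards by the running std of a discounted return accumulator."""
    def __init__(self, gamma=0.99, eps=1e-8):
        self.ret_rms = RunningMeanStd()
        self.ret = 0.0  # running discounted return
        self.gamma, self.eps = gamma, eps

    def __call__(self, reward, done):
        self.ret = self.gamma * self.ret + reward
        self.ret_rms.update(np.array([self.ret]))
        if done:
            self.ret = 0.0  # resetting on episode end is one possible design choice
        return reward / np.sqrt(self.ret_rms.var + self.eps)
```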

jachiam commented 6 years ago

Hi @zuoxingdong,

Advantage standardization is included for two reasons: 1) it is extremely common in RL algorithm implementations, and 2) it is mathematically principled (it does not change the underlying algorithm), because it does not change the direction of the gradient in expectation. By the EGLP lemma, the shift is the same as adding a constant baseline. It rescales the gradient as well, but in TRPO and PPO that also makes no difference at the theoretical level; the rescaling can only impact VPG, and in practice it doesn't seem to change much.
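For concreteness, a minimal sketch of the advantage standardization step (generic names, not the Spinning Up source):

```python
import numpy as np

def standardize_advantages(adv, eps=1e-8):
    """Shift and rescale advantage estimates to mean 0, std 1.

    The shift acts like a constant baseline (fine by the EGLP lemma); the
    rescaling only changes the gradient's magnitude, not its expected direction.
    """
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)
```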

With standardizing returns for learning a value function faster: you can do this (rllab does, for instance), as long as you unnormalize the value function when performing inference (otherwise you skew all of the RL computations). Some people think that this helps, although I have not seen a rigorous ablation analysis yet to prove this claim decisively.
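One way to handle that un-normalization is to wrap the value regressor so it trains on standardized targets but predicts on the original return scale; a rough sketch (hypothetical helper, not rllab's or Spinning Up's code):

```python
import numpy as np

class NormalizedValueFunction:
    """Trains a regressor on standardized return targets, predicts on the raw scale."""
    def __init__(self, regressor, eps=1e-8):
        self.regressor = regressor  # anything with .fit(x, y) / .predict(x)
        self.ret_mean, self.ret_std, self.eps = 0.0, 1.0, eps

    def fit(self, obs, returns):
        # Standardize the targets and remember the statistics that were used.
        self.ret_mean = returns.mean()
        self.ret_std = returns.std() + self.eps
        targets = (returns - self.ret_mean) / self.ret_std
        self.regressor.fit(obs, targets)

    def predict(self, obs):
        # Un-normalize so downstream advantage estimates see the true return scale.
        return self.regressor.predict(obs) * self.ret_std + self.ret_mean
```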

For your specific questions:

  1. You can combine standardized targets (returns) in value learning with bootstrapping, as long as you are careful about maintaining the correct meaning of the equations (see the sketch after this list). It does add more programming work for you, though.

  2. It has a similar effect to standardizing returns, although it's a bit different because (if I remember correctly, I haven't looked at VecNormalize in a little while) it doesn't take into account end-of-episode signals.
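On point 1, the bookkeeping looks roughly like this (my own illustration, reusing the hypothetical NormalizedValueFunction above): bootstrap with the un-normalized prediction so the target is on the same scale as the raw rewards, and let the fit step re-standardize the targets:

```python
import numpy as np

def bootstrapped_targets(rewards, next_obs, dones, vf, gamma=0.99):
    """One-step TD targets using a value function trained on standardized returns.

    vf.predict already returns un-normalized values, so the bootstrap term lives
    on the same scale as the raw rewards; vf.fit will standardize the resulting
    targets again before regression.
    """
    next_values = vf.predict(next_obs)
    return rewards + gamma * (1.0 - dones) * next_values
```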

Either way, I thought that return normalization was slightly too complex for the Spinning Up use case (which is to say: being as helpful as possible for understanding how to transmute the equations of deep RL into code), so it is omitted from the implementations here. If you want to try including it as a hacking project, you should do it - and let us know how your experimental results look! :)

Hope this helps!