openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

strange behavior of VPG #128

Closed HareshKarnan closed 5 years ago

HareshKarnan commented 5 years ago

I noticed some strange behavior when I use the interpolation equation

theta_new = theta_old + alpha * (theta_step - theta_old)

where theta_old are the policy network parameters before the standard VPG update, theta_step are the parameters that update produces, and theta_new are the parameters I actually keep.

When I set alpha to 5 or 10, the average reward is much higher than when I don't use the interpolation equation (alpha = 1 recovers the plain update).

For example, after 50 epochs on the InvertedPendulum-v2 environment I get an average reward of around 50, whereas with the interpolation equation and an alpha of 10 I get around 800, roughly 16 times better than standard VPG. From what I understand, the policy gradient in VPG gets the direction of the policy update right, but when I play around with the magnitude of the update, it performs better. TRPO handles this by constraining the KL divergence between the old and new policies, and PPO handles it with its own machinery, but the fundamental issue with model-free policy gradient methods is that the policy gradient step is unstable because VPG has no principled way to choose the magnitude of the update. Am I on the right track with this line of thought?
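For reference, a minimal PyTorch-style sketch of the interpolation described above (not code from this issue); it assumes a policy network `pi_net`, its optimizer `pi_optimizer`, and a precomputed VPG policy loss `pi_loss`, all of which are hypothetical names:

```python
import copy

import torch

def interpolated_vpg_step(pi_net, pi_optimizer, pi_loss, alpha=10.0):
    # Save theta_old before the optimizer changes anything.
    theta_old = copy.deepcopy(pi_net.state_dict())

    # Standard VPG step: after this the network holds theta_step.
    pi_optimizer.zero_grad()
    pi_loss.backward()
    pi_optimizer.step()

    # Apply theta_new = theta_old + alpha * (theta_step - theta_old) in place.
    # alpha = 1 keeps the plain update; alpha > 1 extrapolates past it.
    with torch.no_grad():
        for name, p in pi_net.named_parameters():
            p.copy_(theta_old[name] + alpha * (p - theta_old[name]))
```

Scaling the displacement by alpha is effectively an enlarged step size for that update (though optimizer state such as Adam's moment estimates is not rescaled to match).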

jachiam commented 5 years ago

Hi @HareshMiriyala! Your thought process is correct: improving performance with VPG requires tuning the learning rate (which is effectively what you are doing), and sometimes a higher one works better, though not always. Getting it right is tricky, which is why algorithms like TRPO and PPO, which are easier to tune, are more reliable.
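As a point of comparison, the way PPO (the clipped variant) keeps the update magnitude in check is the clipped surrogate objective. A minimal sketch, assuming tensors `logp` (log-probs under the current policy), `logp_old` (under the data-collecting policy), and `adv` (advantage estimates) are already computed; the names are placeholders, not from this thread:

```python
import torch

def ppo_clip_loss(logp, logp_old, adv, clip_ratio=0.2):
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(logp - logp_old)
    # Clipping the ratio removes the incentive to move the policy far from
    # the one that collected the data, bounding the effective update size.
    clip_adv = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    return -torch.min(ratio * adv, clip_adv).mean()
```

Because the objective stops improving once the ratio leaves the clip range, PPO is far less sensitive to the step-size problem described above than plain VPG with a hand-tuned alpha.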