Hi @HareshMiriyala! Your thought process is correct: improving performance on VPG requires tuning the learning rate (as you are doing), and sometimes higher ones will work better but not always. Getting it right is tricky, which is why algorithms like TRPO and PPO, which are easier to tune, are more reliable.
I noticed that when I use the interpolation equation

theta_final = theta_old + alpha * (theta_new - theta_old)

where theta_old are the policy network parameters before the VPG step and theta_new are the parameters after the standard VPG step, and I set alpha to 5 or 10, the average reward is much higher than when I don't use the interpolation equation.
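For concreteness, here is a minimal sketch of what I mean, assuming a PyTorch policy and optimizer (the helper name is made up, not part of Spinning Up):

```python
import copy
import torch

def extrapolated_vpg_step(policy, optimizer, loss, alpha=10.0):
    """Take one VPG gradient step, then scale the parameter change by alpha.

    Minimal sketch of the interpolation trick described above. `policy` is
    assumed to be a torch.nn.Module and `optimizer` its torch optimizer.
    """
    # Snapshot theta_old before the optimizer step.
    theta_old = copy.deepcopy(policy.state_dict())

    # Standard VPG step produces theta_new.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Overwrite parameters with theta_old + alpha * (theta_new - theta_old).
    with torch.no_grad():
        for name, p in policy.named_parameters():
            p.copy_(theta_old[name] + alpha * (p - theta_old[name]))
```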
For example, over 50 epochs on the InvertedPendulum-v2 environment, I get an average reward of around 50, whereas with the interpolation equation and an alpha of 10 I get an average reward of around 800, roughly 16 times better than standard VPG. From what I understand, the policy gradient in VPG gets the direction of the policy update right, but when I play with the magnitude of the update it performs better. TRPO handles this by constraining the KL divergence between the old and new policies, and PPO handles it by clipping the surrogate objective, but the fundamental issue with model-free policy gradient methods is that the policy gradient step is unstable because VPG is not sure about the magnitude of the update. Am I on the right track with this thought process?
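For comparison, my understanding of how PPO-Clip bounds the update size is roughly the following sketch (tensor names are illustrative, not Spinning Up's exact API):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, clip_ratio=0.2):
    """PPO-Clip surrogate loss: limits how far the new policy can move from
    the old one by clipping the probability ratio, rather than relying on the
    raw step size being tuned just right.

    logp_new, logp_old: log-probs of the taken actions under the new/old policy.
    adv: advantage estimates. All are 1-D tensors of equal length.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    # Pessimistic (min) bound removes the incentive to push the ratio
    # outside the clip range.
    return -torch.mean(torch.min(ratio * adv, clipped))
```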