Hi @HareshMiriyala! Your thought process is correct: improving performance on VPG requires tuning the learning rate (as you are doing), and sometimes higher ones will work better but not always. Getting it right is tricky, which is why algorithms like TRPO and PPO, which are easier to tune, are more reliable.
I noticed that when I use the interpolation equation

theta_final = theta_old + alpha * (theta_new - theta_old)

where theta_old are the policy network parameters before the VPG step and theta_new are the parameters after the standard VPG step, and I set alpha to 5 or 10, the average reward is much higher than when I don't use the interpolation equation.
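For concreteness, here is a minimal sketch of what I mean, assuming a PyTorch policy and optimizer (the helper name is made up, not part of Spinning Up):

```python
import copy
import torch

def extrapolated_vpg_step(policy, optimizer, loss, alpha=10.0):
    """Take one VPG gradient step, then scale the parameter change by alpha.

    Minimal sketch of the interpolation trick described above. `policy` is
    assumed to be a torch.nn.Module and `optimizer` its torch optimizer.
    """
    # Snapshot theta_old before the optimizer step.
    theta_old = copy.deepcopy(policy.state_dict())

    # Standard VPG step produces theta_new.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Overwrite parameters with theta_old + alpha * (theta_new - theta_old).
    with torch.no_grad():
        for name, p in policy.named_parameters():
            p.copy_(theta_old[name] + alpha * (p - theta_old[name]))
```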
For example, over 50 epochs on the InvertedPendulum-v2 environment, I get an average reward of around 50, whereas with the interpolation equation and an alpha of 10 I get an average reward of around 800, roughly 16 times better than standard VPG. From what I understand, the policy gradient in VPG gets the direction of the policy update right, but when I play with the magnitude of the update it performs better. TRPO handles this by constraining the KL divergence between the old and new policies, and PPO handles it by clipping the surrogate objective, but the fundamental issue with model-free policy gradient methods is that the policy gradient step is unstable because VPG is not sure about the magnitude of the update. Am I on the right track with this thought process?
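For comparison, my understanding of how PPO-Clip bounds the update size is roughly the following sketch (tensor names are illustrative, not Spinning Up's exact API):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, clip_ratio=0.2):
    """PPO-Clip surrogate loss: limits how far the new policy can move from
    the old one by clipping the probability ratio, rather than relying on the
    raw step size being tuned just right.

    logp_new, logp_old: log-probs of the taken actions under the new/old policy.
    adv: advantage estimates. All are 1-D tensors of equal length.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    # Pessimistic (min) bound removes the incentive to push the ratio
    # outside the clip range.
    return -torch.mean(torch.min(ratio * adv, clipped))
```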