Closed keinccgithub closed 2 years ago
While looking into your code, I became curious about why you chose to decay the variance of the actor network. One of the advantages of PPO is the stochastic policy it uses, so I don't understand why you set a fixed variance, even if it decays over time. Thanks!
I have the same question. I couldn't find any relevant material online.
My intuition is that we want a high variance at the start so that the policy can explore different possible actions. As the policy learns, the variance is reduced to narrow down the exploration and hopefully converge to the optimal behaviour. I'm still very much trying to understand it myself, so it would be great if someone could actually explain it.
In my experiments I found that a fixed variance (i.e. not predicted by the network, but decayed linearly over training) is easy to train, for the same intuition @frettini mentioned. Most other implementations parametrize the variance too, so that the network predicts it. (I might add this in the future, if I get time.)
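For illustration, a minimal sketch of what such a linear decay schedule might look like. The function name and parameters here are hypothetical, not the repo's actual API: the idea is simply that the action standard deviation drops by a fixed amount every so many environment steps, and is clamped at a floor so the policy never becomes fully deterministic.

```python
def linear_std_decay(initial_std: float, min_std: float,
                     decay_amount: float, decay_freq: int,
                     timestep: int) -> float:
    """Hypothetical schedule: return the action standard deviation
    after `timestep` environment steps, reduced by `decay_amount`
    every `decay_freq` steps and clamped at `min_std`."""
    num_decays = timestep // decay_freq
    std = initial_std - num_decays * decay_amount
    return max(std, min_std)


# Early in training: high std, lots of exploration.
print(linear_std_decay(0.6, 0.1, 0.05, 1000, 0))      # 0.6
# Mid training: std has shrunk, exploration narrows.
print(linear_std_decay(0.6, 0.1, 0.05, 1000, 2500))   # 0.5
# Late training: clamped at the minimum.
print(linear_std_decay(0.6, 0.1, 0.05, 1000, 10**6))  # 0.1
```

The learned-variance alternative mentioned above would instead make the (log) standard deviation an output of the actor network or a trainable parameter, so the optimizer controls exploration rather than a hand-tuned schedule.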