nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

PPO with deterministic (fixed) variance #44

Closed · keinccgithub closed this issue 2 years ago

keinccgithub commented 3 years ago

Looking deeply into your code, I'm curious why you chose to decay the variance of the actor network. One of the advantages of PPO is the stochastic policy it uses, so why set a fixed variance, even if it decays over time? Thanks!
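
For context, the fixed-variance setup being asked about looks roughly like the sketch below. This is a minimal illustration, not the repo's exact code; `actor`, `action_dim`, and `action_std` are placeholder names. The actor network only predicts the action mean, while the standard deviation is a hand-set scalar shared across action dimensions and independent of the state.

```python
import torch
from torch.distributions import MultivariateNormal

# Minimal sketch of a fixed-variance Gaussian policy (illustrative, not the repo's exact code).
# The actor network outputs only the action mean; the variance comes from a hand-set scalar,
# shared across action dimensions and independent of the state.
action_dim = 2
action_std = 0.6                                   # fixed standard deviation (hyperparameter)
action_var = torch.full((action_dim,), action_std ** 2)

def get_action(actor, state):
    mean = actor(state)                            # network predicts only the mean
    cov = torch.diag(action_var)                   # fixed diagonal covariance
    dist = MultivariateNormal(mean, cov)
    action = dist.sample()
    return action, dist.log_prob(action)
```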

Alex-HaochenLi commented 3 years ago

I have the same question. I couldn't find any relevant material on this online.

frettini commented 2 years ago

My intuition is that we want a high variance at the start so that the policy can explore a range of possible actions. As the policy learns, the variance is reduced to narrow down exploration and, hopefully, converge to the optimal behaviour. I'm still very much trying to understand this as well, so it would be great if someone could give a proper explanation.
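
A decay schedule along those lines might look like the sketch below. The names and numbers (`decay_rate`, `min_action_std`, the decay frequency) are assumptions for illustration, not values taken from the repo: the std starts high for exploration and is reduced periodically until it hits a floor.

```python
# Hypothetical linear decay schedule for the action standard deviation.
# Start wide for exploration, shrink toward a floor as training progresses.
action_std = 0.6          # initial std (assumed value)
decay_rate = 0.05         # amount subtracted at each decay step (assumed value)
min_action_std = 0.1      # floor so the policy never becomes fully deterministic

def decay_action_std(action_std):
    return max(action_std - decay_rate, min_action_std)

# e.g. called every N timesteps in the training loop:
# if timestep % action_std_decay_freq == 0:
#     action_std = decay_action_std(action_std)
```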

nikhilbarhate99 commented 2 years ago

In my experiments I found that a fixed, state-independent variance (decayed linearly over training) is easily trainable, with similar intuition to what @frettini mentioned. Most other implementations parametrize the variance too, so the network predicts it. (I might add this in the future, if I get time.)
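
For comparison, "parametrizing the variance" usually means keeping a learnable log-std alongside the mean head, roughly as in the sketch below. This is an assumption about that alternative approach, not code from this repo; the class and layer sizes are made up for illustration.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

# Sketch of a Gaussian policy with a learned, state-independent log-std
# (one common way to parametrize the variance); not the repo's code.
class GaussianActor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        # log-std is a free parameter updated by the PPO loss, not decayed by hand
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.mean_net(state)
        std = self.log_std.exp().expand_as(mean)
        return Normal(mean, std)

# usage: dist = actor(state); a = dist.sample(); logp = dist.log_prob(a).sum(-1)
```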