nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

Why maintain two policies? #14

Closed · biggzlar closed this issue 4 years ago

biggzlar commented 4 years ago

https://github.com/nikhilbarhate99/PPO-PyTorch/blob/64376aabb07d668573bf63399e3cafc8ee663e9c/PPO.py#L127

Why do we maintain two policies at all? The old policy already produced the old action distributions during rollout, so by the time we compute the ratios during an update we no longer need it.

The old log-probs might as well have been generated by the single, current policy at collection time; then, during updates, we evaluate only the updated policy, compute the ratios against the previously saved log_probs, and we are done. At no point do we need both networks at once.

Am I missing something?
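
For illustration, here is a minimal sketch of what I mean with a single policy (hypothetical variable names and a toy network, not the repo's exact code):

```python
import torch
import torch.nn as nn

# Hypothetical single policy network (stand-in for the repo's ActorCritic).
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

def dist(states):
    return torch.distributions.Categorical(logits=policy(states))

# --- Rollout: sample from the current policy and save the log-probs ---
states = torch.randn(8, 4)                      # dummy batch of states
with torch.no_grad():
    d = dist(states)
    actions = d.sample()
    old_log_probs = d.log_prob(actions)         # frozen snapshot; no policy_old needed

advantages = torch.randn(8)                     # dummy advantage estimates
eps_clip = 0.2

# --- Update: evaluate only the (possibly updated) current policy ---
new_log_probs = dist(states).log_prob(actions)
ratios = torch.exp(new_log_probs - old_log_probs)
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1 - eps_clip, 1 + eps_clip) * advantages
loss = -torch.min(surr1, surr2).mean()
loss.backward()
```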

nikhilbarhate99 commented 4 years ago

Yes, you are correct, we could get rid of policy_old. I wrote it this way to keep the notation consistent with the paper, so the code is easier to follow alongside it.

biggzlar commented 4 years ago

Gotcha, it does serve a purpose then. Thanks for the reply!

ghost commented 4 years ago

[screenshot: the KL-penalized objective from the PPO paper]

This may be useful for anyone wondering why the PPO paper keeps pi_old: pi_old also appears in the KL-penalized objective, which the paper uses as a baseline to compare against the clipped surrogate objective.
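
For reference, the KL-penalized objective from the PPO paper (Schulman et al., 2017) is

$$
L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t \;-\; \beta\,\mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right] \right]
$$

The clipped objective replaces this explicit KL penalty with the clip on the ratio, which is why, in the clipped version, the old policy only needs to supply the saved log-probs.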