Closed biggzlar closed 4 years ago
Yes, you are correct, we could get rid of the policy_old. I wrote it this way to keep the notation consistent with the paper, for easier understanding.
Gotcha, it does serve a purpose then. Thanks for the reply!
This may be useful for anyone wondering why the PPO paper keeps pi_old: pi_old appears explicitly in the KL-penalized objective, which the paper uses as a baseline to compare against the clipped surrogate objective.
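To make the comparison concrete, here is a minimal sketch of both per-sample objectives from the PPO paper. All names (`ratio`, `advantage`, `kl`, `beta`, `eps`) are illustrative placeholders, not identifiers from this repo, and the functions operate on plain floats rather than tensors:

```python
import math

def clipped_surrogate(ratio, advantage, eps=0.2):
    # PPO clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

def kl_penalized(ratio, advantage, kl, beta=0.01):
    # KL-penalized baseline: r * A - beta * KL(pi_old || pi)
    # This is the variant that needs pi_old as an explicit distribution,
    # not just its stored log-probs.
    return ratio * advantage - beta * kl
```

The point of the sketch: the clipped objective only needs the probability ratio, while the KL penalty needs the old policy's full action distribution to compute the divergence, which is one reason a pi_old object can be worth keeping around.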
https://github.com/nikhilbarhate99/PPO-PyTorch/blob/64376aabb07d668573bf63399e3cafc8ee663e9c/PPO.py#L127
Why do we even maintain two policies? The old policy produced the old action distributions, so by the time we compute the ratios during updates we no longer need it.
The old_log_probs might as well have been generated by the normal policy; during updates we evaluate only the updated policy, compute the ratios against the previously saved log_probs, and we are golden. At no point do we need both networks at once.
Am I missing something?