kierkegaard13 closed 5 years ago
`policy` and `policy_old` will be identical for the first iteration of the update loop. At the end of that iteration we update `policy` and not `policy_old`, so in further iterations they are not identical. After performing `K` updates, the updated `policy` is loaded into `policy_old`.
Refer to PPO
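To make the point above concrete, here is a minimal toy sketch of the update loop. It is not the code from PPO.py: the one-parameter Bernoulli policy, the single sample, and the names `log_prob`, `ppo_update`, `theta`, `lr` and `eps_clip` are all assumptions made for illustration. It only shows that the probability ratio is exactly 1 in the first epoch, moves away from 1 once the policy has been stepped, and that the old policy is synced after the `K` epochs.

```python
import math


def log_prob(theta, action):
    """Log-probability of `action` under a toy Bernoulli policy.

    Hypothetical stand-in for a network: P(action=1) = sigmoid(theta).
    """
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return math.log(p1 if action == 1 else 1.0 - p1)


def ppo_update(theta, theta_old, action, advantage,
               k_epochs=4, lr=0.5, eps_clip=0.2):
    """Run K epochs of a clipped-surrogate update on a single sample.

    Returns the new policy parameter, the synced old parameter, and the
    ratio observed at the start of each epoch.
    """
    ratios = []
    # The old policy's log-prob stays fixed for all K epochs.
    logp_old = log_prob(theta_old, action)

    for _ in range(k_epochs):
        ratio = math.exp(log_prob(theta, action) - logp_old)
        ratios.append(ratio)

        # d/dtheta of log pi(action) for the Bernoulli policy.
        p1 = 1.0 / (1.0 + math.exp(-theta))
        dlogp = (1.0 - p1) if action == 1 else -p1

        # The clipped surrogate has zero gradient once the ratio leaves
        # the trust region in the direction the advantage favors.
        clipped = ((advantage > 0 and ratio > 1.0 + eps_clip) or
                   (advantage < 0 and ratio < 1.0 - eps_clip))
        grad = 0.0 if clipped else advantage * ratio * dlogp

        # Only the current policy is updated inside the loop.
        theta += lr * grad

    # After the K epochs, copy the updated policy into the old policy.
    theta_old = theta
    return theta, theta_old, ratios


theta, theta_old, ratios = ppo_update(theta=0.0, theta_old=0.0,
                                      action=1, advantage=1.0)
print(ratios)
```

If the sync were moved before the loop instead, the two policies would still start each update identical, so the first-epoch ratio would be 1 either way; the difference only shows up from the second epoch on.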
Cool, I think I see now. Thanks for the link.
I'm a bit new to PPO, but I think this line is out of place in PPO.py:
I think it should be placed before the for loop, because otherwise `policy` and `policy_old` will be identical and the log probs should evaluate to the same value.