nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

Implementation issues #6

Closed · kierkegaard13 closed this issue 5 years ago

kierkegaard13 commented 5 years ago

I'm a bit new to PPO, but I think this line is out of place in PPO.py:

    self.policy_old.load_state_dict(self.policy.state_dict())

I think it should be placed before the for loop; otherwise policy and policy_old are identical, and the log probs will evaluate to the same value.

nikhilbarhate99 commented 5 years ago

policy and policy_old are identical only for the first iteration of the update loop. At the end of each iteration we update policy, not policy_old, so in the subsequent iterations they differ. Only after performing all K epochs of updates is the updated policy loaded into policy_old.

Refer to PPO
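
For what it's worth, a minimal sketch of the update structure being described (names like `evaluate`, `old_logprobs`, and `K_epochs` are assumptions about the interface, not necessarily the repo's exact code): the stored log probs come from the frozen policy_old, the current policy is optimized against them for K epochs, and only afterwards is policy_old synced.

```python
import torch
import torch.nn.functional as F

def ppo_update(policy, policy_old, optimizer, rollout, K_epochs=4, eps_clip=0.2):
    # rollout holds tensors collected with policy_old during environment interaction:
    # states, actions, the log probs of those actions under policy_old, and returns.
    old_states, old_actions, old_logprobs, returns = rollout

    for _ in range(K_epochs):
        # Re-evaluate the stored actions with the CURRENT policy.
        logprobs, state_values, dist_entropy = policy.evaluate(old_states, old_actions)

        # Importance ratio r_t = pi_theta(a|s) / pi_theta_old(a|s).
        # old_logprobs stay fixed, so after the first gradient step the ratio is no longer 1.
        ratios = torch.exp(logprobs - old_logprobs.detach())

        # Clipped surrogate objective, plus value loss and entropy bonus.
        advantages = returns - state_values.detach()
        surr1 = ratios * advantages
        surr2 = torch.clamp(ratios, 1 - eps_clip, 1 + eps_clip) * advantages
        loss = (-torch.min(surr1, surr2)
                + 0.5 * F.mse_loss(state_values, returns)
                - 0.01 * dist_entropy)

        optimizer.zero_grad()
        loss.mean().backward()
        optimizer.step()

    # Only after all K epochs is policy_old synced with the new weights,
    # which is the line the question refers to.
    policy_old.load_state_dict(policy.state_dict())
```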

kierkegaard13 commented 5 years ago

Cool, I think I see now. Thanks for the link.