nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

Why are ratios not always 1? #12

Closed: BigBadBurrow closed this issue 4 years ago

BigBadBurrow commented 4 years ago

I've been looking over the code to get a better grasp of what it's doing, and the one thing that confuses me is in the update() method: why aren't the ratios always 1?

The log probabilities obtained from policy_old are stored in memory, and then in update() the log probabilities are recomputed with policy via the evaluate() method; the exponential of the difference between them is the ratio. Afterwards the policy_old weights are updated from policy, so they're the same. But if the same state is fed into exact copies of policy, I don't understand why they'd produce different log probabilities. I'm obviously missing a piece of the puzzle, but I can't think what it is.

nikhilbarhate99 commented 4 years ago

Refer to #6
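For anyone reading this later: the point being referred to (and which the follow-up below also picks up on) is that update() runs several optimization epochs over the same stored batch, and policy_old is only synced back to policy after that whole loop. So the ratios are exactly 1 on the first epoch, while the two networks are still identical copies, but after the first gradient step policy has moved while the stored old log probabilities stay fixed, so the ratios drift away from 1 for the remaining epochs. Below is a minimal, self-contained sketch of that effect with a toy categorical policy; the names, sizes, and the unclipped surrogate are illustrative, not the repo's actual code.

```python
import torch
import torch.nn as nn

# Toy setup (hypothetical names and sizes, not the repo's networks):
# a tiny categorical policy over 2 actions for a 4-dim state.
policy = nn.Linear(4, 2)
policy_old = nn.Linear(4, 2)
policy_old.load_state_dict(policy.state_dict())   # start as exact copies

states = torch.randn(8, 4)             # stored rollout states
actions = torch.randint(0, 2, (8,))    # stored actions
advantages = torch.randn(8)            # placeholder advantages

def log_prob(net, s, a):
    return torch.distributions.Categorical(logits=net(s)).log_prob(a)

old_logprobs = log_prob(policy_old, states, actions).detach()   # what memory holds

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
for epoch in range(4):                                  # K_epochs > 1
    ratios = torch.exp(log_prob(policy, states, actions) - old_logprobs)
    print(epoch, ratios.mean().item())                  # 1.0 on epoch 0, then drifts
    loss = -(ratios * advantages).mean()                # surrogate (clipping omitted)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

policy_old.load_state_dict(policy.state_dict())   # only synced after the whole loop
```

On the first pass the ratios come out as exactly 1 (same weights, same inputs); every later epoch compares an already-updated policy against the frozen old_logprobs, which is where the non-unit ratios come from.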

BigBadBurrow commented 4 years ago

Yeah, I realised that, but an agent will still learn with epoch = 1. So I guess in that case it'd just be using the critic / advantage aspect rather than anything from policy optimization? Nice proof of concept, though, that it can learn using only the critic part.
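To make the epoch = 1 case concrete: when policy and policy_old are still identical on the single pass, the ratios are exactly 1, the clipping never binds, and the gradient reaching the log probabilities is just the advantage (negated and averaged), so the actor update reduces to a plain advantage-weighted policy-gradient step alongside the critic loss, with the PPO-specific clipping never active. A toy sketch with made-up numbers (not the repo's code):

```python
import torch

# Hypothetical numbers: with a single epoch, policy has not moved yet,
# so the fresh log-probs equal the stored ones and the ratios are exactly 1.
old_logprobs = torch.tensor([-0.7, -1.2, -0.4])
logprobs = old_logprobs.clone().requires_grad_(True)     # same values on the single pass
advantage = torch.tensor([0.5, -1.0, 2.0])
eps = 0.2

ratios = torch.exp(logprobs - old_logprobs.detach())     # == 1 elementwise
surr1 = ratios * advantage
surr2 = torch.clamp(ratios, 1 - eps, 1 + eps) * advantage
loss = -torch.min(surr1, surr2).mean()                   # clipping never binds here

loss.backward()
print(logprobs.grad)   # equals -advantage / 3: a plain advantage-weighted gradient signal
```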