nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

Ratio Calculation #18

Closed · murtazabasu closed this issue 4 years ago

murtazabasu commented 4 years ago

Hello, in this step here, `ratios = torch.exp(logprobs - old_logprobs.detach())`, you are detaching the grad from the `old_logprobs` variable. This was already done in the previous step, i.e. `old_logprobs = torch.squeeze(torch.stack(memory.logprobs)).to(device).detach()`. So should the ratios instead be computed as `ratios = torch.exp(logprobs - old_logprobs).detach()`, i.e. detaching the grads from the ratios?
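For reference, a quick standalone check (dummy tensor, not the repo's actual code) showing that a second `.detach()` on an already-detached tensor is indeed a no-op:

```python
import torch

logprobs_with_graph = torch.tensor([-1.0, -0.5], requires_grad=True)

# First detach, as done when building old_logprobs from memory.logprobs:
old_logprobs = logprobs_with_graph.detach()

# Detaching again inside the ratio expression changes nothing:
print(old_logprobs.requires_grad)           # False
print(old_logprobs.detach().requires_grad)  # False
```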

nikhilbarhate99 commented 4 years ago

Yes, I think you could write it as `ratios = torch.exp(logprobs - old_logprobs)`; since `old_logprobs` is already detached, that would make no difference. But we still need the computation graph for backpropagation through the ratios in order to update the policy, so you can NOT do `ratios = torch.exp(logprobs - old_logprobs).detach()`.
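To illustrate the difference, here is a minimal self-contained sketch (dummy log-probs and advantages, not the repo's actual update code) showing that the undetached ratios keep the graph needed for the clipped-objective update, while detaching the ratios would cut gradient flow to the policy:

```python
import torch

# Stand-ins: logprobs come from the current policy (carries gradients),
# old_logprobs are stored rollout values (no grad, mimicking the earlier .detach()).
logprobs = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
old_logprobs = torch.tensor([-1.1, -0.4, -1.9])  # already detached

ratios = torch.exp(logprobs - old_logprobs)              # keeps the graph
detached = torch.exp(logprobs - old_logprobs).detach()   # graph cut off

print(ratios.requires_grad)    # True  -> loss can backprop into the policy
print(detached.requires_grad)  # False -> gradients would never reach the policy

# Clipped PPO surrogate built from the ratios (dummy advantages here):
advantages = torch.tensor([0.5, -0.2, 1.0])
eps_clip = 0.2
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1 - eps_clip, 1 + eps_clip) * advantages
loss = -torch.min(surr1, surr2).mean()

loss.backward()
print(logprobs.grad)  # non-None: gradients flow through the ratios to the policy
```

Had `detached` been used to build the surrogate instead, `loss.backward()` would produce no gradient for the policy parameters and the update step would do nothing.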