Closed: murtazabasu closed this issue 4 years ago
Yes, I think you could write it as `ratios = torch.exp(logprobs - old_logprobs)`; since `old_logprobs` is already detached, this would not make a difference. But we still need the graph for backpropagation through the ratios to update the policy, so you can NOT do `ratios = torch.exp(logprobs - old_logprobs).detach()`.
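A minimal sketch of this point, using a made-up stand-in policy parameter and hypothetical advantages (none of these names come from the repo's code), showing that gradients flow to the policy only when the graph through `ratios` is kept:

```python
import torch

# Hypothetical setup: logprobs carries a graph through the current policy,
# old_logprobs was recorded during the rollout and carries no graph.
theta = torch.tensor([0.5], requires_grad=True)   # stand-in policy parameter
logprobs = theta * torch.tensor([0.1, 0.2, 0.3])  # depends on theta -> has a graph
old_logprobs = torch.tensor([0.05, 0.25, 0.15])   # already detached
advantages = torch.tensor([1.0, -0.5, 2.0])       # made-up advantages

# Keeping the graph: gradients flow from the surrogate loss back to theta.
ratios = torch.exp(logprobs - old_logprobs)
loss = -(ratios * advantages).mean()
loss.backward()
print(theta.grad)  # non-None: the policy can be updated

# Detaching the ratios instead cuts the graph entirely:
theta.grad = None
ratios_detached = torch.exp(logprobs - old_logprobs).detach()
loss2 = -(ratios_detached * advantages).mean()
# loss2.backward() would raise a RuntimeError here, because loss2 has no
# grad_fn: there is nothing left to backpropagate through.
```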
Hello, in this step here,

`ratios = torch.exp(logprobs - old_logprobs.detach())`

you are detaching the grad from the `old_logprobs` variable. This is already performed in the previous step, i.e. `old_logprobs = torch.squeeze(torch.stack(memory.logprobs)).to(device).detach()`. So should the ratios be like this: `ratios = torch.exp(logprobs - old_logprobs).detach()`, i.e. detaching the grads from the `ratios`?
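For reference, a quick sanity check (with a made-up tensor, not the repo's actual `memory.logprobs`) that calling `.detach()` on an already-detached tensor is a no-op:

```python
import torch

old_logprobs = torch.randn(4).detach()      # already detached, like memory.logprobs
print(old_logprobs.requires_grad)           # False
print(old_logprobs.detach().requires_grad)  # False -- detaching again changes nothing
# So `logprobs - old_logprobs` and `logprobs - old_logprobs.detach()` build
# exactly the same graph; the question is only about `.detach()` on `ratios`.
```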