I checked the behavior here again and found that, for some reason, on the first iteration when `hessian_vector_product` is called, the `grad2s` are slightly different from what they would be without this change. The difference is small, with a standard deviation of 1e-8 across all of the policy parameters. After the first call to `hessian_vector_product` the difference disappears.
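Roughly, the comparison looks like this (a sketch only; `grad2s_with_patch` and `grad2s_without_patch` are hypothetical names for the per-parameter `grad2s` collected from runs with and without the patch):

```python
import torch

def grad2s_difference_stats(grad2s_with_patch, grad2s_without_patch):
    """Flatten two lists of per-parameter gradients and compare them elementwise."""
    diff = torch.cat([(a - b).reshape(-1)
                      for a, b in zip(grad2s_with_patch, grad2s_without_patch)])
    # For the runs described above, the standard deviation is on the order of 1e-8.
    return {'max_abs': diff.abs().max().item(), 'std': diff.std().item()}
```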
I ran the algorithm on HalfCheetahDir-v1 a couple of times with and without this change and the results look similar. However, they are not exactly the same, so I'll investigate why such a difference exists in the first place.
Looks like figuring out the exact difference between the two implementations is a bit more complicated than I expected. I don't have time to pursue the investigation further now, so feel free to close this.
If you are interested, I was able to narrow down the problem a little bit. The problem is that `stepdir`, as computed in `metalearner.step`, differs slightly between the two implementations the first time it is computed. The difference disappears on later calls to `metalearner.step`.
Setting `first_order=True` in `metalearner.adapt` eliminates the difference. Running the KL gradient computation twice inside `hessian_vector_product` also seems to eliminate the difference for some reason, and I mean literally just running the following code twice:
# KL divergence and its gradients w.r.t. the policy parameters; create_graph=True
# keeps the graph so a second backward pass through these gradients is possible.
kl = self.kl_divergence(episodes)
grads = torch.autograd.grad(kl, self.policy.parameters(), create_graph=True)
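To be explicit about what "twice" means: the experiment is simply duplicating those two lines inside `hessian_vector_product`, with the first result discarded and nothing else changed (a sketch, not an exact diff):

```python
# First pass: computed and then overwritten.
kl = self.kl_divergence(episodes)
grads = torch.autograd.grad(kl, self.policy.parameters(), create_graph=True)

# Second pass: the values that are actually used for the Hessian-vector product.
kl = self.kl_divergence(episodes)
grads = torch.autograd.grad(kl, self.policy.parameters(), create_graph=True)
```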
(Figure: `stepdir` computed using the different implementations for the Bandit-K5-v0 environment.)

Based on these findings, I think computing the gradients of the KL divergence for the `normal_mlp` policy leaves some bit of state dangling somewhere in the computational graph, which is then used slightly inconsistently when the gradients are computed the next time.
I have included this patch in the newest version of the code.
KL divergence gradients don't have to be recomputed on every iteration of `conjugate_gradient`.
This results in a ~10% speedup of the overall algorithm on my setup.
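A minimal sketch of the idea, written as a method of the metalearner (it assumes `self.kl_divergence(episodes)` and `self.policy` as in the snippet above; the `damping` argument and the flattening via `torch.cat` are assumptions, not necessarily the exact code in the repo):

```python
import torch

def hessian_vector_product(self, episodes, damping=1e-2):
    """Return a closure computing (H + damping * I) @ vector for the KL Hessian."""
    # The KL divergence and its first-order gradients are computed once, up front.
    kl = self.kl_divergence(episodes)
    grads = torch.autograd.grad(kl, self.policy.parameters(), create_graph=True)
    flat_grad_kl = torch.cat([grad.reshape(-1) for grad in grads])

    def _product(vector):
        # Each conjugate_gradient iteration only does a dot product and one
        # extra backward pass; the KL graph built above is reused.
        grad_kl_v = torch.dot(flat_grad_kl, vector)
        grad2s = torch.autograd.grad(grad_kl_v, self.policy.parameters(),
                                     retain_graph=True)  # graph is reused on every call
        return torch.cat([grad.reshape(-1) for grad in grad2s]) + damping * vector

    return _product
```

Since only `_product` depends on `vector`, the KL divergence graph is built once per meta-optimization step instead of once per conjugate gradient iteration, which is where the ~10% saving comes from.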