Open rarilurelo opened 7 years ago
It depends on how large the difference is. If it is, e.g., around 1e-8, it might just be a numerical precision issue.
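For intuition, here is a minimal, self-contained sketch (not code from rllab; the diagonal-Gaussian KL helper and the 1e-8 perturbation are my own illustration) of how a mean difference of that magnitude shows up as a tiny but nonzero KL:

```python
import numpy as np

def diag_gaussian_kl(mean1, log_std1, mean2, log_std2):
    """KL(N(mean1, exp(log_std1)^2) || N(mean2, exp(log_std2)^2)), summed over action dims."""
    var1, var2 = np.exp(2 * log_std1), np.exp(2 * log_std2)
    return np.sum(
        log_std2 - log_std1 + (var1 + (mean1 - mean2) ** 2) / (2.0 * var2) - 0.5,
        axis=-1,
    )

means = np.zeros((1000, 6))
log_stds = np.zeros((1000, 6))
perturbed_means = means + 1e-8          # a tiny, precision-level difference in the means

kl = diag_gaussian_kl(means, log_stds, perturbed_means, log_stds)
print(kl.mean())                        # ~3e-16: nonzero, but far below any meaningful scale
```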
If the KL were exactly zero, i.e. no numerical precision error occurred, the gradient of the KL with respect to the parameters would be zero, and the Hessian-vector products would be zero as well. Does this implementation rely on numerical imprecision? That sounds strange to me.
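For reference, a short sketch of the standard identities this reasoning touches on (the notation $\theta_{\text{old}}$ for the sampling policy's parameters and the Fisher-matrix remark are background I am adding, not something stated in the thread):

$$
\begin{aligned}
\mathrm{KL}\!\left(\pi_{\theta_{\text{old}}} \,\|\, \pi_{\theta}\right)\Big|_{\theta=\theta_{\text{old}}} &= 0,\\
\nabla_{\theta}\,\mathrm{KL}\!\left(\pi_{\theta_{\text{old}}} \,\|\, \pi_{\theta}\right)\Big|_{\theta=\theta_{\text{old}}} &= 0,\\
\nabla^{2}_{\theta}\,\mathrm{KL}\!\left(\pi_{\theta_{\text{old}}} \,\|\, \pi_{\theta}\right)\Big|_{\theta=\theta_{\text{old}}} &= F(\theta_{\text{old}})
\quad\text{(the Fisher information matrix),}
\end{aligned}
$$

so even when the KL and its gradient are exactly zero at $\theta_{\text{old}}$, the Hessian-vector products $F(\theta_{\text{old}})\,v$ used by TRPO are in general nonzero.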
Hi! MeanKLBefore is defined in optimize_policy in npo.py.
I think policy1, which samples actions for collecting trajectories, is strictly equal to policy2 used for computing the KL, so KL(policy1 || policy2) (MeanKLBefore) should be equal to zero. However, it has a small nonzero value. To confirm the difference between policy1 and policy2, I ran the example script trpo_gym.py and inserted some print statements for debugging, roughly along the lines of the sketch below.
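A minimal, self-contained sketch of the kind of comparison meant here (the array names and the float32 round-trip are hypothetical illustrations of how such a difference could arise, not the actual rllab code or the actual debug output):

```python
import numpy as np

# Hypothetical stand-ins for what the debug prints compare: the Gaussian means
# recorded while sampling actions (policy1) vs. the means recomputed from the
# same observations at the start of optimize_policy (policy2).
sampled_means = np.random.randn(1000, 6)
recomputed_means = sampled_means.astype(np.float32).astype(np.float64)  # purely illustrative precision loss

diff = np.abs(sampled_means - recomputed_means)
print("max |mean difference|:", diff.max())                              # small but nonzero
print("identical?            ", np.array_equal(sampled_means, recomputed_means))
```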
The result:
The result shows that MeanKLBefore is not equal to zero because the means differ slightly. My question is: what causes this difference in the means?
Thanks for your help!