Apologies if I am misunderstanding something, but the direction of the KL divergence calculations used throughout the TRPO code seems to be at odds with the TRPO paper. Instead of KL[new params || old params], should we not be taking KL[old params || new params]?
TRPO Code: https://github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/tf1/trpo/core.py#L136
TRPO Paper: [screenshot of the trust-region constraint, D_KL(theta_old || theta) <= delta, with the old parameters as the first argument]
Moreover, should we not be taking the gradients with respect to the second set of parameters, not the first, as suggested by this part of the paper?

I think that you are right: the arguments of the KL are swapped in this case. However, it is not an issue in practice. Both KL[old || new] and KL[new || old] have zero gradient at theta = theta_old, and their second-order gradients (Hessians) there are identical, both equal to the Fisher information matrix, so the quadratic approximation that TRPO actually optimizes is the same for either direction.
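To see this concretely, here is a quick numerical sanity check (a minimal sketch, not code from the Spinning Up repo; `gaussian_kl` and `hessian` are illustrative helpers I made up). It uses a 1-D Gaussian policy parameterized by (mean, log_std) and compares the finite-difference Hessians of both KL directions, taken with respect to the new parameters, at theta = theta_old:

```python
# Compare the Hessians of KL[old || new] and KL[new || old] at theta = theta_old.
import numpy as np

def gaussian_kl(p, q):
    """KL(p || q) for 1-D Gaussians, each given as (mean, log_std)."""
    mu_p, log_std_p = p
    mu_q, log_std_q = q
    var_p, var_q = np.exp(2 * log_std_p), np.exp(2 * log_std_q)
    return log_std_q - log_std_p + (var_p + (mu_p - mu_q) ** 2) / (2 * var_q) - 0.5

def hessian(f, x, eps=1e-4):
    """Hessian of a scalar function f at x via central finite differences."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.eye(n)[i] * eps
            ej = np.eye(n)[j] * eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

theta_old = np.array([0.3, -0.2])  # (mean, log_std) of the old policy

kl_paper = lambda th: gaussian_kl(theta_old, th)  # KL[old || new], as in the paper
kl_code = lambda th: gaussian_kl(th, theta_old)   # KL[new || old], as in the code

H_paper = hessian(kl_paper, theta_old)
H_code = hessian(kl_code, theta_old)

# Both Hessians equal the Fisher information matrix diag(1/sigma^2, 2)
# for this parameterization, so the two directions agree to second order.
print(np.round(H_paper, 3))
print(np.round(H_code, 3))
print("match:", np.allclose(H_paper, H_code, atol=1e-3))
```

This is also why the swap does not change the update: TRPO only ever uses the KL through its Hessian (via Fisher-vector products in the conjugate gradient step), and that Hessian is the same for either direction at theta = theta_old.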