peterzcc / Arena


TRPO to be tuned #18

Closed peterzcc closed 6 years ago

peterzcc commented 7 years ago

So I implemented a TRPO algorithm that tries to replicate the authors' code, using TensorFlow. We can run the demo here. I tried to reuse as much existing code as possible and to reduce the dependency on any specific deep learning framework. The algorithm now looks good: it can achieve a score of 1000 on InvertedPendulum-v1.

Remaining work to be done:

  1. It hasn't been tested in other environments, so there could be bugs or differences from the original implementation.
  2. I don't know why the L-BFGS-B algorithm always terminates with "STOP: TOTAL NO. of ITERATIONS EXCEEDS LIMIT".
  3. I don't know why the conjugate gradient algorithm always reaches the maximum number of iterations without reducing the residual norm below 1e-10 (see the sketch after this list).
  4. Parallelize the actor-learners with multi-environment, single-agent training. This requires a redesign of the architecture.
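
As a reference point for item 3, here is a minimal NumPy sketch of the standard conjugate gradient loop used in TRPO-style solvers. This is not the actual Arena code; `fvp`, `max_iters`, and `residual_tol` are illustrative names. If the Fisher-vector product is badly scaled or not positive definite, the squared residual may never drop below the 1e-10 tolerance and the loop only exits at `max_iters`, which matches the symptom described above.

```python
# Minimal sketch (NumPy only, hypothetical names): solve A x = g where
# fvp(v) returns the Fisher-vector product A v, stopping when the squared
# residual norm falls below residual_tol.
import numpy as np

def conjugate_gradient(fvp, g, max_iters=10, residual_tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()          # residual r = g - A x, with x = 0 initially
    p = g.copy()          # search direction
    r_dot_r = r.dot(r)
    for _ in range(max_iters):
        Ap = fvp(p)
        alpha = r_dot_r / p.dot(Ap)
        x += alpha * p
        r -= alpha * Ap
        new_r_dot_r = r.dot(r)
        if new_r_dot_r < residual_tol:   # converged: residual small enough
            break
        p = r + (new_r_dot_r / r_dot_r) * p
        r_dot_r = new_r_dot_r
    return x

# Quick check on a small symmetric positive-definite system (toy example):
A = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A.dot(v), g)
print(np.allclose(A.dot(x), g))  # True if CG converged
```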
sxjscience commented 7 years ago

@peterzcc @flyers After reading the TRPO paper, I find that they use the analytical estimator of the Fisher information matrix (Section 6, Paragraph 1). The traditional estimator (A = G G^T) could also be used; it was tested in the paper and has performance similar to the analytical estimator (Figure 4, Empirical FIM vs. Vine).

I feel that the traditional estimator could be better than the analytical one. G G^T is naturally positive semi-definite (p.s.d.), while the analytical version can have negative eigenvalues in the non-convex case (the Hessian is p.s.d. for convex functions, but need not be p.s.d. for non-convex ones). We can replace the analytical estimator in the program with the empirical estimator and test the efficiency and performance.
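
To make the comparison concrete, below is a small NumPy sketch (random stand-in gradients, not the repository code) of the empirical estimator A = (1/N) G G^T, where each column of G is a per-sample score gradient. Being an average of outer products, it is p.s.d. by construction, and in a TRPO solver it would typically be applied only through matrix-vector products rather than formed explicitly.

```python
# Minimal sketch (hypothetical data): empirical Fisher estimate from
# per-sample score gradients, A = (1/N) * G @ G.T.
import numpy as np

rng = np.random.default_rng(0)
n_params, n_samples = 5, 100
G = rng.normal(size=(n_params, n_samples))   # stand-in for grad log pi(a_i | s_i)

A_empirical = (G @ G.T) / n_samples          # empirical FIM estimate
eigvals = np.linalg.eigvalsh(A_empirical)
print(eigvals.min() >= -1e-12)               # True: p.s.d. up to round-off

# Applying A only through matrix-vector products, A v = (1/N) * G (G^T v),
# avoids building the full matrix (what a CG-based solver would use):
v = rng.normal(size=n_params)
Av = (G @ (G.T @ v)) / n_samples
print(np.allclose(Av, A_empirical @ v))      # True
```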