peterzcc / Arena


Implementing Natural Policy Gradient-based algorithms, including TRPO #17

Closed: peterzcc closed this issue 7 years ago

peterzcc commented 7 years ago

I will start implementing TRPO with GAE today. The conjugate gradient optimization will first be implemented using the Python API, which could be optimized later.
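
For reference, a minimal NumPy sketch of the conjugate gradient loop I have in mind. The `fvp` callback name is my own placeholder for whatever Fisher-vector product routine we end up with; CG only needs matrix-vector products, so the Fisher matrix is never formed explicitly:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g for x, where fvp(v) returns F @ v.

    Standard CG for a symmetric positive-definite F.
    """
    x = np.zeros_like(g)
    r = g.copy()   # residual: g - F @ x, with x = 0 initially
    p = r.copy()   # search direction
    rs_old = r.dot(r)
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / p.dot(Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check with an explicit SPD matrix standing in for F.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A @ v, b)
print(np.allclose(A @ x, b))  # True
```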

peterzcc commented 7 years ago

I think the implementation here is sufficient for now.

peterzcc commented 7 years ago

Something that could be challenging to implement is the Fisher-vector product. For this part we need second-order gradients. However, MXNet has no symbol that computes the gradient, so we may need to perform backward() manually. I'm thinking of a way to realize it.
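
One common workaround (not necessarily what we should ship, just a known trick) is to approximate the Hessian-vector product with a central difference of first-order gradients, which needs nothing beyond backward(). A sketch, assuming a `grad(theta)` callable that wraps a manual backward pass:

```python
import numpy as np

def hessian_vector_product(grad, theta, v, eps=1e-5):
    """Approximate H @ v via a central difference of gradients:
    H v ~= (g(theta + eps*v) - g(theta - eps*v)) / (2 * eps).
    Only first-order gradients are required.
    """
    g_plus = grad(theta + eps * v)
    g_minus = grad(theta - eps * v)
    return (g_plus - g_minus) / (2.0 * eps)

# Toy check: f(x) = 0.5 * x^T A x has Hessian A.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda x: A @ x
v = np.array([1.0, -1.0])
print(np.allclose(hessian_vector_product(grad, np.zeros(2), v), A @ v))  # True
```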

flyers commented 7 years ago

Yes, computing the second-order gradient is the tricky part. It looks like we have to do it manually in MXNet.

peterzcc commented 7 years ago

One thing I don't understand: rllab's implementation seems to use a different method. However, rllab's conjugate gradient implementation is not very readable, and I haven't fully understood it.

sxjscience commented 7 years ago

Automatically computing the Hessian matrix is currently not supported in MXNet. Could we use a simplified version that only needs first-order derivatives?
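
One candidate simplification: since the Fisher matrix is F = E[s s^T] with s = grad_theta log pi(a|s), the product F v can be assembled from per-sample score vectors using first-order gradients only (the empirical Fisher). A rough sketch; the `scores` array here is a stand-in for whatever per-sample gradients the backward pass yields:

```python
import numpy as np

def empirical_fisher_vector_product(scores, v, damping=1e-3):
    """F v with F ~= (1/N) * sum_i s_i s_i^T, never forming F.

    scores: (N, D) array of per-sample score vectors s_i.
    A small damping term keeps F positive definite for CG.
    """
    N = scores.shape[0]
    return scores.T @ (scores @ v) / N + damping * v

# Toy usage with random stand-in score vectors.
rng = np.random.default_rng(0)
scores = rng.standard_normal((128, 5))
v = rng.standard_normal(5)
print(empirical_fisher_vector_product(scores, v))
```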

flyers commented 7 years ago

@peterzcc You can refer to the supplementary material of the original TRPO paper. It introduces two different implementation methods; the difference is how the Fisher-vector product is computed. The first one uses the Fisher information matrix directly, which has a simple analytic form for some specific distributions. The second one uses the more generic Hessian-vector product. rllab's implementation uses the second one since Theano can compute Hessians automatically. However, I think in MXNet we have to use the first one.
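
To illustrate the first method: for a diagonal Gaussian policy whose network outputs are (mu, log_sigma), the Fisher matrix with respect to those outputs is diagonal (1/sigma^2 on the mu block, 2 on the log_sigma block), so F v = J^T M (J v) needs only Jacobian-vector products. A sketch under that assumption; `jvp` and `vjp` are hypothetical callbacks that would be backed by MXNet's forward and backward passes:

```python
import numpy as np

def gaussian_fvp(jvp, vjp, sigma, v):
    """Fisher-vector product for a diagonal Gaussian policy N(mu, sigma^2)
    via the analytic method: F v = J^T M (J v), where M is the Fisher
    matrix of the distribution w.r.t. the network outputs (mu, log_sigma).

    jvp(v): Jacobian-vector product, returns (d_mu, d_log_sigma)
    vjp(u): vector-Jacobian product, maps output-space u back to theta
    """
    d_mu, d_log_sigma = jvp(v)        # push v through the network Jacobian
    u_mu = d_mu / sigma**2            # analytic Fisher block for mu
    u_log_sigma = 2.0 * d_log_sigma   # analytic Fisher block for log_sigma
    return vjp((u_mu, u_log_sigma))   # pull back to parameter space

# Toy usage: a policy whose outputs ARE the parameters (J = identity),
# so F reduces to M itself.
sigma = np.array([0.5, 2.0])
identity_jvp = lambda v: (v[:2], v[2:])
identity_vjp = lambda u: np.concatenate(u)
v = np.ones(4)
print(gaussian_fvp(identity_jvp, identity_vjp, sigma, v))
# -> [4.   0.25 2.   2.  ]
```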