Closed · thiagopbueno closed this issue 5 years ago
It seems that in RLlib actor-critic algorithms the value function is optimized jointly with the policy objective (e.g., in A3CLoss).
@angelolovatto Is there a way to decouple those objectives, so as to have a more flexible way of optimizing them (e.g., running a number of value-fitting iterations before using the value function as a baseline in the policy gradient objective)?
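To make the intent concrete, here is a hedged NumPy sketch (toy data, not RLlib code) of the decoupled scheme described above: run several value-fitting iterations on Monte Carlo returns first, and only then compute the baseline-adjusted advantages that would feed the policy gradient objective. The linear value function, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: state features and Monte Carlo returns (linear signal plus noise).
states = rng.normal(size=(32, 4))
returns = states @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=32)

# Step 1: several iterations of value fitting, V(s) = s @ w, squared-error loss.
w = np.zeros(4)
for _ in range(100):
    preds = states @ w
    w -= 0.05 * states.T @ (preds - returns) / len(returns)

# Step 2: only now use the fitted value function as a baseline.
# The policy gradient would then be estimated as
# mean(advantages * grad log pi(a|s)), with the variance-reduced advantages:
advantages = returns - states @ w
```

Because the baseline is fitted before the advantages are computed, most of the return variance is absorbed by the value function, which is the point of decoupling the two objectives.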
We could implement this in a similar manner to what is done in the keras_policy implementation (which looks more like a template than anything else).
One possible issue with using separate keras models for the policy, value function, and transition model is that each would most likely define its own session. This could be a problem when specifying resources, since tf_session_args (from COMMON_CONFIG) would most likely not be passed to each model automatically. I believe we can set the session for each model to the session of the TFPolicy, though.
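The wiring suggested above can be illustrated schematically. The sketch below uses stand-in classes (not real RLlib or TensorFlow APIs) to show the idea: wire every model to the single session owned by the policy before any model lazily creates a private one. In actual TF1-style code the equivalent call would be along the lines of keras's set_session, but the names here are purely illustrative.

```python
class Session:
    """Stand-in for a TF session; only identity matters for this sketch."""
    def __init__(self, name):
        self.name = name


class Model:
    """Stand-in for a keras model that lazily creates its own session."""
    def __init__(self):
        self._session = None

    def set_session(self, session):
        self._session = session

    @property
    def session(self):
        if self._session is None:
            # The undesirable default: a private session per model,
            # which would ignore tf_session_args from COMMON_CONFIG.
            self._session = Session("private")
        return self._session


# The TFPolicy owns the single session configured via tf_session_args.
policy_session = Session("tf_policy")

policy_model = Model()
value_model = Model()
transition_model = Model()

# Wire all three models to the policy's session up front.
for model in (policy_model, value_model, transition_model):
    model.set_session(policy_session)
```

With this ordering, all models share one session, so resources only need to be specified once on the policy side.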
Implemented in #19.
Builds ops for computing the loss using Monte Carlo returns, and the gradient-descent update step over the value function parameters, given a TensorFlow optimizer and a batch of experiences.
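The two pieces just described can be sketched without the TensorFlow graph machinery. This is a minimal NumPy version (not the actual ops from #19): computing Monte Carlo returns for an episode, and one gradient-descent update of a linear value function fitted to those returns. The linear parameterization and learning rate are assumptions for illustration.

```python
import numpy as np


def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted returns G_t = r_t + gamma * G_{t+1} over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def value_update(weights, states, returns, lr=0.1):
    """One gradient step on the mean squared error between V(s) = s @ w
    and the Monte Carlo return targets."""
    preds = states @ weights
    grad = states.T @ (preds - returns) / len(returns)
    return weights - lr * grad


rewards = np.array([1.0, 0.0, 1.0])
states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
targets = monte_carlo_returns(rewards, gamma=1.0)  # [2.0, 1.0, 1.0]

w = np.zeros(2)
for _ in range(200):
    w = value_update(w, states, targets)
```

In the real implementation the update step would be produced by the TensorFlow optimizer's minimize call over the value loss; the loop above just makes the fitting dynamics explicit.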