Closed · thiagopbueno closed this issue 5 years ago
It seems that in RLlib actor-critic algorithms the value function is optimized jointly with the policy objective (e.g., in A3CLoss).
@angelolovatto Is there a way to decouple those objectives, so as to have a more flexible way of optimizing them (e.g., running a number of value-fitting iterations before using the value function as a baseline in the policy gradient objective)?
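To make the intent concrete, here is a hedged NumPy sketch (toy data, not RLlib code) of the decoupled scheme described above: run several value-fitting iterations on Monte Carlo returns first, and only then compute the baseline-adjusted advantages that would feed the policy gradient objective. The linear value function, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: state features and Monte Carlo returns (linear signal plus noise).
states = rng.normal(size=(32, 4))
returns = states @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=32)

# Step 1: several iterations of value fitting, V(s) = s @ w, squared-error loss.
w = np.zeros(4)
for _ in range(100):
    preds = states @ w
    w -= 0.05 * states.T @ (preds - returns) / len(returns)

# Step 2: only now use the fitted value function as a baseline.
# The policy gradient would then be estimated as
# mean(advantages * grad log pi(a|s)), with the variance-reduced advantages:
advantages = returns - states @ w
```

Because the baseline is fitted before the advantages are computed, most of the return variance is absorbed by the value function, which is the point of decoupling the two objectives.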
We could implement this in a similar manner to what is done in the keras_policy implementation (which looks more like a template than anything else).
One possible issue with using separate keras models for the policy, value function, and transition model is that each would most likely define its own session. This could be a problem when specifying resources, since tf_session_args (from COMMON_CONFIG) would most likely not be passed to each model automatically. I believe we can set the session for each model to the session of the TFPolicy, though.
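The wiring suggested above can be illustrated schematically. The sketch below uses stand-in classes (not real RLlib or TensorFlow APIs) to show the idea: wire every model to the single session owned by the policy before any model lazily creates a private one. In actual TF1-style code the equivalent call would be along the lines of keras's set_session, but the names here are purely illustrative.

```python
class Session:
    """Stand-in for a TF session; only identity matters for this sketch."""
    def __init__(self, name):
        self.name = name


class Model:
    """Stand-in for a keras model that lazily creates its own session."""
    def __init__(self):
        self._session = None

    def set_session(self, session):
        self._session = session

    @property
    def session(self):
        if self._session is None:
            # The undesirable default: a private session per model,
            # which would ignore tf_session_args from COMMON_CONFIG.
            self._session = Session("private")
        return self._session


# The TFPolicy owns the single session configured via tf_session_args.
policy_session = Session("tf_policy")

policy_model = Model()
value_model = Model()
transition_model = Model()

# Wire all three models to the policy's session up front.
for model in (policy_model, value_model, transition_model):
    model.set_session(policy_session)
```

With this ordering, all models share one session, so resources only need to be specified once on the policy side.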
Implemented in #19.
Builds ops for computing the loss using Monte Carlo returns, and the gradient-descent update step over the value function parameters, given a TensorFlow optimizer and a batch of experiences.
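The two pieces just described can be sketched without the TensorFlow graph machinery. This is a minimal NumPy version (not the actual ops from #19): computing Monte Carlo returns for an episode, and one gradient-descent update of a linear value function fitted to those returns. The linear parameterization and learning rate are assumptions for illustration.

```python
import numpy as np


def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted returns G_t = r_t + gamma * G_{t+1} over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def value_update(weights, states, returns, lr=0.1):
    """One gradient step on the mean squared error between V(s) = s @ w
    and the Monte Carlo return targets."""
    preds = states @ weights
    grad = states.T @ (preds - returns) / len(returns)
    return weights - lr * grad


rewards = np.array([1.0, 0.0, 1.0])
states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
targets = monte_carlo_returns(rewards, gamma=1.0)  # [2.0, 1.0, 1.0]

w = np.zeros(2)
for _ in range(200):
    w = value_update(w, states, targets)
```

In the real implementation the update step would be produced by the TensorFlow optimizer's minimize call over the value loss; the loop above just makes the fitting dynamics explicit.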