refraction-ray opened this issue 4 years ago
@apoorvnandan please take a look.
@refraction-ray I apologize for the late reply and thank you for pointing this out.
While I understand your reasoning, I think the textbook is not explicitly clear in this particular case, i.e. where the actor and critic make their predictions from a shared representation of the state.
If the actor and critic had no shared parameters, what you have pointed out would be entirely clear (and would be handled automatically, since the derivative of anything coming from the critic's value
w.r.t. the actor's parameters would be zero).
Here, we have some shared parameters that get updated by gradients coming from both the actor loss (which contains the value
component) and the critic loss. The parameters specific to the actor and critic parts of the network get updated only from the correct gradients, as per the theory.
Therefore, I think it is fine to update the shared parameters (which can be thought of as an encoder for the state
input, as opposed to the actor/critic-specific parameters) using gradients coming from the value
component in the actor loss, along with those coming from the critic's loss. Your recommended change will certainly work as well; I just couldn't conclude that it is compulsory for the shared parameters. Please let me know if I am missing something obvious.
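For concreteness, here is a minimal sketch of the kind of shared-encoder architecture being discussed; the layer sizes are illustrative, roughly following the linked example:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Shared "encoder" for the state input (the shared parameters in question).
inputs = layers.Input(shape=(4,))                       # CartPole observation
common = layers.Dense(128, activation="relu")(inputs)   # shared representation

# Separate heads: the actor outputs action probabilities, the critic a state value.
action = layers.Dense(2, activation="softmax")(common)  # actor-specific parameters
critic = layers.Dense(1)(common)                        # critic-specific parameters

model = keras.Model(inputs=inputs, outputs=[action, critic])
```

Gradients from the critic loss only ever reach `common` and `critic`; the question here is whether the value component of the actor loss should also be allowed to push gradients into the shared `common` layer.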
Empirically, over 7 runs, the number of training steps required to solve the task:
Without the suggested change: mean = 394, std = 203.6
With the suggested change (value detached): mean = 385, std = 118.5
Given the high standard deviation, I don't see a great advantage of the suggested change, but it does seem to perform slightly better.
@apoorvnandan, thanks for your detailed reply and benchmark results.
My experiment data over 10 runs (steps required):
And if the required score is 150 instead (relaxing the winning condition; still 10 runs each):
Therefore, I believe the suggested change can improve the performance: it reduces both the mean and the fluctuation of the number of steps required. One can try more extensive and rigorous comparisons between the two cases if interested.
From a theoretical perspective, the question is not formulated as:
Given the loss function $$L = \sum_i \ln\pi_i\,(\text{reward} - \text{value})$$, should we differentiate value in addition to $$\ln\pi_i$$?
Instead, $$\nabla L = \sum_i \nabla(\ln \pi_i)\,(\text{reward} - \text{value})$$ is the first-class citizen in REINFORCE: it is the correct formula for the gradient, derived ab initio. The post I referenced in this issue is clear about this point. The formula for $$L$$ is only a shortcut (i.e. a score function) for this correct gradient formula, and it is valid only as long as reward and value have no dependence on the trainable parameters.
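To spell this out (with notation introduced here for clarity: $$G_t$$ is the return, i.e. "reward" above, and $$V_\theta(s_t)$$ is the critic's value, sharing parameters $$\theta$$ with the policy $$\pi_\theta$$), the correct REINFORCE-with-baseline gradient is

$$\nabla_\theta J = \sum_t \big(G_t - V_\theta(s_t)\big)\,\nabla_\theta \ln \pi_\theta(a_t \mid s_t),$$

whereas differentiating the surrogate loss $$L = \sum_t \ln \pi_\theta(a_t \mid s_t)\,\big(G_t - V_\theta(s_t)\big)$$ directly gives

$$\nabla_\theta L = \sum_t \big(G_t - V_\theta(s_t)\big)\,\nabla_\theta \ln \pi_\theta(a_t \mid s_t) \;-\; \sum_t \ln \pi_\theta(a_t \mid s_t)\,\nabla_\theta V_\theta(s_t).$$

The second sum is an extra term with no counterpart in the policy-gradient theorem; detaching the value (e.g. with `tf.stop_gradient`) removes exactly this term.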
If the theory part still seems vague or ambiguous, at least we can carry out extensive numerical experiments to compare the performance difference using different hyperparameters or even different environments. I believe that detaching the value part is better in both theory and experiments.
> at least we can carry out extensive numerical experiments to compare the performance difference using different hyperparameters or even different environments.
Totally agree. I ran the script for 50 runs with each method. The difference between the two methods becomes clear with the new numbers.
I also modified the script for a few other environments and the approach seems to work better there as well.
The lower mean and standard deviation show that the suggested change makes the script better. I'll create a PR and make the change. Thanks again for bringing this to my notice.
This issue is stale because it has been open for 180 days with no activity. It will be closed if no further activity occurs. Thank you.
In this example https://github.com/keras-team/keras-io/blob/master/examples/rl/actor_critic_cartpole.py, the gradient for the actor is defined as the gradient of the loss $$L = \sum_i \ln\pi_i\,(\text{reward} - \text{value})$$.
However, since value also depends directly on `model.variables`, the gradient of this loss is not the textbook gradient $$\nabla L = \sum_i \nabla(\ln\pi_i)\,(\text{reward} - \text{value})$$. For a detailed derivation of the correct gradient formula in this case, see https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/.
Therefore, in the implementation, one should use
`diff = ret - tf.stop_gradient(value)`
(L147) instead of the plain `diff = ret - value`
to cut off gradient backpropagation through `value`. In experiments, this change can indeed greatly reduce the number of training steps with the same hyperparameters and setup. Actually, the similar example from TensorFlow seems to have the same type of issue: https://www.tensorflow.org/tutorials/reinforcement_learning/actor_critic.
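For illustration, here is a minimal sketch of how the suggested change fits into the loss computation; the loop mirrors the structure of the linked example, but the names and the surrounding training loop are simplified here:

```python
import tensorflow as tf

huber_loss = tf.keras.losses.Huber()

def compute_loss(log_probs, values, returns):
    """Combined actor-critic loss for one episode (simplified sketch)."""
    actor_losses, critic_losses = [], []
    for log_prob, value, ret in zip(log_probs, values, returns):
        # Detach the critic's value so the actor term only backpropagates
        # through log_prob, matching the textbook policy-gradient formula.
        diff = ret - tf.stop_gradient(value)
        actor_losses.append(-log_prob * diff)
        # The critic is still trained through the non-detached value.
        critic_losses.append(
            huber_loss(tf.expand_dims(ret, 0), tf.expand_dims(value, 0))
        )
    return tf.reduce_sum(actor_losses) + tf.reduce_sum(critic_losses)
```

Only the actor term treats `value` as a constant; the critic loss keeps its gradient path through `value`, so the critic (and any shared encoder) still learns the value function.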