Hi, I'm reading this repository to implement my own A3C, and I found the policy loss to be incorrect.

The current policy loss is at https://github.com/miyosuda/async_deep_reinforce/blob/master/game_ac_network.py#L31:

```python
# policy entropy
entropy = -tf.reduce_sum(self.pi * log_pi, reduction_indices=1)
# policy loss (output) (Adding minus, because the original paper's objective function is for gradient ascent, but we use gradient descent optimizer.)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.multiply( log_pi, self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )
```
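For reference, a quick NumPy check of the entropy term defined above (the probabilities are made up): a uniform policy has the maximum possible entropy, while a nearly deterministic policy has entropy close to zero, so maximizing this term keeps the policy stochastic.

```python
import numpy as np

# Hypothetical action probabilities for a 4-action policy.
uniform = np.array([0.25, 0.25, 0.25, 0.25])
greedy  = np.array([0.997, 0.001, 0.001, 0.001])

def entropy(pi):
    # Same form as the `entropy` tensor above: H(pi) = -sum_a pi(a) * log(pi(a))
    return -np.sum(pi * np.log(pi))

print(entropy(uniform))  # ~1.386 (= log 4, the maximum for 4 actions)
print(entropy(greedy))   # ~0.024 (nearly deterministic, almost no entropy)
```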
So it is effectively `policy_loss = -log(pi) * a + beta * entropy`. In this case, the entropy term would be minimized. However, entropy should be maximized to encourage exploration and avoid premature convergence; the original paper says that adding the entropy of the policy to the objective improves exploration by discouraging premature convergence to suboptimal deterministic policies. Thus, the correct policy loss should be `policy_loss = -log(pi) * a - beta * entropy`.
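To make the intent concrete, here is a minimal NumPy sketch (the shapes and values are made up; `log_pi`, `a`, and `td` play the same roles as in the snippet above). The entropy bonus is subtracted, so driving the loss down with gradient descent pushes both the advantage-weighted log-probability and the entropy up:

```python
import numpy as np

entropy_beta = 0.01                            # entropy regularization weight
pi     = np.array([[0.7, 0.2, 0.1]])           # policy output for a batch of one state
a      = np.array([[1.0, 0.0, 0.0]])           # one-hot encoding of the taken action
td     = np.array([1.5])                       # advantage / TD error
log_pi = np.log(np.clip(pi, 1e-20, 1.0))

entropy = -np.sum(pi * log_pi, axis=1)         # H(pi) per sample

# Per sample: -log(pi(a)) * td  -  beta * H(pi), then summed over the batch.
policy_loss = np.sum(-np.sum(log_pi * a, axis=1) * td - entropy_beta * entropy)
print(policy_loss)
```

With this sign, lowering the loss raises the entropy, which matches the behavior the paper describes.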
If I am wrong, please just close this issue. I hope this helps you improve this implementation.
@miyosuda