Hi, I'm reading this repository to implement my own A3C, and I found the policy loss to be incorrect.

The current policy loss is at https://github.com/miyosuda/async_deep_reinforce/blob/master/game_ac_network.py#L31:

```python
# policy entropy
entropy = -tf.reduce_sum(self.pi * log_pi, reduction_indices=1)
# policy loss (output) (Adding minus, because the original paper's objective function is for gradient ascent, but we use gradient descent optimizer.)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.multiply( log_pi, self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )
```
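For reference, a quick NumPy check of the entropy term defined above (the probabilities are made up): a uniform policy has the maximum possible entropy, while a nearly deterministic policy has entropy close to zero, so maximizing this term keeps the policy stochastic.

```python
import numpy as np

# Hypothetical action probabilities for a 4-action policy.
uniform = np.array([0.25, 0.25, 0.25, 0.25])
greedy  = np.array([0.997, 0.001, 0.001, 0.001])

def entropy(pi):
    # Same form as the `entropy` tensor above: H(pi) = -sum_a pi(a) * log(pi(a))
    return -np.sum(pi * np.log(pi))

print(entropy(uniform))  # ~1.386 (= log 4, the maximum for 4 actions)
print(entropy(greedy))   # ~0.024 (nearly deterministic, almost no entropy)
```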
So it is effectively `policy_loss = -log(pi) * a + beta * entropy`. In this case, the entropy term would be minimized. However, entropy should be maximized to encourage exploration and avoid premature convergence; the original paper says that adding the entropy of the policy to the objective improves exploration by discouraging premature convergence to suboptimal deterministic policies. Thus, the correct policy loss should be `policy_loss = -log(pi) * a - beta * entropy`.
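To make the intent concrete, here is a minimal NumPy sketch (the shapes and values are made up; `log_pi`, `a`, and `td` play the same roles as in the snippet above). The entropy bonus is subtracted, so driving the loss down with gradient descent pushes both the advantage-weighted log-probability and the entropy up:

```python
import numpy as np

entropy_beta = 0.01                            # entropy regularization weight
pi     = np.array([[0.7, 0.2, 0.1]])           # policy output for a batch of one state
a      = np.array([[1.0, 0.0, 0.0]])           # one-hot encoding of the taken action
td     = np.array([1.5])                       # advantage / TD error
log_pi = np.log(np.clip(pi, 1e-20, 1.0))

entropy = -np.sum(pi * log_pi, axis=1)         # H(pi) per sample

# Per sample: -log(pi(a)) * td  -  beta * H(pi), then summed over the batch.
policy_loss = np.sum(-np.sum(log_pi * a, axis=1) * td - entropy_beta * entropy)
print(policy_loss)
```

With this sign, lowering the loss raises the entropy, which matches the behavior the paper describes.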
If I am wrong, please just close this issue. I hope this helps you improve this implementation.
@miyosuda