miyosuda / async_deep_reinforce

Asynchronous Methods for Deep Reinforcement Learning
Apache License 2.0

Incorrect policy loss #49

Closed takuseno closed 7 years ago

takuseno commented 7 years ago

Hi, I'm reading this repository to implement my own A3C, and I found that the policy loss looks incorrect.

The current policy loss is at https://github.com/miyosuda/async_deep_reinforce/blob/master/game_ac_network.py#L31.

# policy entropy
entropy = -tf.reduce_sum(self.pi * log_pi, reduction_indices=1)

# policy loss (output)  (Adding minus, because the original paper's objective function is for gradient ascent, but we use gradient descent optimizer.)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.multiply( log_pi, self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )

So it's effectively policy_loss = -log(pi) * a + beta * entropy. In this case, the entropy would be minimized. However, the entropy should be maximized to avoid premature convergence and to encourage exploration. The original paper says:

[image: the objective gradient from the A3C paper, which adds an entropy regularization term β∇H(π(s_t; θ')) to encourage exploration]

Thus, the correct policy loss should be policy_loss = -log(pi) * a - beta * entropy.

If I am wrong, please just close this issue. I hope this helps improve the implementation.

@miyosuda

takuseno commented 7 years ago

Sorry, I just found that I was wrong. [image]

Negating the whole expression, as in the quoted line, already follows the equation above: the outer minus applies to the entropy term as well, so the entropy is in fact maximized, not minimized.
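
For anyone else landing here: a minimal NumPy sketch (with made-up values for pi, a, td, and entropy_beta) shows that the outer minus turns the entropy term into -beta * H(pi) in the loss, so minimizing the loss maximizes the entropy, matching the paper's objective.

import numpy as np

pi = np.array([0.7, 0.2, 0.1])    # hypothetical action probabilities
a = np.array([1.0, 0.0, 0.0])     # one-hot encoding of the chosen action
td = 0.5                          # advantage estimate (R - V)
entropy_beta = 0.01

log_pi = np.log(pi)
entropy = -np.sum(pi * log_pi)    # H(pi), same as the first quoted line

# single-sample version of the quoted policy_loss
policy_loss = -(np.sum(log_pi * a) * td + entropy * entropy_beta)

# expanding the outer minus: the entropy enters the loss as -beta * H(pi)
expanded = -np.sum(log_pi * a) * td - entropy * entropy_beta
assert np.isclose(policy_loss, expanded)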

I'm closing this issue.