muupan / async-rl

Replicating "Asynchronous Methods for Deep Reinforcement Learning" (http://arxiv.org/abs/1602.01783)
MIT License

Sign of pi_loss? #22

Closed hholst80 closed 4 years ago

hholst80 commented 8 years ago

You compute the entropy in policy_output.py as:

```python
- probs * log_probs
```

with a minus sign. This is expected to be positive (non-negative to be precise).

You then compute pi_loss in a3c.py in a loop, subtracting both terms:

```python
for ...:
    pi_loss -= log_prob * advantage  # sign (rhs) = sign(-advantage)
    pi_loss -= self.beta * entropy   # sign (rhs) = 1
    v_loss += (v - R) ** 2 / 2
```

Finally, you take the total loss as a (weighted) sum of pi_loss and v_loss.
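To spell out how I read the signs, here is a self-contained sketch (the per-step numbers, the beta value, and the 0.5 weighting on v_loss are my assumptions for illustration, not necessarily what a3c.py does):

```python
import math

beta = 0.01  # entropy coefficient (value assumed for illustration)

# Made-up per-step values: (log_prob of chosen action, policy entropy, value v, return R)
trajectory = [
    (math.log(0.6), 1.05, 0.4, 1.0),
    (math.log(0.3), 0.90, 0.7, 0.5),
]

pi_loss = 0.0
v_loss = 0.0
for log_prob, entropy, v, R in trajectory:
    advantage = R - v                # advantage estimate, treated as a constant
    pi_loss -= log_prob * advantage  # minimizing this ascends on log_prob * advantage
    pi_loss -= beta * entropy        # minimizing this ascends on entropy (exploration bonus)
    v_loss += (v - R) ** 2 / 2       # value regression loss

total_loss = pi_loss + 0.5 * v_loss  # weighted sum handed to the optimizer
print(pi_loss, v_loss, total_loss)
```

Written this way, minimizing pi_loss under gradient descent is equivalent to maximizing log_prob * advantage + beta * entropy.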

Are you sure about this? It seems to me that both terms in the loop should be accumulated into pi_loss with += instead.

hholst80 commented 8 years ago

On the other hand, I think the purpose of the entropy term in pi_loss is to encourage high-entropy (exploratory) policies. Do you agree? If so, we should minimize the negative entropy, as you are doing.

[plot of the per-action entropy term -prob * log(prob) as a function of prob]

NOTE: The per-action term -prob * log(prob) reaches its maximum at prob = exp(-1), where it equals exp(-1).
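A quick numeric check of that note (plain Python, not from the repo):

```python
import math

# f(p) = -p * log(p) peaks at p = exp(-1), where f(p) = exp(-1) ~= 0.368;
# setting f'(p) = -(log(p) + 1) = 0 gives exactly p = exp(-1).
f = lambda p: -p * math.log(p)
for p in (0.1, 0.2, math.exp(-1), 0.5, 0.9):
    print(f"p = {p:.3f}, -p*log(p) = {f(p):.3f}")
```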

muupan commented 8 years ago

I'm sure about that.

hholst80 commented 8 years ago

Thank you for your time and for helping to clear up my confusion.