muupan / async-rl

Replicating "Asynchronous Methods for Deep Reinforcement Learning" (http://arxiv.org/abs/1602.01783)
MIT License

Sign of pi_loss? #22

Closed hholst80 closed 4 years ago

hholst80 commented 8 years ago

You compute the entropy in policy_output.py as:

```python
- probs * log_probs
```

with a minus sign. This is expected to be positive (non-negative to be precise).

You then compute pi_loss in a3c.py in a loop, subtracting both terms:

```python
for ...:
    pi_loss -= log_prob * advantage  # sign (rhs) = sign(-advantage)
    pi_loss -= self.beta * entropy   # sign (rhs) = 1
    v_loss += (v - R) ** 2 / 2
```

Finally, you take the total loss as a (weighted) sum of pi_loss and v_loss.
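To spell out how I read the signs, here is a self-contained sketch (the per-step numbers, the beta value, and the 0.5 weighting on v_loss are my assumptions for illustration, not necessarily what a3c.py does):

```python
import math

beta = 0.01  # entropy coefficient (value assumed for illustration)

# Made-up per-step values: (log_prob of chosen action, policy entropy, value v, return R)
trajectory = [
    (math.log(0.6), 1.05, 0.4, 1.0),
    (math.log(0.3), 0.90, 0.7, 0.5),
]

pi_loss = 0.0
v_loss = 0.0
for log_prob, entropy, v, R in trajectory:
    advantage = R - v                # advantage estimate, treated as a constant
    pi_loss -= log_prob * advantage  # minimizing this ascends on log_prob * advantage
    pi_loss -= beta * entropy        # minimizing this ascends on entropy (exploration bonus)
    v_loss += (v - R) ** 2 / 2       # value regression loss

total_loss = pi_loss + 0.5 * v_loss  # weighted sum handed to the optimizer
print(pi_loss, v_loss, total_loss)
```

Written this way, minimizing pi_loss under gradient descent is equivalent to maximizing log_prob * advantage + beta * entropy.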

Are you sure about this? It seems to me that both terms in the loop should be accumulated into pi_loss with += instead.

hholst80 commented 8 years ago

On the other hand, I think the purpose of the entropy term in pi_loss is to encourage high-entropy (exploratory) policies. Do you agree? If so, we should minimize the negative entropy, as you are doing.

[plot of the per-action entropy term -prob * log(prob) as a function of prob]

NOTE: The per-action term -prob * log(prob) reaches its maximum at prob = exp(-1), where it equals exp(-1).
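A quick numeric check of that note (plain Python, not from the repo):

```python
import math

# f(p) = -p * log(p) peaks at p = exp(-1), where f(p) = exp(-1) ~= 0.368;
# setting f'(p) = -(log(p) + 1) = 0 gives exactly p = exp(-1).
f = lambda p: -p * math.log(p)
for p in (0.1, 0.2, math.exp(-1), 0.5, 0.9):
    print(f"p = {p:.3f}, -p*log(p) = {f(p):.3f}")
```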

muupan commented 8 years ago

I'm sure about that.

hholst80 commented 8 years ago

Thank you for your time and for helping to clear up my confusion.