Closed hholst80 closed 4 years ago
On the other hand, I think the purpose of the entropy term in pi_loss is to encourage high-entropy action distributions. Do you agree? If so, we should minimize the negative entropy, as you are doing.
NOTE: A single term -prob*log(prob) of the entropy reaches its maximum at prob=exp(-1), where it equals exp(-1).
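A quick numerical check of the note above, as a minimal sketch. It reads the note as a statement about one term -p*log(p) of the entropy sum, not the entropy of a whole distribution (which is largest for the uniform distribution). The helper names `neg_p_log_p` and `entropy` are made up for illustration and are not taken from the repository.

```python
import math


def neg_p_log_p(p):
    """One term of the entropy sum: -p * log(p), defined as 0 at p == 0."""
    return 0.0 if p == 0.0 else -p * math.log(p)


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return sum(neg_p_log_p(p) for p in probs)


# Scan a grid over (0, 1) to locate the maximum of a single term.
grid = [i / 10000 for i in range(1, 10000)]
best_p = max(grid, key=neg_p_log_p)

print("argmax of -p*log(p):", best_p, "vs exp(-1) =", math.exp(-1))
print("value at exp(-1):", neg_p_log_p(math.exp(-1)))

# The entropy of a full distribution is non-negative and is
# largest when the probabilities are uniform.
print("entropy [0.5, 0.5]:", entropy([0.5, 0.5]))
print("entropy [0.9, 0.1]:", entropy([0.9, 0.1]))
```

Either way the entropy is non-negative, so subtracting it from a loss always lowers that loss.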
I'm sure about that.
pi_loss is what we want to minimize in optimization.
We want to maximize A(s,a) log pi(a|s), so a minus sign (-) is needed.
We want to maximize the entropy of pi(a|s), so a minus sign (-) is needed.

Thank you for your time and for helping to clear up my confusion.
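The sign argument in the reply can be sketched in a few lines of plain Python. This is only an illustration of the convention under discussion, not the code from a3c.py: the function name `a3c_pi_loss`, its arguments, and the weight `beta` are assumptions made up for this example.

```python
import math


def a3c_pi_loss(log_probs, advantages, entropies, beta=0.01):
    """Policy loss to be minimized (hypothetical helper, not the repo's code).

    Both quantities we want to maximize enter with a minus sign:
      - log pi(a|s) * A(s,a)  (policy-gradient objective)
      - entropy of pi(.|s)    (exploration bonus, weighted by beta)
    """
    pi_loss = 0.0
    for log_prob, advantage, ent in zip(log_probs, advantages, entropies):
        pi_loss -= log_prob * advantage  # maximize A(s,a) * log pi(a|s)
        pi_loss -= beta * ent            # maximize entropy of pi(.|s)
    return pi_loss


# With a positive advantage, a more probable action gives a lower loss.
low_prob = a3c_pi_loss([math.log(0.5)], [1.0], [0.0])
high_prob = a3c_pi_loss([math.log(0.9)], [1.0], [0.0])
print(low_prob, high_prob)

# With everything else fixed, higher entropy gives a lower loss.
low_ent = a3c_pi_loss([math.log(0.5)], [1.0], [0.1])
high_ent = a3c_pi_loss([math.log(0.5)], [1.0], [0.6])
print(low_ent, high_ent)
```

If the entropy term were accumulated with `+=` instead, minimizing the loss would push the policy toward low entropy, which is the opposite of the stated goal.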
You are computing entropy in policy_output.py with a minus sign. This is expected to be positive (non-negative, to be precise).
You are then computing pi_loss in a3c.py with a loop and subtracting terms. And finally you take loss as a (weighted) sum of pi_loss and v_loss.
Are you sure about this? It seems to me like you should add up pi_loss with += on both of the terms in the loop?