I noticed that here the `log_prob` variable is computed before the actor update, whereas in SAC's repo it is recomputed after the actor update (the paper also mentions in Section 6 that both the Q-function and the policy are updated before the entropy coefficient). By any chance, have you compared whether this detail makes a difference?
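
To make the two orderings concrete, here is a toy sketch in plain Python. It is not code from either repository. It assumes a 1-D Gaussian policy with a single learnable `log_std`, the commonly used entropy-coefficient loss `-(log_alpha * (log_prob + target_entropy)).mean()`, and a hypothetical actor step that simply shifts `log_std` by `actor_delta`. The noise is held fixed so that the only thing that changes between the two orderings is the policy used to evaluate `log_prob`.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)


def mean_log_prob(log_std, noise):
    """Mean log-density of actions a = mu + exp(log_std) * eps under a 1-D Gaussian policy."""
    return sum(-0.5 * e * e - log_std - 0.5 * LOG_2PI for e in noise) / len(noise)


def alpha_grad(mean_lp, target_entropy):
    """Gradient w.r.t. log_alpha of the assumed loss -log_alpha * (mean_lp + target_entropy)."""
    return -(mean_lp + target_entropy)


def compare_orderings(log_std=0.0, actor_delta=-0.1, target_entropy=-1.0, n=256, seed=0):
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Ordering A: log_prob is computed once, before the actor update, and reused.
    lp_before = mean_log_prob(log_std, noise)

    # Stand-in for the actor update (hypothetical: it just shifts log_std).
    log_std_after = log_std + actor_delta

    # Ordering B: log_prob is recomputed with the updated policy.
    lp_after = mean_log_prob(log_std_after, noise)

    return alpha_grad(lp_before, target_entropy), alpha_grad(lp_after, target_entropy)


if __name__ == "__main__":
    g_before, g_after = compare_orderings()
    print("log_alpha gradient, log_prob from before the actor update:", g_before)
    print("log_alpha gradient, log_prob from after the actor update: ", g_after)
```

In this toy setup the two gradients differ by exactly `actor_delta`, so the gap scales with how far a single actor step moves the policy. That only shows the two orderings are not numerically identical; it says nothing about whether the difference matters for final performance, which is the empirical question above.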