Discrepancy in SAC on entropy coefficient update

pfnet / pfrl

PFRL: a PyTorch-based deep reinforcement learning library

MIT License


Discrepancy in SAC on entropy coefficient update #177

Open marioyc opened 2 years ago

marioyc commented 2 years ago

I noticed that here the `log_prob` variable is computed before the update of the actor, while in the official SAC repo it is recomputed after the actor update. The paper also states in Section 6 that both the Q-function and the policy are updated before the entropy coefficient is updated. By any chance, have you compared whether this detail makes a difference?
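To make the two orderings concrete, here is a minimal, dependency-free sketch. It assumes a 1-D Gaussian policy, a hypothetical `actor_update` that only shrinks `log_std`, and the log-alpha form of the temperature loss, `J(log_alpha) = -log_alpha * mean(log_prob + target_entropy)`, with `log_prob` treated as a constant. All names (`train_step`, `temperature_grad`, `sample_log_probs`) are hypothetical and are not PFRL's API; the Q-function update is omitted. The only difference between the two paths is whether `log_prob` comes from the policy before or after the actor step.

```python
import math
import random


def gaussian_log_prob(x, mean, log_std):
    """Log-density of N(mean, exp(log_std)**2) evaluated at x."""
    std = math.exp(log_std)
    return -0.5 * ((x - mean) / std) ** 2 - log_std - 0.5 * math.log(2 * math.pi)


def sample_log_probs(policy, n, rng):
    """Sample n actions from a 1-D Gaussian policy and return their log-probabilities."""
    mean, log_std = policy
    std = math.exp(log_std)
    actions = [rng.gauss(mean, std) for _ in range(n)]
    return [gaussian_log_prob(a, mean, log_std) for a in actions]


def temperature_grad(log_probs, target_entropy):
    """Gradient of J(log_alpha) = -log_alpha * mean(log_prob + target_entropy).

    log_prob is treated as a constant (no gradient flows into the policy),
    so dJ/dlog_alpha = -mean(log_prob + target_entropy).
    """
    return -sum(lp + target_entropy for lp in log_probs) / len(log_probs)


def train_step(log_alpha, policy, actor_update, target_entropy, lr, recompute, rng, n=1024):
    """One toy step: (critic update omitted) -> actor update -> temperature update."""
    # log_prob of fresh actions under the policy *before* the actor update
    stale_log_probs = sample_log_probs(policy, n, rng)
    policy = actor_update(policy)
    if recompute:
        # ordering described for the official repo: resample with the updated policy
        log_probs = sample_log_probs(policy, n, rng)
    else:
        # ordering described in this issue: reuse the pre-update log_prob
        log_probs = stale_log_probs
    log_alpha -= lr * temperature_grad(log_probs, target_entropy)
    return log_alpha, policy


if __name__ == "__main__":
    # hypothetical actor step that lowers the policy's entropy by shrinking its std
    shrink_std = lambda p: (p[0], p[1] - 1.0)
    stale_alpha, _ = train_step(0.0, (0.0, 0.0), shrink_std, 1.0, 0.1, False, random.Random(0))
    fresh_alpha, _ = train_step(0.0, (0.0, 0.0), shrink_std, 1.0, 0.1, True, random.Random(0))
    print("log_alpha with stale log_prob:      ", stale_alpha)
    print("log_alpha with recomputed log_prob: ", fresh_alpha)
```

With these toy settings (target entropy 1.0, entropy about 1.42 before the actor step and about 0.42 after), the stale `log_prob` says the policy is still above the target and lowers `log_alpha`, while the recomputed one sees the entropy already below the target and raises it. In real training the actor step is small, so the gap is usually much smaller than in this exaggerated example.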

muupan commented 2 years ago

You are right; it seems to be a discrepancy from the official implementation. I do not remember whether I compared the two, maybe not.

marioyc commented 2 years ago

I see, no problem. Thanks for replying anyway.