Closed junhuang-ifast closed 4 years ago
Hi, @junhuang-ifast
and realised it's quite different from this version of the code
I think both codes calculate the actor's loss in the same way.
p-christ's code calculates the loss first and then the entropies. In fact, the entropies are computed as part of the loss calculation, so I compute the entropies explicitly first and then use them to calculate the loss.
I hope it helps understanding the algorithm, which maximizes (Q + alpha * entropy).
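To make the "explicit entropies first, then loss" order concrete, here is a minimal sketch of a discrete-SAC policy loss in that style. The function name, shapes, and tensors are illustrative assumptions, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def discrete_sac_policy_loss(logits, q_values, alpha):
    """Illustrative discrete-SAC actor loss (not the repo's exact code).

    logits, q_values: (batch, num_actions) tensors; alpha: temperature.
    Minimizing this loss maximizes E[Q] + alpha * entropy.
    """
    log_probs = F.log_softmax(logits, dim=1)
    probs = log_probs.exp()
    # Entropy of the categorical policy, computed explicitly per state.
    entropies = -(probs * log_probs).sum(dim=1)
    # Expected Q-value under the current policy (exact expectation over
    # the finite action set, no sampling needed in the discrete case).
    expected_q = (probs * q_values).sum(dim=1)
    # Loss built from the explicit entropies.
    return (-expected_q - alpha * entropies).mean()
```

With uniform logits over two actions and Q-values (1, 3), the expected Q is 2 and the entropy is ln 2, so the loss is -(2 + alpha * ln 2).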
Anyway, thank you for asking :)
p-christ's code calculates the loss first and then the entropies. In fact, the entropies are computed as part of the loss calculation, so I compute the entropies explicitly first and then use them to calculate the loss.
Wouldn't this lead to a different policy loss being calculated, which would cause the network to be updated differently?
Hi, @junhuang-ifast
I carefully read p-christ's code and noticed differences between mine and p-christ's.
According to equation (12) of the paper, mine should be correct. However, the results in the paper may have been produced with p-christ's version. (In fact, the two differ only by a constant factor, which amounts to a change in step size.) So I will test both cases and let you know.
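The "difference of the step size" claim can be checked directly: summing per-sample losses instead of averaging them rescales the gradient by the batch size, which is equivalent to multiplying the learning rate by the same factor. A small autograd check (hypothetical toy loss, not the repo's code):

```python
import torch

# Toy per-sample loss l_i = x_i**2; compare gradients of mean vs. sum.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
grad_mean = torch.autograd.grad((x ** 2).mean(), x)[0]

y = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
grad_sum = torch.autograd.grad((y ** 2).sum(), y)[0]

# The sum's gradient is exactly batch_size times the mean's gradient,
# so using sum with learning rate lr matches mean with lr * batch_size.
assert torch.allclose(grad_sum, len(x) * grad_mean)
```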
I had never noticed these differences. Thank you for pointing it out :)
@ahmadreza9 I'm a master's student.
I have no idea what you are talking about...
I just wanted to know why you used the sum for the policy loss; you had mentioned that you would test both cases and let @junhuang-ifast know. (I found the discussion in #7.)
Hi, I'm looking at this section of your code (actor loss calculation)
https://github.com/ku2482/sac-discrete.pytorch/blob/ac02fec3c872f6b25208c080f1ba2e745380e757/code/agent.py#L296
and realised it's quite different from this version of the code https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/f745a154e9f8ecb6119a86826969f50e57a6e5e7/agents/actor_critic_agents/SAC_Discrete.py#L81
Any reason for this? Thanks