difference in core code from p-christ's version

toshikwa / sac-discrete.pytorch

PyTorch implementation of SAC-Discrete.

MIT License

281 stars 34 forks source link

difference in core code from p-christ's version #2

Closed junhuang-ifast closed 4 years ago

junhuang-ifast commented 4 years ago

Hi, im looking at this section of your code (actor loss calculation)
https://github.com/ku2482/sac-discrete.pytorch/blob/ac02fec3c872f6b25208c080f1ba2e745380e757/code/agent.py#L296

and realised its quite different from this version of the code https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/f745a154e9f8ecb6119a86826969f50e57a6e5e7/agents/actor_critic_agents/SAC_Discrete.py#L81

any reason for this ? thanks

toshikwa commented 4 years ago

Hi, @junhuang-ifast

and realised its quite different from this version of the code

I think both codes calculate actor's loss in the same way.

p-christ's code calculates loss first, then calculates entropies. However, in fact entropies are calculated while calculating loss, so I explicitly calculate entropies, then calculate loss using entropies.

I hope it helps understanding the algorithm, which maximizes (Q + alpha * entropy).

Anyway, thank you for asking :)

junhuang-ifast commented 4 years ago

p-christ's code calculates loss first, then calculates entropies. However, in fact entropies are calculated while calculating loss, so I explicitly calculate entropies, then calculate loss using entropies.

wouldn't this lead to different policy loss being calculated? which leads to the network being updated differently.

toshikwa commented 4 years ago

Hi, @junhuang-ifast

I carefully read p-christ's code and noticed differences between mine and p-christ's.

p-christ's calculates loss as : (action_probs * (-Q + log_probs)).mean()
my code calculates loss as : (action_probs * (-Q + log_probs)).sum(dim=1).mean()

According to equation (12) of the paper, mine should be correct. However, results in the paper may be consequences of p-christ's version. (in fact, it's just the difference of the step size.) So I will test both cases and let you know.

I've never noticed this differences. Thank you for pointing it out :)

toshikwa commented 3 years ago

@ahmadreza9 I'm a master's student.

toshikwa commented 3 years ago

I have no idea what you are talking about...

ahmadreza9 commented 3 years ago

I have no idea what you are talking about...

I just wanted to know why you used sum for policy loss which you mentioned will test both cases and let @junhuang-ifast know. (I found descriptions in #7.)