pranz24 / pytorch-soft-actor-critic

PyTorch implementation of soft actor critic
MIT License
795 stars 179 forks

reparametrization trick issue #5

Closed tldoan closed 5 years ago

tldoan commented 5 years ago

It seems that you are not implementing the reparametrization trick when taking an action

https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/model.py#L98-L99

Although you wrote it in the comments, my point is: in that case, shouldn't you use a different loss function, as in this repo (it would be the second actor_loss there)? https://github.com/vitchyr/rlkit/blob/master/rlkit/torch/sac/twin_sac.py#L178-L184

Thanks!

tldoan commented 5 years ago

I mean: shouldn't you sample N(0, I) noise somewhere?

pranz24 commented 5 years ago

`x_t = normal.rsample()` is mean + std * N(0, 1), i.e., the reparametrization trick. If I had used `x_t = normal.sample()`, which draws directly from N(mean, std), then the loss function should have been `policy_loss = (log_pi * (alpha * log_pi - log_policy_target).detach()).mean()` <- log-derivative trick.
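For readers following along, here is a minimal sketch of the two options; the shapes and variable names are illustrative, not the exact code in model.py:

```python
import torch
from torch.distributions import Normal

# Illustrative policy outputs (in the repo these come from the policy network).
mean = torch.zeros(1, 2, requires_grad=True)
log_std = torch.zeros(1, 2, requires_grad=True)
normal = Normal(mean, log_std.exp())

# Reparametrization trick: x_t = mean + std * eps with eps ~ N(0, 1),
# so gradients of the actor loss flow back through mean and std.
x_t = normal.rsample()

# Plain sampling: x_t ~ N(mean, std) with no gradient path to mean/std.
# This would require the log-derivative (score function) estimator instead,
# i.e. policy_loss = (log_pi * (alpha * log_pi - log_policy_target).detach()).mean()
x_t_detached = normal.sample()
```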

tldoan commented 5 years ago

I see! Cool!

By the way, I have a doubt about the mean returned by the evaluate function: https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/model.py#L105

Shouldn't you tanh the mu? You also regularize it in the actor loss function: https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/sac.py#L150

Otherwise, thank you again for the code, it helps a ton!

pranz24 commented 5 years ago

Why should I tanh the mu?? I don't see any good reason to do so.

tldoan commented 5 years ago

For the same reason you tanh the action (action = mu + epsilon * sigma): https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/model.py#L100. mu(s) is what gets used at evaluation time, which is not implemented in your repo (the epsilon * sigma term is for exploration during training). I realized this when I saw that the return at training time was much better than at evaluation time.

It is a question of consistency: either you "tanh" everywhere or you remove it everywhere.

Does that make sense?

pranz24 commented 5 years ago

I see! Yes, you can use "tanh" on both the mean and log_std. This will also solve the problem with the eval part, because I haven't applied tanh to the mean at eval time. (I'll fix that.)
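A rough sketch of what that fix could look like; `policy` here is a hypothetical callable returning (mean, log_std), not the repo's actual API:

```python
import torch

def select_action(state, policy, evaluate=False):
    # `policy(state)` is assumed to return the pre-squash mean and log_std.
    mean, log_std = policy(state)
    if evaluate:
        # Deterministic evaluation: squash the mean with tanh, consistent with
        # the tanh applied to sampled actions during training.
        return torch.tanh(mean)
    std = log_std.exp()
    eps = torch.randn_like(mean)          # eps ~ N(0, 1)
    return torch.tanh(mean + eps * std)   # stochastic action for exploration
```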

tldoan commented 5 years ago

But most importantly, my question was about the mu you regularize in the loss function: should the penalty regularizer be `0.001 * tanh(mean).pow(2).mean()`? (I think so, but I'm not sure.) https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/sac.py#L150

pranz24 commented 5 years ago

To be very honest, I don't know. In the latest code, the authors don't use a regularization loss at all, and they have also dropped the value function: https://github.com/rail-berkeley/softlearning/blob/master/softlearning/algorithms/sac.py So I think you can drop the regularization loss (if you want to). Also, I haven't implemented eval runs at all. (I'll do that when I get time.)
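As a sketch of the two options being discussed here; the 0.001 coefficient and the variable names are only assumptions for illustration:

```python
import torch

mean = torch.randn(64, 2)        # illustrative pre-tanh policy mean
log_std = torch.randn(64, 2)     # illustrative log standard deviation
actor_loss = torch.tensor(0.0)   # placeholder for the SAC actor loss term

# Older style: add a small L2 penalty on the raw policy outputs.
reg_loss = 0.001 * mean.pow(2).mean() + 0.001 * log_std.pow(2).mean()
policy_loss_with_reg = actor_loss + reg_loss

# Newer softlearning style: no regularization term at all.
policy_loss_no_reg = actor_loss
```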

tldoan commented 5 years ago

Yes, you're right, it's confusing even here: https://github.com/vitchyr/rlkit/blob/master/rlkit/torch/sac/twin_sac.py#L220-L224

It seems they regularize mu, tanh(mu), and std, lol.

I've seen so many different versions of this code (with different parameters each time, of course), so it's hard to know which one to rely on.

pranz24 commented 5 years ago

Yup