tldoan closed this issue 5 years ago
I mean: shouldn't you sample normal(0, I) noise somewhere?
`x_t = normal.rsample()` is mean + std * N(0, 1), i.e. the reparameterization trick. If I had used `x_t = normal.sample()`, which draws from N(mean, std) without a gradient path, then the loss function should have been `policy_loss = (log_pi * (alpha*log_pi - log_policy_target).detach()).mean()` <- the log-derivative trick.
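For reference, the distinction can be sketched in plain PyTorch (the tensors below are toy stand-ins for the policy head's outputs, not the repo's actual model):

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# Toy stand-ins for the policy head's outputs.
mean = torch.zeros(3, requires_grad=True)
log_std = torch.zeros(3, requires_grad=True)
std = log_std.exp()

normal = Normal(mean, std)

# rsample() draws epsilon ~ N(0, I) internally and returns mean + std * epsilon,
# so gradients flow back through mean and std (reparameterization trick).
x_t = normal.rsample()

# sample() draws from N(mean, std) under no_grad, so the result is detached;
# a loss built on it needs the log-derivative (score function) trick instead.
y_t = normal.sample()

print(x_t.requires_grad)  # True: differentiable path
print(y_t.requires_grad)  # False: detached path
```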
I see! Cool!
By the way, I have a doubt about the mean returned by the function evaluate: https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/model.py#L105
Shouldn't you tanh the mu? Because you are also regularizing it in the actor loss function: https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/sac.py#L150
Otherwise, thank you again for the code, it helps a ton!
Why should I tanh the mu? I don't see any good reason to do so.
For the same reason you tanh the action (action = mu + epsilon * sigma): https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/model.py#L100
mu(s) is used at evaluation time, which is not implemented in your repo (the epsilon * sigma term is for exploration at training time). I realized this when I saw that the return at training time was much better than at evaluation time.
It is a question of consistency: either you tanh everywhere, or you remove it everywhere.
Does that make sense?
I see! Yes, you can use tanh on both mean and log_std. This will also solve the problem with the eval part, because I haven't used tanh on the mean during eval time. (I'll fix that.)
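A minimal sketch of the fix being discussed (hypothetical values, not the repo's exact code): squash the deterministic eval action with the same tanh used on the stochastic training action, so both respect the [-1, 1] action bounds.

```python
import torch

torch.manual_seed(0)

# Hypothetical policy outputs for one state.
mean = torch.tensor([0.5, 2.0, -3.0])
log_std = torch.tensor([-1.0, -1.0, -1.0])

# Training time: squash the reparameterized sample mu + sigma * epsilon.
epsilon = torch.randn_like(mean)
train_action = torch.tanh(mean + log_std.exp() * epsilon)

# Evaluation time: squash the mean the same way, for consistency.
eval_action = torch.tanh(mean)

# Both actions now stay inside the [-1, 1] bounds.
assert float(eval_action.abs().max()) <= 1.0
assert float(train_action.abs().max()) <= 1.0
```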
But most importantly, my question was about the mu you regularize in the loss function: should the penalty regularizer be `0.001 * tanh(mean).pow(2).mean()`? (I think so, but I'm not sure.) https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/sac.py#L150
To be very honest, I don't know. In the latest code, the authors don't use the regularization loss at all, and they also dropped the value function: https://github.com/rail-berkeley/softlearning/blob/master/softlearning/algorithms/sac.py So I think you can drop the regularization loss (if you want to). Also, I haven't implemented eval runs at all. (Will do that when I get time.)
Yes, you're right, it's confusing even here: https://github.com/vitchyr/rlkit/blob/master/rlkit/torch/sac/twin_sac.py#L220-L224
It seems they regularize mu, tanh(mu), and std, lol.
I've seen so many different versions of this code (with different parameters each time, of course) that it's hard to know which one to rely on.
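For what it's worth, the kind of penalty being debated can be sketched like this (the coefficient and the exact set of terms vary between versions; all values here are made up for illustration):

```python
import torch

# Made-up policy outputs for illustration.
mean = torch.tensor([0.5, -2.0])
log_std = torch.tensor([-1.0, 0.5])

reg_coef = 1e-3  # e.g. the 0.001 in 0.001 * mean.pow(2).mean()

# Some versions penalize the pre-tanh mean, some the squashed tanh(mean),
# and some the log_std as well; this sums all three for illustration.
policy_reg_loss = reg_coef * (
    mean.pow(2).mean()
    + torch.tanh(mean).pow(2).mean()
    + log_std.pow(2).mean()
)

print(policy_reg_loss.item())  # a small non-negative scalar
```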
Yup
It seems that you are not implementing the reparameterization trick when taking an action:
https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/model.py#L98-L99
Although you wrote it in the comments, my point is that in that case, shouldn't you use a different loss function, according to this repo? (Should this be the second actor_loss function?) https://github.com/vitchyr/rlkit/blob/master/rlkit/torch/sac/twin_sac.py#L178-L184
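To illustrate the point: with sample() (no reparameterization) the actor loss has to take the likelihood-ratio form, while with rsample() it can differentiate through the action directly. A toy sketch (the Q-value here is a stand-in, not the repo's critic):

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

mean = torch.zeros(2, requires_grad=True)
log_std = torch.zeros(2, requires_grad=True)
alpha = 0.2  # entropy temperature

normal = Normal(mean, log_std.exp())

# Reparameterized path: gradients flow through the sampled action itself.
action = normal.rsample()
log_pi = normal.log_prob(action).sum()
q_value = -(action ** 2).sum()  # toy stand-in for Q(s, a)
actor_loss_rsample = alpha * log_pi - q_value

# Score-function path: the draw is detached, so the loss must use the
# log-derivative trick, with the advantage term detached.
action2 = normal.sample()
log_pi2 = normal.log_prob(action2).sum()
q_value2 = -(action2 ** 2).sum()
actor_loss_reinforce = log_pi2 * (alpha * log_pi2 - q_value2).detach()

actor_loss_rsample.backward()
print(mean.grad is not None)  # True: gradient reached the policy parameters
```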
Thanks!