Closed: alirezakazemipour closed this issue 4 years ago.
You will get the correct results even without action scaling; we clip the actions at the end anyway. @ku2482 made the argument that gradients and performance will be much smoother with action scaling, which I intuitively agree with, and that's why it is used in the master branch.
You are free to make a PR for this (on the SAC_V branch) if you want. :sweat_smile:
Yeah, you have clipped it at the end, when the `select_action` method is called, but the problem arises whenever you call the `sample` method of the `GaussianPolicy`, like:

```python
pi, log_pi, mean, log_std = self.policy.sample(state_batch)
```

Using that `pi` seems incorrect because it has not been scaled. :smiley:
I have implemented SAC based on your repo, which was very readable, and to be honest no PR would be required :smile: except for the scaling part that is missing on the SAC_V branch, and the effect it has had on enforcing action bounds on the master branch too: :point_down:

```python
log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + epsilon)
```

That multiplication by `self.action_scale` does not look correct to me.
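For reference, here is a sketch of where that factor would come from under the master-branch convention `action = tanh(u) * action_scale + action_bias`, with `s_i`/`b_i` denoting the per-dimension scale and bias and `ε` being the numerical-stability constant:

```latex
% Change of variables for a rescaled, tanh-squashed Gaussian sample:
%   a_i = s_i \tanh(u_i) + b_i, \quad u \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta(s))
% The per-dimension Jacobian is |da_i/du_i| = s_i (1 - \tanh^2(u_i)), so
\log \pi(a \mid s) = \log \mathcal{N}\!\left(u \mid \mu_\theta(s), \sigma_\theta(s)\right)
  - \sum_i \log\!\left( s_i \left(1 - \tanh^2(u_i)\right) + \epsilon \right)
```

Without the rescaling (as `sample` currently stands on SAC_V), the consistent correction would drop the `s_i` factor.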
In the end, gym envs will by default clip the action to the env's action range even if I don't use clipping. The problem with `pi` will only arise if you initialize the policy weights in the wrong manner; `xavier_uniform_` and `orthogonal_` will work in this case. Wrong initialization, with or without action scaling, will cause problems.
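For instance, a minimal init helper along those lines (a sketch; the helper name and gain value are illustrative, not necessarily the repo's exact code):

```python
import torch.nn as nn

def weights_init_(m):
    # Xavier-uniform init for linear layers, as suggested above;
    # nn.init.orthogonal_ would be a drop-in alternative.
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight, gain=1)
        nn.init.constant_(m.bias, 0)

# Usage on a hypothetical policy network:
# policy.apply(weights_init_)
```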
Thank you very much.
Hey.

```python
action = y_t * self.action_scale + self.action_bias
```

This part is missing from the `sample` function on the SAC_V branch.
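Putting it together, here is a minimal sketch of what `sample` could look like with the scaling restored (assuming `action_scale`/`action_bias` buffers derived from the env's action space, as on master; layer sizes and constants are illustrative, not the repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

EPSILON = 1e-6                      # numerical-stability constant
LOG_STD_MIN, LOG_STD_MAX = -20, 2   # illustrative clamp range for log_std

class GaussianPolicy(nn.Module):
    """A tanh-squashed Gaussian policy with action rescaling (a sketch)."""

    def __init__(self, num_inputs, num_actions, hidden_dim, action_space):
        super().__init__()
        self.linear1 = nn.Linear(num_inputs, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, hidden_dim)
        self.mean_linear = nn.Linear(hidden_dim, num_actions)
        self.log_std_linear = nn.Linear(hidden_dim, num_actions)
        # Per-dimension rescaling from (-1, 1) to [low, high].
        self.register_buffer("action_scale", torch.as_tensor(
            (action_space.high - action_space.low) / 2.0, dtype=torch.float32))
        self.register_buffer("action_bias", torch.as_tensor(
            (action_space.high + action_space.low) / 2.0, dtype=torch.float32))

    def forward(self, state):
        x = F.relu(self.linear1(state))
        x = F.relu(self.linear2(x))
        mean = self.mean_linear(x)
        log_std = self.log_std_linear(x).clamp(LOG_STD_MIN, LOG_STD_MAX)
        return mean, log_std

    def sample(self, state):
        mean, log_std = self.forward(state)
        normal = Normal(mean, log_std.exp())
        x_t = normal.rsample()               # reparameterization trick
        y_t = torch.tanh(x_t)                # squash into (-1, 1)
        # The line reported missing on SAC_V: rescale into the env's action range.
        action = y_t * self.action_scale + self.action_bias
        log_prob = normal.log_prob(x_t)
        # Change-of-variables correction for tanh squashing plus rescaling.
        log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + EPSILON)
        log_prob = log_prob.sum(1, keepdim=True)
        mean = torch.tanh(mean) * self.action_scale + self.action_bias
        return action, log_prob, mean, log_std
```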