HJ-TANG opened this issue 4 years ago

Hi, I'm using PPO1 in my experiments. I scale the actions to [-1, 1] and rescale them inside the environment. The results are relatively good except for one issue: the actions are very unstable and only ever take the values -1 and 1.

In this comment it's said that squashing the actions could help, and I think it may be helpful to me. So I want to know how to implement this; just setting the activation function of the custom policy to tanh does not seem to work. Or do you have any good ideas about this issue?

Thanks a lot!
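For reference, rescaling an action from [-1, 1] to the environment's bounds is the standard linear map shown below. This is a minimal sketch of that common pattern (the wrapper and its name are illustrative, not the actual code from the issue):

```python
import gym
import numpy as np


class RescaleActionWrapper(gym.ActionWrapper):
    """Illustrative wrapper: the agent acts in [-1, 1],
    the wrapped env receives actions in [low, high]."""

    def __init__(self, env):
        super().__init__(env)
        self.low = env.action_space.low
        self.high = env.action_space.high
        # The agent sees a normalized action space.
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0,
                                           shape=env.action_space.shape,
                                           dtype=np.float32)

    def action(self, action):
        # Linear map from [-1, 1] to [low, high].
        return self.low + 0.5 * (action + 1.0) * (self.high - self.low)
```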
I believe this relates to the clipping of Gaussian distributions, and a possible fix could be the "squashing" described by @araffin in https://github.com/hill-a/stable-baselines/issues/704#issuecomment-596092053. @araffin, can you comment, as this is more your area?
Meanwhile, you could try PPO2 (a more mature implementation) or SAC (which supports squashing out of the box).
Regarding the squashing, the best reference is the SAC paper. Otherwise, you can take a look at https://github.com/DLR-RM/stable-baselines3/blob/a1e055695c3638f9f15de0cb805b8fcbb5c02764/stable_baselines3/common/distributions.py#L195
or https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/sac/policies.py#L44
to see how to properly replace the Gaussian distribution with a squashed one and account for the squashing when computing the log-likelihood.
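Concretely, the correction comes from the change of variables a = tanh(u): the log-likelihood of the squashed action must subtract the log-determinant of the tanh Jacobian. A minimal PyTorch sketch of this idea, following the SAC paper (not the exact SB3 code linked above):

```python
import torch
import torch.distributions as td


def sample_squashed_gaussian(mean, log_std, eps=1e-6):
    """Sample a tanh-squashed Gaussian action and its corrected log-prob.

    Implements the change-of-variables correction from the SAC paper:
    log pi(a) = log mu(u) - sum_i log(1 - tanh(u_i)^2), with a = tanh(u).
    """
    dist = td.Normal(mean, log_std.exp())
    gaussian_action = dist.rsample()       # u ~ N(mean, std), reparameterized
    action = torch.tanh(gaussian_action)   # squash into (-1, 1)
    log_prob = dist.log_prob(gaussian_action).sum(dim=-1)
    # Jacobian correction for the tanh squashing (eps avoids log(0)).
    log_prob -= torch.log(1.0 - action.pow(2) + eps).sum(dim=-1)
    return action, log_prob
```

Without the correction term, the log-likelihood still describes the unsquashed Gaussian sample, which biases the policy gradient.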
As mentioned in the docs, because you have continuous actions, it is recommended to give SAC/TD3 a try, using the tuned hyperparameters from the rl-zoo.
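For instance, a minimal SAC run with stable-baselines looks roughly like this (Pendulum-v0 and the default hyperparameters are placeholders; the tuned values live in the rl-zoo):

```python
import gym

from stable_baselines import SAC

env = gym.make('Pendulum-v0')  # any continuous-action env

# SAC uses a squashed (tanh) Gaussian policy out of the box,
# so actions are naturally bounded in [-1, 1].
model = SAC('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=50000)
```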