hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] unstable actions in PPO #1018

Open HJ-TANG opened 3 years ago

HJ-TANG commented 3 years ago

Hi, I'm using PPO1 in my experiments. I scale the actions to [-1, 1] and rescale them inside the environment. The result is relatively good except for one issue: the actions are very unstable and only ever take the values -1 and 1.

In this comment it is said:

adding normalization directly in the network (through a Tanh layer)

Telling the agent that the actions are in [-1, 1] without using Tanh simply does not work, because it saturates the actions (at -1 or 1) from the first evaluation step, which makes the actor immediately unstable (and therefore learning proceeds very, very slowly, if it proceeds at all).

I think this may be helpful in my case, so I would like to know how to implement it; simply setting the activation function of the custom policy to tanh does not seem to work. Or do you have any other ideas about this issue?

Thanks a lot!
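
(For context, a minimal sketch of the "scale to [-1, 1] and rescale in the environment" setup described above; the wrapper below is illustrative and not part of the stable-baselines API.)

```python
import gym
import numpy as np

class RescaleActionWrapper(gym.ActionWrapper):
    """Expose a [-1, 1] action space to the agent, rescale to the env's real bounds."""

    def __init__(self, env):
        super(RescaleActionWrapper, self).__init__(env)
        self.low = env.action_space.low
        self.high = env.action_space.high
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0,
                                           shape=env.action_space.shape,
                                           dtype=np.float32)

    def action(self, action):
        # Map the agent's [-1, 1] action to [low, high] of the wrapped environment
        action = np.clip(action, -1.0, 1.0)
        return self.low + 0.5 * (action + 1.0) * (self.high - self.low)
```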

Miffyli commented 3 years ago

I believe this relates to the clipping of Gaussian distributions, and a possible fix could be to do "squashing" as described by @araffin in https://github.com/hill-a/stable-baselines/issues/704#issuecomment-596092053. @araffin can you comment, as this is more your area?

Meanwhile you could try PPO2 (a more mature implementation) or SAC (which should support squashing).

araffin commented 3 years ago

Regarding the squashing, the best is to read the SAC paper. Otherwise, you can take a look at https://github.com/DLR-RM/stable-baselines3/blob/a1e055695c3638f9f15de0cb805b8fcbb5c02764/stable_baselines3/common/distributions.py#L195

or https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/sac/policies.py#L44

to see how to properly replace the Gaussian distribution with a squashed one and account for it when computing the log-likelihood.
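
(A minimal sketch of that squashing, assuming TF 1.x as used by stable-baselines; `mean` and `log_std` stand for the policy network outputs. It mirrors the tanh change-of-variables correction used in the linked SAC policies, but is only an illustration, not the library code.)

```python
import numpy as np
import tensorflow as tf  # TF 1.x API, as required by stable-baselines

EPS = 1e-6  # avoids log(0) / division by zero

def squashed_gaussian_sample(mean, log_std):
    """Sample a tanh-squashed Gaussian action and its corrected log-likelihood."""
    std = tf.exp(log_std)
    pre_tanh = mean + std * tf.random_normal(tf.shape(mean))
    action = tf.tanh(pre_tanh)  # guaranteed to lie in (-1, 1), no hard clipping

    # Log-likelihood of the pre-squash sample under the diagonal Gaussian
    gaussian_log_prob = tf.reduce_sum(
        -0.5 * (((pre_tanh - mean) / (std + EPS)) ** 2
                + 2.0 * log_std + np.log(2.0 * np.pi)),
        axis=-1)

    # Change-of-variables correction for the tanh squashing (SAC paper, appendix)
    log_prob = gaussian_log_prob - tf.reduce_sum(
        tf.log(1.0 - action ** 2 + EPS), axis=-1)
    return action, log_prob
```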

As mentioned in the docs, since you have continuous actions, it is recommended to give SAC/TD3 a try, using the tuned hyperparameters from the rl-zoo.
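
(A minimal sketch of that suggestion, using the SAC defaults on Pendulum-v0 rather than the tuned rl-zoo hyperparameters.)

```python
from stable_baselines import SAC

# SAC's policy already squashes actions with tanh, so they stay inside
# [-1, 1] smoothly instead of saturating at the bounds.
model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1)
model.learn(total_timesteps=50000)
model.save('sac_pendulum')
```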