Closed fuyw closed 2 years ago
We do care, but not in the agent; the action is clipped within the wrapper. It automatically clips the action to the env-specific bounds: https://github.com/vwxyzjn/cleanrl/blob/1099a5172c8d0d12b33e52013a799b1a3cc07fcc/cleanrl/ppo_continuous_action.py#L89
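A minimal sketch of what an action-clipping wrapper does, assuming a hypothetical ToyEnv standing in for a real mujoco env (names here are illustrative, not cleanrl's actual code):

```python
class ToyEnv:
    # Hypothetical env with a bounded 1-D action space.
    action_low, action_high = -1.0, 1.0

    def step(self, action):
        # Real envs can misbehave on out-of-range actions;
        # here we just echo the action back for illustration.
        return action


class ClipActionWrapper:
    # Sketch of gym.wrappers.ClipAction: clip before the env sees it.
    def __init__(self, env):
        self.env = env

    def step(self, action):
        clipped = max(self.env.action_low, min(self.env.action_high, action))
        return self.env.step(clipped)


env = ClipActionWrapper(ToyEnv())
print(env.step(2.5))  # the env receives 1.0, not 2.5
```

The key point is that the agent still stores the raw, unclipped action and its log-prob; only the env sees the clipped value.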
Thank you.
I noticed that if we remove the gym.wrappers.ClipAction wrapper and instead clip the actions directly after sampling from the policy, the PPO algorithm runs poorly.
The only difference is whether the PPOBuffer contains unclipped actions. In the latter case, the PPOBuffer contains only clipped actions, and the resulting wrong log_prob values lead to instabilities.
In the former case, the PPO agent learns from some unclipped actions, which is unwanted but necessary for stable training.
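The log_prob mismatch described above can be shown with a plain Gaussian log-density (a sketch with hand-picked numbers, not cleanrl code):

```python
import math


def normal_log_prob(x, mu, std):
    # log N(x; mu, std) for a scalar Gaussian
    return (-0.5 * ((x - mu) / std) ** 2
            - math.log(std)
            - 0.5 * math.log(2 * math.pi))


mu, std = 0.0, 1.0
sampled = 2.5                 # raw sample from Normal(mu, std)
clipped = min(sampled, 1.0)   # env action bound at 1.0

lp_sampled = normal_log_prob(sampled, mu, std)
lp_clipped = normal_log_prob(clipped, mu, std)

# The two log-probs disagree, so a buffer that stores clipped actions
# while the stored log_prob was computed for the raw sample (or vice
# versa) produces wrong PPO importance ratios.
print(lp_sampled, lp_clipped)
```

Since the PPO ratio is exp(new_log_prob - old_log_prob), evaluating the new policy on clipped actions while old_log_prob came from the raw samples biases the ratio and destabilizes training.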
Why is it unwanted? If you need a bounded distribution, you can always use a Beta distribution instead of a Normal.
It's unwanted because the action falls outside the environment's action limits, e.g., in mujoco. Using a Beta distribution would be a nice choice, though it is less popular in existing PPO implementations.
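The Beta-policy idea can be sketched in a few lines: sample in (0, 1) and rescale to the env bounds, so the action is bounded by construction and needs no clipping (alpha/beta would normally come from the policy network; here they are fixed for illustration):

```python
import random


def beta_policy_action(alpha, beta, low, high):
    # Sample u ~ Beta(alpha, beta) in (0, 1), then rescale to
    # [low, high]; no clipping (and no log_prob mismatch) needed.
    u = random.betavariate(alpha, beta)
    return low + (high - low) * u


random.seed(0)
a = beta_policy_action(2.0, 2.0, -1.0, 1.0)
print(a)  # always within [-1, 1]
```

In a real PPO implementation the network would output alpha, beta > 1 (often via softplus(x) + 1) to keep the density unimodal.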
Many thanks for the nice repo.
I have a question about the PPO algorithm, where we sample an action from the policy using a ~ Normal(mu, std). Why don't we care about the action bounds as we do in SAC, given that the sampled action a is unbounded?
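For contrast, the SAC-style approach mentioned above squashes the Gaussian sample with tanh and corrects the log-prob via the change of variables; a minimal scalar sketch (not cleanrl's SAC code):

```python
import math


def tanh_gaussian_log_prob(u, mu, std):
    # SAC-style bounded action: a = tanh(u) lies in (-1, 1).
    # log pi(a) = log N(u; mu, std) - log(1 - tanh(u)^2)
    # (a small epsilon guards the log against tanh(u)^2 == 1).
    log_n = (-0.5 * ((u - mu) / std) ** 2
             - math.log(std)
             - 0.5 * math.log(2 * math.pi))
    a = math.tanh(u)
    correction = math.log(1.0 - a * a + 1e-6)
    return a, log_n - correction


a, lp = tanh_gaussian_log_prob(0.7, 0.0, 1.0)
print(a, lp)  # a is bounded; lp is the corrected log-prob
```

PPO implementations typically skip this squashing and rely on the ClipAction wrapper instead, which is exactly why the buffer must hold the raw samples and their matching log-probs.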