vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

A question about the `PPO` algorithm #240

Closed: fuyw closed this issue 2 years ago

fuyw commented 2 years ago

Many thanks for the nice repo.

I have a question about the PPO algorithm, where we sample the action from the policy using a ~ Normal(mu, std).

Why don't we handle the action bounds as we do in SAC, given that the sampled action a is unbounded?

Howuhh commented 2 years ago

We do care, but not in the agent: the action is clipped by a wrapper, which automatically clips it to the env-specific bounds: https://github.com/vwxyzjn/cleanrl/blob/1099a5172c8d0d12b33e52013a799b1a3cc07fcc/cleanrl/ppo_continuous_action.py#L89
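For context, a minimal sketch of where the wrapper sits in the env setup (illustrative, not a verbatim copy of the linked file; the `make_env` name is just for this example):

```python
import gym

def make_env(env_id):
    env = gym.make(env_id)
    # the policy samples an unbounded a ~ Normal(mu, std); this wrapper clips
    # the action to env.action_space.low/high right before env.step() sees it,
    # while the agent itself still works with the unclipped sample
    env = gym.wrappers.ClipAction(env)
    return env
```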

fuyw commented 2 years ago

Thank you.

fuyw commented 2 years ago

I noticed that if we remove gym.wrappers.ClipAction and instead clip the actions directly after sampling from the policy, the PPO algorithm performs poorly.

The only difference is whether the PPOBuffer contains the unclipped actions. In the latter case, the buffer only contains clipped actions, so the stored log_prob values no longer match the actions, which leads to instabilities.

In the former case, the PPO agent learns from some unclipped actions, which is unwanted but necessary for stable training.
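To make the mismatch concrete, here is a hedged sketch (plain PyTorch; the bounds and Gaussian parameters are illustrative stand-ins, not cleanrl's actual code):

```python
import torch
from torch.distributions import Normal

# illustrative stand-ins: a 1-D Gaussian policy and MuJoCo-style bounds [-1, 1]
mu, std = torch.tensor([0.0]), torch.tensor([1.5])
low, high = -1.0, 1.0
dist = Normal(mu, std)

# Variant 1 (wrapper clipping): store the raw sample and its log-prob;
# gym.wrappers.ClipAction clips only the copy passed to env.step().
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)          # consistent with `action`

# Variant 2 (clip before storing): the buffer now holds the clipped action, so
# whenever clipping was active, the log-prob recomputed at update time is
# evaluated at a point that was never sampled, which biases the PPO ratio.
clipped = torch.clamp(action, low, high)
log_prob_at_clipped = dist.log_prob(clipped).sum(-1)
```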

Howuhh commented 2 years ago

Why is it unwanted? If you need a bounded distribution, you can always use a Beta distribution instead of a Normal.
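A minimal sketch of that idea, assuming a PyTorch agent (the class name and the rescaling to the env bounds are illustrative, not part of cleanrl):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Beta

class BetaPolicyHead(nn.Module):
    """Illustrative bounded policy head: Beta samples always lie in [0, 1]."""

    def __init__(self, hidden_dim, action_dim, low, high):
        super().__init__()
        self.alpha_layer = nn.Linear(hidden_dim, action_dim)
        self.beta_layer = nn.Linear(hidden_dim, action_dim)
        self.register_buffer("low", torch.as_tensor(low, dtype=torch.float32))
        self.register_buffer("high", torch.as_tensor(high, dtype=torch.float32))

    def forward(self, h):
        # softplus + 1 keeps both concentrations > 1, so the density is unimodal
        alpha = F.softplus(self.alpha_layer(h)) + 1.0
        beta = F.softplus(self.beta_layer(h)) + 1.0
        dist = Beta(alpha, beta)
        u = dist.rsample()                                # sample in [0, 1]
        action = self.low + (self.high - self.low) * u    # rescale to env bounds
        # the constant Jacobian of the affine rescaling cancels in the PPO ratio,
        # so log_prob(u) can be used directly
        return action, dist.log_prob(u).sum(-1)
```

With such a head, no clipping is needed at all, since every sample already lies within the bounds.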

fuyw commented 2 years ago

It's unwanted because such actions fall outside the environment's action limits, e.g., in MuJoCo. Using a Beta distribution would be a nice choice, though it is less common in existing PPO implementations.