mesjou opened this issue 3 years ago
In a later paper by Hsu et al., 2020, two common design choices in PPO are revisited: (1) the clipped probability ratio used for policy regularization and (2) parameterizing the policy's action space with a continuous Gaussian or a discrete softmax distribution. The authors first identify three failure modes in PPO and then propose replacements for these two designs.
The failure modes are:

1. On continuous action spaces, standard PPO is unstable when rewards vanish outside the bounded support.
2. On discrete action spaces with sparse high rewards, standard PPO often gets stuck at suboptimal actions.
3. The policy is sensitive to initialization when there are locally optimal actions close to the initialization.
Discretizing the action space or using a Beta distribution helps avoid failure modes 1 & 3 associated with the Gaussian policy. Using KL regularization (same motivation as in TRPO) as an alternative to the clipped surrogate objective helps resolve failure modes 1 & 2.
Source: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#ppo
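For concreteness, here is a minimal sketch (not the paper's implementation) of what replacing the clipped ratio with a KL penalty could look like. The tensor names (`log_probs_new`, `log_probs_old`, `advantages`), the fixed coefficient `kl_coef`, and the KL direction are assumptions for illustration only:

```python
import torch

def kl_regularized_policy_loss(log_probs_new, log_probs_old, advantages, kl_coef=1.0):
    """Policy loss that regularizes with a KL penalty instead of ratio clipping."""
    ratio = torch.exp(log_probs_new - log_probs_old)      # pi_new(a|s) / pi_old(a|s)
    surrogate = ratio * advantages                        # unclipped policy-gradient surrogate
    # Sample-based estimate of KL(pi_old || pi_new) using actions drawn from pi_old.
    approx_kl = log_probs_old - log_probs_new
    # Maximize the surrogate minus the KL penalty, i.e. minimize its negative.
    return -(surrogate - kl_coef * approx_kl).mean()
```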
Unstable behavior could be caused by the Gaussian action distribution. Possible solution: switch to a Beta distribution (see the sketch below).
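As an illustration of the Beta-policy idea, here is a minimal PyTorch sketch of a policy head that outputs a Beta distribution over a bounded action range. The network sizes and the names `obs_dim`, `act_dim`, `action_low`, `action_high` are assumptions, not from the paper:

```python
import torch
import torch.nn as nn
from torch.distributions import Beta

class BetaPolicy(nn.Module):
    """Policy head that outputs a Beta distribution over actions in (0, 1)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.alpha_head = nn.Linear(hidden, act_dim)
        self.beta_head = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.body(obs)
        # softplus(x) + 1 keeps both concentration parameters above 1,
        # which makes the Beta unimodal and avoids density spikes at the boundaries.
        alpha = nn.functional.softplus(self.alpha_head(h)) + 1.0
        beta = nn.functional.softplus(self.beta_head(h)) + 1.0
        return Beta(alpha, beta)

def sample_action(policy, obs, action_low, action_high):
    dist = policy(obs)
    u = dist.rsample()                                    # sample in (0, 1)
    action = action_low + (action_high - action_low) * u  # rescale to the env's bounds
    # The affine rescaling only adds a constant to the log-prob, so it is omitted here.
    return action, dist.log_prob(u).sum(-1)
```

Unlike a Gaussian, the Beta's support matches the bounded action range, so no probability mass is placed outside the environment's action limits.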