mesjou opened this issue 3 years ago
In a later paper by Hsu et al., 2020, two common design choices in PPO are revisited: (1) the clipped probability ratio used for policy regularization and (2) parameterizing the policy's action space with a continuous Gaussian or a discrete softmax distribution. The authors first identify three failure modes in PPO and then propose replacements for these two designs.
The failure modes are:

1. On continuous action spaces, standard PPO is unstable when rewards vanish outside the bounded support.
2. On discrete action spaces with sparse high rewards, standard PPO often gets stuck at suboptimal actions.
3. The policy is sensitive to initialization when there are locally optimal actions close to the initialization.
Discretizing the action space or using a Beta distribution helps avoid failure modes 1 & 3 associated with the Gaussian policy. Using KL regularization (same motivation as in TRPO) as an alternative to the clipped surrogate objective helps resolve failure modes 1 & 2.
Source: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#ppo
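For concreteness, here is a minimal sketch (not the paper's implementation) of what replacing the clipped ratio with a KL penalty could look like. The tensor names (`log_probs_new`, `log_probs_old`, `advantages`), the fixed coefficient `kl_coef`, and the KL direction are assumptions for illustration only:

```python
import torch

def kl_regularized_policy_loss(log_probs_new, log_probs_old, advantages, kl_coef=1.0):
    """Policy loss that regularizes with a KL penalty instead of ratio clipping."""
    ratio = torch.exp(log_probs_new - log_probs_old)      # pi_new(a|s) / pi_old(a|s)
    surrogate = ratio * advantages                        # unclipped policy-gradient surrogate
    # Sample-based estimate of KL(pi_old || pi_new) using actions drawn from pi_old.
    approx_kl = log_probs_old - log_probs_new
    # Maximize the surrogate minus the KL penalty, i.e. minimize its negative.
    return -(surrogate - kl_coef * approx_kl).mean()
```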
Unstable behavior could be caused by the Gaussian action distribution. Possible solution: switch to a Beta distribution (see the sketch below).
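As an illustration of the Beta-policy idea, here is a minimal PyTorch sketch of a policy head that outputs a Beta distribution over a bounded action range. The network sizes and the names `obs_dim`, `act_dim`, `action_low`, `action_high` are assumptions, not from the paper:

```python
import torch
import torch.nn as nn
from torch.distributions import Beta

class BetaPolicy(nn.Module):
    """Policy head that outputs a Beta distribution over actions in (0, 1)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.alpha_head = nn.Linear(hidden, act_dim)
        self.beta_head = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.body(obs)
        # softplus(x) + 1 keeps both concentration parameters above 1,
        # which makes the Beta unimodal and avoids density spikes at the boundaries.
        alpha = nn.functional.softplus(self.alpha_head(h)) + 1.0
        beta = nn.functional.softplus(self.beta_head(h)) + 1.0
        return Beta(alpha, beta)

def sample_action(policy, obs, action_low, action_high):
    dist = policy(obs)
    u = dist.rsample()                                    # sample in (0, 1)
    action = action_low + (action_high - action_low) * u  # rescale to the env's bounds
    # The affine rescaling only adds a constant to the log-prob, so it is omitted here.
    return action, dist.log_prob(u).sum(-1)
```

Unlike a Gaussian, the Beta's support matches the bounded action range, so no probability mass is placed outside the environment's action limits.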