Closed: olixu closed this issue 2 years ago
PPO uses the actor-critic framework, so the policy is a probability distribution over actions. During training, actions are sampled from that distribution, so performance fluctuates depending on whether better or worse actions happen to be sampled. There is no guarantee that the score after an update is better than before.
During testing, you select the action with the highest probability. In that case, the actor-critic acts deterministically in each state.
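To illustrate the difference between the two modes, here is a minimal sketch in plain Python. It is not code from this repository: the function names `sample_action` and `greedy_action` and the example probabilities are made up for illustration, and a real actor would produce the probabilities from a network.

```python
import random


def sample_action(probs, rng):
    """Training-style selection: draw an action index from the policy distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_action(probs):
    """Test-style selection: always take the most probable action."""
    return max(range(len(probs)), key=lambda a: probs[a])


# Hypothetical action probabilities for a single state.
probs = [0.2, 0.5, 0.3]
rng = random.Random(0)  # seeded only so the sketch is reproducible

# Sampling can return any action with nonzero probability,
# so returns vary from episode to episode.
train_actions = [sample_action(probs, rng) for _ in range(10)]

# Greedy selection returns the same action every time for this state.
test_actions = [greedy_action(probs) for _ in range(10)]

print("sampled:", train_actions)
print("greedy: ", test_actions)
```

With the same policy, the sampled actions vary from run to run, while the greedy actions are identical in every call, which is why test scores look steadier than training scores.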
Sincerely
You can check the April Update, which is a bit more stable. A better advantage estimate such as GAE should also stabilize the training.
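For reference, a rough sketch of Generalized Advantage Estimation (GAE) in plain Python follows. It is an assumption-laden illustration, not the code from the April Update: the function name `compute_gae`, its argument layout, and the default values of `gamma` and `lam` are chosen here for the example.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one rollout.

    rewards[t], values[t], dones[t] describe step t; last_value is the critic's
    estimate for the state after the final step (used for bootstrapping).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk the rollout backwards so each step can reuse the next step's estimate.
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])  # stop bootstrapping at episode ends
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors with decay gamma * lam
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


# Tiny hypothetical rollout: two steps, episode ends at the second step.
print(compute_gae([1.0, 1.0], [0.0, 0.0], [False, True], last_value=0.0))
```

Setting `lam` closer to 0 relies more on the critic (lower variance, more bias), while `lam` closer to 1 approaches the Monte Carlo return (higher variance, less bias). Tuning this trade-off is what can make the updates less noisy.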
When I use other implementations such as stable-baselines3, training does show monotonic improvement, meaning the mean reward gets better after each update.
With your implementation, there seems to be no monotonic guarantee.
Can you help me understand the reason?
Thanks.