Open: kp368 opened this issue 4 years ago
You could find better answers in the docs or in OpenAI SpinningUp.
But TL;DR: yes, this can happen. It may happen because of mathematical inaccuracies in the updates (these should be largely ironed out in stable-baselines), or simply because of the environment/agent setup. E.g. the agent learns to complete the task one way (which gets high reward), but because of exploration it attempts something else that also seems promising. Sometimes this leads it off the good initial track, so the network is then trained only on the bad samples, which can make it forget the initial policy.
If you have no more questions related to stable-baselines, you may close this issue.
Note that the Spinning Up implementation uses the approximate KL divergence as an early-stopping mechanism. This isn't implemented in stable-baselines (TF1, PyTorch).
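For reference, that early stop looks roughly like this inside the PPO epoch loop (a sketch in PyTorch; `policy.log_prob`, `target_kl` and the other names are placeholders, not a stable-baselines API):

```python
import torch

def ppo_update(policy, optimizer, obs, actions, logp_old, advantages,
               clip_range=0.2, target_kl=0.01, n_epochs=10):
    """PPO epoch loop with early stopping on the approximate KL divergence,
    roughly as in OpenAI Spinning Up. target_kl is an illustrative value."""
    for epoch in range(n_epochs):
        logp = policy.log_prob(obs, actions)  # hypothetical policy API
        ratio = torch.exp(logp - logp_old)
        clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
        loss = -torch.min(ratio * advantages, clipped * advantages).mean()

        # Approximate KL between the old and the current policy on this batch.
        approx_kl = (logp_old - logp).mean().item()
        if approx_kl > 1.5 * target_kl:
            # Stop updating on this batch before the policy drifts too far.
            break

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```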
@kp368 You should know that policy gradients only approximate the true gradient, and with small batch sizes they are usually far off. In addition, PPO simply stops gradients from flowing backwards: the gradient of the clip operation for values outside the range is 0. So this is equivalent to using an even smaller effective batch size.
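A toy check of that point (PyTorch; the numbers are arbitrary): a sample whose ratio is already outside the clip range, in the direction the advantage would push it, gets exactly zero gradient and so contributes nothing to the update.

```python
import torch

# Two samples: the first has a ratio well above 1 + clip_range, the second is inside the range.
logp_old = torch.tensor([0.0, 0.0])
logp_new = torch.tensor([0.5, 0.05], requires_grad=True)  # ratios ~1.65 and ~1.05
advantages = torch.tensor([1.0, 1.0])
clip_range = 0.2

ratio = torch.exp(logp_new - logp_old)
clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
loss = -torch.min(ratio * advantages, clipped * advantages).mean()
loss.backward()

print(ratio.detach())  # ~[1.65, 1.05]
print(logp_new.grad)   # first entry is 0: that sample is effectively dropped from the batch
```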
Some good reads:
Apologies if this is a more general RL question. I am using PPO2 with default params and a CNN policy. After about 100K timesteps I get a perfect model. Later on the model suddenly deteriorates, and by the end of training it's useless. I understand that I should always save the best model and not the latest. However, why is this happening? Is it normal/expected? I have observed it with PPO2 in a number of applications.
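For the "save the best model" part, a minimal sketch of what that can look like with stable-baselines' `EvalCallback` (available from v2.10; the env id, frequencies and paths below are placeholders):

```python
import gym

from stable_baselines import PPO2
from stable_baselines.common.callbacks import EvalCallback

# Placeholder image-based env; substitute whatever env you are actually using.
train_env = gym.make("PongNoFrameskip-v4")
eval_env = gym.make("PongNoFrameskip-v4")

# Evaluate every 10k steps and keep the snapshot with the best mean eval reward.
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./best_model/",
    eval_freq=10_000,
    n_eval_episodes=5,
    deterministic=True,
)

model = PPO2("CnnPolicy", train_env, verbose=1)
model.learn(total_timesteps=1_000_000, callback=eval_callback)

# Later: load the best snapshot rather than the final (possibly degraded) one.
best_model = PPO2.load("./best_model/best_model")
```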