nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

Monotonic improvement of PPO #37

Closed olixu closed 2 years ago

olixu commented 3 years ago

When I use other implementations such as stable-baselines3, there does seem to be monotonic improvement, i.e. the mean of the rewards gets better after each update.

With your implementation, there seems to be no such monotonic guarantee.

Can you help me understand the reason?

Thanks.

gianlucadest commented 3 years ago

PPO uses the actor-critic framework, so the policy is a probability distribution over actions. During training, actions are sampled from that distribution, so performance fluctuates depending on whether better or worse actions happen to be sampled. There is no guarantee that the score after an update is better than before.

During testing, you instead select the action with the highest probability, so the actor acts deterministically in each state.
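For illustration, here is a minimal sketch of the difference described above, using a hypothetical actor network (not this repository's code): sampling from the policy distribution during training versus picking the most probable action at test time.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical actor network: 4-dim state -> logits over 2 discrete actions
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

state = torch.rand(4)
dist = Categorical(logits=actor(state))

# Training: sample an action, so returns fluctuate between rollouts/updates
train_action = dist.sample()

# Testing: act greedily on the most probable action (deterministic per state)
test_action = torch.argmax(dist.probs)
```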

Sincerely

nikhilbarhate99 commented 2 years ago

You can check the April update, which is a bit more stable. A better advantage estimate such as GAE (Generalized Advantage Estimation) should also stabilize the training.
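For reference, a minimal sketch of GAE; the `compute_gae` helper below is illustrative and not part of this repository's API.

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one rollout.

    rewards, dones: tensors of shape [T]
    values: tensor of shape [T + 1] (includes a bootstrap value for the final state)
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Exponentially weighted sum of future TD residuals
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```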