ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

max_grad_norm and use_clipped_value_loss #160

Closed seungjaeryanlee closed 5 years ago

seungjaeryanlee commented 5 years ago

Hello! I was documenting your PPO code in algo/ppo.py to improve my understanding of the algorithm, and I got confused about max_grad_norm and use_clipped_value_loss.

If I am understanding this correctly, max_grad_norm is passed to nn.utils.clip_grad_norm_() to cap the norm of the gradients, and use_clipped_value_loss toggles between two value-loss formulations. However, I could not find the relevant details in the paper "Proximal Policy Optimization Algorithms". If either is explicitly mentioned there, would you please point it out for me?
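
To check my understanding of the gradient part, here is roughly how I picture max_grad_norm being used in the update step (just an illustrative sketch with a toy model and made-up values, not the repo's actual code):

```python
import torch
import torch.nn as nn

# Toy model and optimizer, only to illustrate the clipping call.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
max_grad_norm = 0.5  # illustrative value

loss = model(torch.randn(8, 4)).pow(2).mean()

optimizer.zero_grad()
loss.backward()
# Rescales all gradients in place so their global L2 norm is at most max_grad_norm.
nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
```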

For L^VF, the paper seems to use the simple squared-error loss, which corresponds to use_clipped_value_loss=False, but I could not find anything about the case use_clipped_value_loss=True. Is this a trick that is not mentioned in the paper?

Thank you in advance for your help. Happy holidays!

ikostrikov commented 5 years ago

They introduced the new loss in the implementation of PPO2: https://github.com/openai/baselines/blob/master/baselines/ppo2/model.py#L63
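
Roughly, the clipped variant looks like this (a sketch only; the tensor and argument names here are illustrative, not the exact ones in algo/ppo.py):

```python
import torch

def ppo_value_loss(values, old_value_preds, returns, clip_param,
                   use_clipped_value_loss=True):
    """Sketch of the two value-loss variants used in PPO-style updates."""
    if use_clipped_value_loss:
        # Keep the new value prediction within clip_param of the old one,
        # then take the pessimistic (larger) of the clipped/unclipped errors,
        # mirroring the clipped surrogate objective for the policy.
        value_pred_clipped = old_value_preds + \
            (values - old_value_preds).clamp(-clip_param, clip_param)
        value_losses = (values - returns).pow(2)
        value_losses_clipped = (value_pred_clipped - returns).pow(2)
        return 0.5 * torch.max(value_losses, value_losses_clipped).mean()
    # Plain squared-error loss, as written in the PPO paper.
    return 0.5 * (returns - values).pow(2).mean()
```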

Also see the gradient norm clipping here: https://github.com/openai/baselines/blob/master/baselines/ppo2/model.py#L102

seungjaeryanlee commented 5 years ago

Thank you for the links! I see how they correspond to those parts of PPO2 in OpenAI Baselines.

It's unfortunate that these changes are not written up in any paper. I guess I will have to read the openai/baselines code as well.