nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

Question about PPO_continuous.py #23

Closed HeegerGao closed 4 years ago

HeegerGao commented 4 years ago

Hello,

I am new to reinforcement learning. I see that you set 'action_std' as a constant hyperparameter in PPO_continuous.py, so only the 'action_mean' is learned in the code. I don't know whether this is a common practice for continuous action space problems or something specific to your method; I would have thought 'action_std' should also be learned during training. Could you give me some references explaining why you wrote it this way? Thank you very much!

nikhilbarhate99 commented 4 years ago

Yes, you could learn the distribution's std along with the action means, but it is not a necessary condition. When the std is learned, it can collapse prematurely, and the agent will then stop exploring the environment properly. There are multiple ways of implementing deep RL algorithms; you are free to choose any implementation that works for your problem (at least as of now, there is no standard procedure to follow).
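
As a rough illustration (not the exact code from PPO_continuous.py; the class name, layer sizes, and the `learn_std` flag below are placeholders), the two options look something like this: keep a fixed `action_std` as a hyperparameter, or register a state-independent `log_std` as a learnable parameter next to the mean network.

```python
import math
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Illustrative Gaussian policy head; names and sizes are placeholders,
    not the repo's actual implementation."""

    def __init__(self, state_dim, action_dim, action_std=0.5, learn_std=False):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )
        log_std = torch.full((action_dim,), math.log(action_std))
        if learn_std:
            # Learnable, state-independent log-std, optimized with the policy.
            # This can shrink prematurely during training, reducing exploration.
            self.log_std = nn.Parameter(log_std)
        else:
            # Fixed std kept as a constant hyperparameter (the approach in this repo).
            self.register_buffer("log_std", log_std)

    def forward(self, state):
        mean = self.actor(state)
        std = self.log_std.exp().expand_as(mean)
        return Normal(mean, std)

# Usage sketch: sample an action and its log-prob for the PPO update.
policy = GaussianPolicy(state_dim=8, action_dim=2, action_std=0.5, learn_std=True)
dist = policy(torch.randn(1, 8))
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)
```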

HeegerGao commented 4 years ago


OK, I see. Thank you very much.