nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

How are you ensuring that actions are in the range (-1, 1) after sampling in continuous action spaces? #47

Closed PhanindraParashar closed 2 years ago

PhanindraParashar commented 2 years ago

I was going through the ppo.py code.

Assume I have 2 continuous actions to predict, so action_dims = 2. Let the standard deviation be initialized to 0.5, so the variance is 0.25.

I found the following: the mean of the action is predicted by the actor net.

Say it is torch.tensor([0.7, 0.9]).

Then you use the variance to sample the actions (sketched below).

If the sample draws actions above 1 or below -1, outside the permitted action range, what do we do?

Do we sample again?

Do we clip?

Is it a good idea to apply a tanh activation to the sampled action? (But that would mess with the actor network.)
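
For concreteness, here is a minimal sketch of the sampling step I am describing (the names and values are illustrative, not the exact PPO.py code):

```python
import torch
from torch.distributions import MultivariateNormal

action_dims = 2
action_std = 0.5
action_var = torch.full((action_dims,), action_std ** 2)  # variance = 0.25

# stand-in for the mean predicted by the actor network
action_mean = torch.tensor([0.7, 0.9])

# diagonal covariance matrix built from the fixed variance
cov_mat = torch.diag(action_var)
dist = MultivariateNormal(action_mean, cov_mat)

action = dist.sample()
# with a mean of 0.9 and a std of 0.5, a component above 1 is quite likely,
# so `action` is not guaranteed to stay inside (-1, 1)
```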

nikhilbarhate99 commented 2 years ago

Gym automatically clips the input. For continuous control envs, it makes sense to use a tanh activation as long as the action space of the env is between -1 and 1 and we are decreasing the variance.
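
A rough sketch of such an actor, ending in a Tanh so the predicted mean stays bounded (layer sizes here are illustrative):

```python
import torch.nn as nn

state_dim, action_dim = 8, 2  # illustrative sizes

# the final Tanh keeps the predicted mean in (-1, 1); a sampled action
# can still leave that range unless the variance is small (or it is clipped)
actor = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, action_dim),
    nn.Tanh(),
)
```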

For custom environments, most people clip the tensor.
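
A minimal sketch of that clipping, assuming a (-1, 1) Box action space (env.action_space.low / env.action_space.high would give the real bounds):

```python
import torch

low, high = -1.0, 1.0  # e.g. from env.action_space.low / env.action_space.high

action = torch.tensor([1.3, -0.2])                # raw sample from the Gaussian
clipped = torch.clamp(action, min=low, max=high)  # -> tensor([1.0000, -0.2000])

# the log-prob used in the PPO update is typically taken from the
# unclipped sample, so only the action sent to the env is clamped
```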