nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

How are you ensuring that actions are in the range (-1, 1) after sampling in continuous action spaces? #47

Closed PhanindraParashar closed 2 years ago

PhanindraParashar commented 2 years ago

I was going through the ppo.py code.

Assume I have 2 continuous actions to predict, so action_dims = 2. Let the standard deviation be initialized to 0.5, so the variance is 0.25.

I found the following: the mean of the action is predicted by the actor net.

Say it is torch.tensor([0.7, 0.9]).

Then you use the variance to sample the actions (sketched below).

If the sample draws actions above 1 or below -1, outside the permitted action range, what do we do?

Do we sample again?

Do we clip?

Is it a good idea to apply a tanh activation to the sampled action? (But that would mess with the actor network.)
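
For concreteness, here is a minimal sketch of the sampling step I am describing (the names and values are illustrative, not the exact PPO.py code):

```python
import torch
from torch.distributions import MultivariateNormal

action_dims = 2
action_std = 0.5
action_var = torch.full((action_dims,), action_std ** 2)  # variance = 0.25

# stand-in for the mean predicted by the actor network
action_mean = torch.tensor([0.7, 0.9])

# diagonal covariance matrix built from the fixed variance
cov_mat = torch.diag(action_var)
dist = MultivariateNormal(action_mean, cov_mat)

action = dist.sample()
# with a mean of 0.9 and a std of 0.5, a component above 1 is quite likely,
# so `action` is not guaranteed to stay inside (-1, 1)
```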

nikhilbarhate99 commented 2 years ago

Gym automatically clips the input. For continuous control envs, it makes sense to use a tanh activation as long as the action space of the env is between -1 and 1 and we are decreasing the variance.
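
A rough sketch of such an actor, ending in a Tanh so the predicted mean stays bounded (layer sizes here are illustrative):

```python
import torch.nn as nn

state_dim, action_dim = 8, 2  # illustrative sizes

# the final Tanh keeps the predicted mean in (-1, 1); a sampled action
# can still leave that range unless the variance is small (or it is clipped)
actor = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, action_dim),
    nn.Tanh(),
)
```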

For custom environments, most people clip the tensor.
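
A minimal sketch of that clipping, assuming a (-1, 1) Box action space (env.action_space.low / env.action_space.high would give the real bounds):

```python
import torch

low, high = -1.0, 1.0  # e.g. from env.action_space.low / env.action_space.high

action = torch.tensor([1.3, -0.2])                # raw sample from the Gaussian
clipped = torch.clamp(action, min=low, max=high)  # -> tensor([1.0000, -0.2000])

# the log-prob used in the PPO update is typically taken from the
# unclipped sample, so only the action sent to the env is clamped
```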