Closed nyck33 closed 5 years ago
Hey, the action distribution is assumed to be a multivariate normal distribution with a diagonal covariance matrix. So the last layer outputs the mean of every variable in the action space (i.e. the mean vector), and the covariance matrix is just the diagonal matrix whose entries are the square of a fixed standard deviation (the hyperparameter `action_std`). From the mean vector and the covariance matrix we can construct a multivariate normal distribution using the standard PyTorch class `torch.distributions.MultivariateNormal`.
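A minimal sketch of that construction, assuming a 4-dimensional action space and using a placeholder tensor where the actor's output (`self.actor(state)`) would normally go:

```python
import torch
from torch.distributions import MultivariateNormal

action_dim = 4        # size of the continuous action space (assumed for the sketch)
action_std = 0.5      # fixed standard-deviation hyperparameter (action_std)

# Stand-in for action_mean = self.actor(state); the real mean comes from the network.
action_mean = torch.zeros(action_dim)

# Diagonal covariance matrix: variance = action_std**2 on the diagonal.
cov_mat = torch.diag(torch.full((action_dim,), action_std ** 2))

dist = MultivariateNormal(action_mean, covariance_matrix=cov_mat)
action = dist.sample()              # one sampled action vector
log_prob = dist.log_prob(action)    # scalar log-probability of that action
```

Because the covariance is diagonal, each action dimension is sampled independently with the same standard deviation.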
Regarding `self.policy`: since we update `self.policy` for `K_epochs` (i.e. K times) in one PPO update, we keep `self.policy_old` as a copy of the old network weights to compute the ratios. I think you should refer to the original PPO paper for more detail.
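The role of the frozen copy can be sketched like this: the old log-probabilities are computed once and held fixed, while the ratio against the current policy is recomputed at every epoch. The tensor values below are illustrative stand-ins, not values from the repository:

```python
import torch

eps_clip = 0.2
advantages = torch.tensor([1.0, -0.5])       # estimated advantages (placeholder values)
old_logprobs = torch.tensor([-1.2, -0.8])    # from self.policy_old: fixed across the K epochs
new_logprobs = torch.tensor([-1.0, -0.9])    # from self.policy: recomputed each epoch

# Probability ratio r = pi_new(a|s) / pi_old(a|s), computed in log space.
ratios = torch.exp(new_logprobs - old_logprobs)

# PPO clipped surrogate objective (maximized, so the loss is its negative).
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1 - eps_clip, 1 + eps_clip) * advantages
loss = -torch.min(surr1, surr2).mean()
```

Without the frozen `self.policy_old`, the ratio would collapse to 1 on the first epoch and the clipping in the surrogate objective would have nothing to act on.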
I can see that your Actor network has a tanh activation on the output layer, but then I am totally lost as to what you do here, especially `action_mean = self.actor(state)`. Does this mean you have one output node and assume that the output is the mean of a Gaussian distribution over the action space? Then similar code appears here:
Also, is `self.policy` like a dummy actor-critic network that you use just to get updated parameters to load into `self.policy_old`? I know this isn't Stack Overflow, but if you can look at my implementation and let me know how I can adapt it for a continuous action space, that'd be great: my PPO discrete action space