muupan opened 8 years ago
Thanks for the information. I haven't tried it yet, but the paper provides some information, quoted below. Did you find it insufficient?
µ is modeled by a linear layer and σ² by a SoftPlus operation, log(1 + exp(x)), as the activation computed as a function of the output of a linear layer.
we used a cost on the differential entropy of the normal distribution defined by the output of the actor network, −1/2 (log(2πσ²) + 1); we used a constant multiplier of 10⁻⁴ for this cost across all of the tasks examined.
It is a bit vague to me, so let me try to summarize so I can be corrected: we need a fully connected layer outputting 2 values, apply a softplus to the second value (so that the variance is > 0, I suppose), sample from this Gaussian (using numpy.random.randn() * sigma + mu?) in each dimension of the action space, and finally use −1/2 (log(2πσ²) + 1) as the log prob instead of log(softmax)?
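To make that concrete for myself, here is a minimal numpy sketch of my reading of those steps (action_dim and the fake layer output are made up by me; this is only how I understand the recipe, not code from the paper):

import numpy as np

# toy numbers, just to illustrate the steps above
action_dim = 2
fc_out = np.random.randn(2 * action_dim).astype(np.float32)  # pretend linear layer output

mu = fc_out[:action_dim]                        # mean: raw linear output
sigma2 = np.log1p(np.exp(fc_out[action_dim:]))  # variance: softplus keeps it > 0

# sample one action per dimension
action = mu + np.sqrt(sigma2) * np.random.randn(action_dim)

# log probability of the sampled action under N(mu, sigma2)
log_prob = np.sum(-0.5 * (np.log(2 * np.pi * sigma2) + (action - mu) ** 2 / sigma2))

# differential entropy of the Gaussian; the paper's cost is its negative, scaled by 1e-4
entropy = np.sum(0.5 * (np.log(2 * np.pi * sigma2) + 1))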
hi @muupan, do you have a plan to implement continuous control? :)
Here is an example:
import numpy as np
import chainer
import chainer.functions as F
# PolicyOutput and the cached_property decorator come from the same module
# as the existing SoftmaxPolicyOutput


class GaussianPolicyOutput(PolicyOutput):
    """Diagonal Gaussian policy output: mean from a tanh-squashed linear layer,
    variance from a softplus, sampled independently per action dimension."""

    def __init__(self, logits_mu, logits_var):
        self.logits_mu = logits_mu
        self.logits_var = logits_var

    @cached_property
    def action_indices(self):
        # Same name as in SoftmaxPolicyOutput so that a3c.py can call it
        # without changes; here, however, it samples from the Gaussians.
        mu, sigma2 = self.activation
        return np.random.normal(
            mu.data, np.sqrt(sigma2.data)).astype(np.float32)

    @cached_property
    def activation(self):
        mu = F.tanh(self.logits_mu)           # mean; tanh keeps it in [-1, 1]
        sigma2 = F.softplus(self.logits_var)  # variance; softplus keeps it positive
        return mu, sigma2

    @cached_property
    def sampled_actions_log_probs(self):
        # Chainer variable with the log probability of the sampled action.
        mu, sigma2 = self.activation
        action = self.action_indices
        # gaussian_nll returns the negative log likelihood, hence the minus sign
        return -F.gaussian_nll(chainer.Variable(action), mu, F.log(sigma2))

    @cached_property
    def entropy(self):
        # Differential entropy 0.5 * (log(2*pi*sigma^2) + 1), summed over action
        # dimensions. Kept positive to match the sign of SoftmaxPolicyOutput.entropy
        # (the paper's entropy cost is the negative of this, weighted by 1e-4),
        # and computed with chainer functions so gradients flow into sigma2.
        mu, sigma2 = self.activation
        return F.sum(0.5 * (F.log(2 * np.pi * sigma2) + 1))
I haven't tested it yet, so feel free to test/correct it.
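For what it's worth, this is roughly how I would expect it to be wired up in the actor network (the linear layers, sizes, and names below are placeholders of mine, equally untested):

import numpy as np
import chainer
import chainer.links as L

# placeholder head on top of whatever feature extractor is used
n_features, action_dim = 200, 3
fc_mu = L.Linear(n_features, action_dim)
fc_var = L.Linear(n_features, action_dim)

features = chainer.Variable(np.random.randn(1, n_features).astype(np.float32))
pout = GaussianPolicyOutput(fc_mu(features), fc_var(features))

action = pout.action_indices               # numpy array, one sample per action dimension
log_prob = pout.sampled_actions_log_probs  # chainer variable, goes into the policy gradient
entropy = pout.entropy                     # chainer variable, for the entropy term

In the A3C loss, the log prob would then be multiplied by the advantage, and the entropy term weighted by the paper's 10⁻⁴ multiplier.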
Thanks! @etienne87
I'm working on an LSTM implementation (neon-based) for the continuous case; sadly, I failed to get any response from the authors.
It is the variance and the entropy that puzzle me. Any thoughts on how they are implemented code-wise? Currently it shows no signs of convergence on the MuJoCo domain for me, and most likely there are errors in the learnt variance of the Gaussian policy.