muupan / async-rl

Replicating "Asynchronous Methods for Deep Reinforcement Learning" (http://arxiv.org/abs/1602.01783)
MIT License

Continuous control #3

Open muupan opened 8 years ago

igrekun commented 8 years ago

I'm working on an LSTM implementation (neon-based) for the continuous case; sadly, I failed to get any response from the authors.

It is the variance and entropy that puzzle me. Any thoughts on how they are implemented code-wise? Currently it shows no signs of convergence on the MuJoCo domain for me, and most likely there are errors in the learnt variance of the Gaussian policy.

muupan commented 8 years ago

Thanks for the information. I haven't tried it yet, but the paper provides some information, quoted below. Did you find it insufficient?

µ is modeled by a linear layer and σ² by a SoftPlus operation, log(1 + exp(x)), as the activation computed as a function of the output of a linear layer.

we used a cost on the differential entropy of the normal distribution defined by the output of the actor network, −1/2 (log(2πσ²) + 1); we used a constant multiplier of 10⁻⁴ for this cost across all of the tasks examined.
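
For concreteness, here is a minimal Chainer-style sketch of that parameterization and entropy term (the layer names and sizes are illustrative, not from the paper or this repo):

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

# Assumed sizes, for illustration only.
n_hidden, n_actions = 128, 2
head_mu = L.Linear(n_hidden, n_actions)   # mean head: plain linear output
head_var = L.Linear(n_hidden, n_actions)  # pre-softplus head for the variance

# Dummy feature vector standing in for the network's hidden state.
h = chainer.Variable(np.zeros((1, n_hidden), dtype=np.float32))

mu = head_mu(h)                   # mu is the raw linear output
sigma2 = F.softplus(head_var(h))  # sigma^2 = log(1 + exp(x)) > 0

# Differential entropy of the Gaussian, 1/2 (log(2*pi*sigma^2) + 1), summed
# over action dimensions; the paper weights this term with a constant of 1e-4.
entropy = F.sum(0.5 * (F.log(2 * np.pi * sigma2) + 1))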

etienne87 commented 8 years ago

It is a bit vague for me, so I will try to summarize in order to be corrected: we need a fully connected layer outputting two values, with a softplus operation applied to the second one (so that the variance is > 0, I suppose); we sample from this Gaussian (numpy.random.randn() * sigma + mu?) in each dimension of the action space; and finally we use −1/2 (log(2πσ²) + 1) as the log-prob instead of log(softmax)?
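
For what it's worth, a small NumPy sketch of that sampling step with made-up values; note that −1/2 (log(2πσ²) + 1) is the entropy term, while the log-probability of the sampled action comes from the Gaussian density itself:

import numpy as np

mu = np.array([0.1, -0.3], dtype=np.float32)     # example per-dimension means
sigma2 = np.array([0.5, 0.2], dtype=np.float32)  # example per-dimension variances

# Sample one action per dimension: a ~ N(mu, sigma^2)
action = mu + np.sqrt(sigma2) * np.random.randn(*mu.shape)

# Log-probability of the sampled action under the diagonal Gaussian
log_prob = np.sum(-0.5 * (np.log(2 * np.pi * sigma2) + (action - mu) ** 2 / sigma2))

# Differential entropy (the term used for the entropy cost, not the log-prob)
entropy = np.sum(0.5 * (np.log(2 * np.pi * sigma2) + 1))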

loofahcus commented 7 years ago

Hi @muupan, do you have a plan to implement continuous control? :)

etienne87 commented 7 years ago

Here is an example:

import numpy as np
import chainer
import chainer.functions as F
from cached_property import cached_property

from policy_output import PolicyOutput  # base class from this repo (adjust the import if the module name differs)


class GaussianPolicyOutput(PolicyOutput):
    """Policy output for a diagonal Gaussian policy (continuous actions)."""

    def __init__(self, logits_mu, logits_var):
        self.logits_mu = logits_mu
        self.logits_var = logits_var

    @cached_property
    def action_indices(self):
        # Same name as in SoftmaxPolicyOutput so that a3c.py can call it
        # unchanged; here it samples from the Gaussian instead of a softmax.
        mu, sigma2 = self.activation
        return np.random.normal(
            mu.data, np.sqrt(sigma2.data)).astype(np.float32)

    @cached_property
    def activation(self):
        mu = F.tanh(self.logits_mu)           # mean squashed into [-1, 1]
        sigma2 = F.softplus(self.logits_var)  # variance kept positive
        return mu, sigma2

    @cached_property
    def sampled_actions_log_probs(self):
        # Log-probability of the sampled action as a chainer variable;
        # gaussian_nll expects log(sigma^2), hence F.log(sigma2).
        mu, sigma2 = self.activation
        action = self.action_indices
        return -F.gaussian_nll(chainer.Variable(action), mu, F.log(sigma2))

    @cached_property
    def entropy(self):
        # Differential entropy of the Gaussian, 0.5 * (log(2*pi*sigma^2) + 1),
        # summed over action dimensions and kept as a chainer variable so
        # gradients can flow through the entropy bonus.
        _, sigma2 = self.activation
        return F.sum(0.5 * (F.log(2 * np.pi * sigma2) + 1))

I haven't tested it yet, so feel free to test/correct.
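
A rough usage sketch of the class above, assuming two hypothetical linear heads feeding it (layer names and sizes are illustrative, not from the repo):

import numpy as np
import chainer
import chainer.links as L

# Hypothetical two-head policy: one linear layer per Gaussian parameter.
n_hidden, n_actions = 128, 2
head_mu = L.Linear(n_hidden, n_actions)
head_var = L.Linear(n_hidden, n_actions)

features = chainer.Variable(np.random.randn(1, n_hidden).astype(np.float32))
pout = GaussianPolicyOutput(head_mu(features), head_var(features))

action = pout.action_indices               # sampled continuous action, shape (1, n_actions)
log_prob = pout.sampled_actions_log_probs  # chainer variable for the policy-gradient term
entropy = pout.entropy                     # entropy bonus term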

loofahcus commented 7 years ago

Thanks! @etienne87