ikostrikov / pytorch-a3c

PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning".
MIT License

How to modify code for continuous actions? #5

Closed AjayTalati closed 7 years ago

AjayTalati commented 7 years ago

Hello :)

I was wondering how to modify the code for continuous actions, so that, for example, it could be compared with your NAF implementation on the OpenAI Gym pendulum:

env = gym.envs.make("Pendulum-v0")

Here's how far I got,

import torch.nn as nn

lstm_in = 3      # Pendulum-v0 observation size
lstm_out = 256

class ActorCritic(nn.Module):

    def __init__(self, lstm_in):
        super(ActorCritic, self).__init__()

        self.lstm = nn.LSTMCell(lstm_in, lstm_out)

        # Gaussian policy head: mean and (pre-softplus) standard deviation
        self.actor_mu = nn.Linear(lstm_out, 1)
        self.actor_sigma = nn.Linear(lstm_out, 1)

        self.critic_linear = nn.Linear(lstm_out, 1)

        self.train()

    def forward(self, inputs):
        x, (hx, cx) = inputs

        # might need some ReLUs here?
        hx, cx = self.lstm(x, (hx, cx))
        x = hx

        return self.critic_linear(x), self.actor_mu(x), self.actor_sigma(x), (hx, cx)

and the code in main now looks like,

env = gym.envs.make("Pendulum-v0")
lstm_in = 3
global_model = ActorCritic(lstm_in)
global_model.share_memory()
local_model = ActorCritic(lstm_in)
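
For reference, a single forward pass with this model could be exercised roughly as below. This is only a sketch: it assumes env, local_model, and lstm_out are defined as in the snippets above.

import torch
from torch.autograd import Variable

# initialise the LSTM hidden and cell state for a batch of one
hx = Variable(torch.zeros(1, lstm_out))
cx = Variable(torch.zeros(1, lstm_out))

# Pendulum-v0 observations have shape (3,); add a batch dimension
state = torch.from_numpy(env.reset()).float().unsqueeze(0)

value, mu, sigma, (hx, cx) = local_model((Variable(state), (hx, cx)))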

It breaks with the following changes in train.py, though:

env = gym.envs.make("Pendulum-v0")
s0 = env.reset()
done = True
state = torch.from_numpy(s0).float().unsqueeze(0) 
value, mu, sigma, (hx, cx) = local_model((Variable(state), (hx, cx)))

#mu = mu.clamp(-1, 1) # constrain to sensible values
Softplus = nn.Softplus()
sigma = Softplus(sigma) #+ 1e-5 # constrain to sensible values
normal_dist = torch.normal(mu, sigma)  # note: this draws a sample, it does not return a distribution object

prob = normal_dist
# TODO - what goes here?
#nnlog = nn.Log() 
#log_prob = nnlog(prob)

#log_prob = F.log_softmax(prob)
#prob = F.softmax(logit)
#log_prob = F.log_softmax(logit)
entropy = -(log_prob * prob).sum(1)

action = prob.data
action = Variable(action)
log_prob = log_prob.gather(1, action)

#action=[0,]
state, reward, done, _ = env.step(action.data)

Any idea how to get it working?
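
For what it's worth, one way the missing pieces could be filled in is sketched below. This is only a rough sketch, not this repository's code: it assumes mu and sigma are the raw outputs of the model above, and that env and Softplus are defined as in the previous snippet.

import math
import torch
from torch.autograd import Variable

sigma = Softplus(sigma) + 1e-5          # keep the standard deviation strictly positive
noise = Variable(torch.randn(mu.size()))
action = (mu + sigma * noise).detach()  # sample; no gradient flows through the sample itself

# log-density of the sampled action under N(mu, sigma^2)
log_prob = -0.5 * ((action - mu) / sigma).pow(2) - torch.log(sigma) - 0.5 * math.log(2 * math.pi)

# differential entropy of a Gaussian: 0.5 * log(2 * pi * e * sigma^2)
entropy = 0.5 * torch.log(2 * math.pi * sigma.pow(2)) + 0.5

# Pendulum expects an action of shape (1,)
state, reward, done, _ = env.step(action.data.numpy()[0])

The entropy term would then go into the policy loss as a bonus, in the same way the discrete version uses its entropy term.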

Thanks a lot for your help,

Best regards,

Ajay

Reference: the DeepMind A3C paper, https://arxiv.org/pdf/1602.01783.pdf, Section 9, "Continuous Action Control Using the MuJoCo Physics Simulator".

[Diagram: distributed A3C architecture with continuous actions, from https://github.com/deeplearninc/relaax#distributed-a3c-architecture-with-continuous-actions]

AjayTalati commented 7 years ago

Got nearly working code in this thread,

https://discuss.pytorch.org/t/continuous-action-a3c/1033

AjayTalati commented 7 years ago

Hi, any chance you could give me some advice? I'm still stuck trying to get this to work. Here's a gist of my code:

https://gist.github.com/AjayTalati/184fec867380f6fa22b9aa0951143dec

I keep getting this error,

File "main_single.py", line 174, in <module>
value_loss = value_loss + advantage.pow(2)
AttributeError: 'numpy.ndarray' object has no attribute 'pow'

I don't understand why advantage has become a numpy array instead of a torch tensor; this never occurred with the discrete-action implementation.

Any ideas what I've got wrong?
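
A guess at the cause (an assumption, since the full gist is not reproduced here): when the action is passed to env.step as an array, Pendulum can hand the reward back as a numpy value, which then propagates into the running return R, so the advantage ends up as a numpy array rather than a torch tensor. Collapsing the reward to a plain Python float right after env.step keeps the update in torch; scalar_reward below is just a hypothetical helper for illustration.

import numpy as np

def scalar_reward(reward):
    # Pendulum can return the reward as a 0-d or length-1 numpy array when the
    # action comes in as an array; collapse it to a plain Python float so the
    # return R, and hence the advantage, stays a torch tensor.
    return float(np.asarray(reward).flatten()[0])

# usage inside the rollout loop (a sketch):
# state, reward, done, _ = env.step(action.data.numpy()[0])
# reward = scalar_reward(reward)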

Thanks a lot for your help,

Best,

Ajay

AjayTalati commented 7 years ago

Closing this, as continuous functions are just a pain to approximate?

ikostrikov commented 7 years ago

I will add continuous control later. I don't have time at the moment.

AjayTalati commented 7 years ago

OK, cool, take your time :+1: I don't mean this as an A3C-specific comment, or anything specific about your implementation.

It's just a general observation (and perhaps a provable fact) that I've found discrete functions easier to approximate than continuous ones.

In terms of simple MLP theory, this paper by Mhaskar and Poggio is nice:

Learning Real and Boolean Functions: When Is Deep Better Than Shallow