iffiX / machin

A reinforcement learning library (framework) designed for PyTorch, implementing DQN, DDPG, A2C, PPO, SAC, MADDPG, A3C, APEX, IMPALA, ...
MIT License

Multi Discrete Action Spaces #20

Closed · joaomatoscf closed this issue 3 years ago

joaomatoscf commented 3 years ago

Hello,

Does machin support multi-discrete action spaces (i.e. two different actions in the same time step)? I've looked through the documentation but cannot find anything related to that.

João

iffiX commented 3 years ago

It depends on your model. machin doesn't enforce any restriction on your actor output, so you could use a mixed action space (discrete + continuous), a multi-action space, a parameterized action space, etc.

If you provide more details I can help with your implementation.
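
For illustration, here is a minimal, PyTorch-only sketch of a mixed discrete + continuous head; the class and layer names (MixedActor, discrete_head, mu_head) are purely illustrative and not part of machin's API:

import torch as t
import torch.nn as nn
from torch.distributions import Categorical, Normal

class MixedActor(nn.Module):
    def __init__(self, state_dim, discrete_num, continuous_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 16), nn.ReLU())
        self.discrete_head = nn.Linear(16, discrete_num)
        self.mu_head = nn.Linear(16, continuous_dim)
        self.log_sigma = nn.Parameter(t.zeros(continuous_dim))

    def forward(self, state):
        h = self.body(state)
        d_dist = Categorical(logits=self.discrete_head(h))
        c_dist = Normal(self.mu_head(h), self.log_sigma.exp())
        d_act = d_dist.sample()   # shape [batch]
        c_act = c_dist.sample()   # shape [batch, continuous_dim]
        # joint log probability of the independent discrete and continuous parts
        log_prob = d_dist.log_prob(d_act) + c_dist.log_prob(c_act).sum(dim=1)
        return d_act, c_act, log_prob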

joaomatoscf commented 3 years ago

Thank you for your answer. If I want to make this code work with, for example, a MultiDiscrete([2, 2]) action space, what do I need to change?

Code (based on machin examples/tutorials):

max_episodes = 50
max_steps = 10000

observe_dim = env.observation_space.shape[0]
action_num = env.action_space.n

# model definition

class Actor(nn.Module):
    def __init__(self, state_dim, action_num):
        super(Actor, self).__init__()

        self.fc1 = nn.Linear(state_dim, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, action_num)

    def forward(self, state, action=None):
        a = t.relu(self.fc1(state))
        a = t.relu(self.fc2(a))
        probs = t.softmax(self.fc3(a), dim=1)
        dist = Categorical(probs=probs)
        act = (action
               if action is not None
               else dist.sample())
        act_entropy = dist.entropy()
        act_log_prob = dist.log_prob(act.flatten())
        return act, act_log_prob, act_entropy

class Critic(nn.Module):
    def __init__(self, state_dim):
        super(Critic, self).__init__()

        self.fc1 = nn.Linear(state_dim, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, 1)

    def forward(self, state):
        v = t.relu(self.fc1(state))
        v = t.relu(self.fc2(v))
        v = self.fc3(v)
        return v
iffiX commented 3 years ago

If your actions are independent and sampled from the same distribution, you can reuse the same probability parameters for the Categorical distribution in Actor, then sample act1, act2, act3, act4 from that distribution, and finally return these four actions as a tensor of shape [2, 2] together with the sum of their log probabilities.

If your actions are independent and sampled from different distributions, then you need 4 output heads self.fc3_1, self.fc3_2, self.fc3_3, self.fc3_4, one for each categorical distribution, and then make the same modification as above.

I will cite my answer on the PyTorch forum as a reference here. (Note that when I say multinomial, what I really mean is treating each trial of the multinomial distribution as a categorical distribution, which is equivalent to what is described above.)
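
As a rough sketch of the second option (independent heads) for MultiDiscrete([2, 2]): the names (MultiHeadActor, action_nums) and the choice to sum the per-head entropies are illustrative assumptions, not code taken from machin's examples:

import torch as t
import torch.nn as nn
from torch.distributions import Categorical

class MultiHeadActor(nn.Module):
    def __init__(self, state_dim, action_nums):
        # action_nums: e.g. [2, 2] for MultiDiscrete([2, 2])
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 16)
        self.fc2 = nn.Linear(16, 16)
        self.heads = nn.ModuleList(nn.Linear(16, n) for n in action_nums)

    def forward(self, state, action=None):
        a = t.relu(self.fc1(state))
        a = t.relu(self.fc2(a))
        acts, log_probs, entropies = [], [], []
        for i, head in enumerate(self.heads):
            dist = Categorical(probs=t.softmax(head(a), dim=1))
            act_i = action[:, i] if action is not None else dist.sample()
            acts.append(act_i.view(-1, 1))
            log_probs.append(dist.log_prob(act_i.flatten()))
            entropies.append(dist.entropy())
        # actions stacked column-wise: shape [batch, num_sub_actions];
        # log probabilities of independent sub-actions are summed
        return (t.cat(acts, dim=1),
                t.stack(log_probs, dim=0).sum(dim=0),
                t.stack(entropies, dim=0).sum(dim=0))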

joaomatoscf commented 3 years ago

Thank you for your detailed explanation. I implemented it, and now I get an error when storing the episode: ValueError: Key "action" of transition major attribute "action" has invalid batch size 2.

Below is what the modifications I made look like:

class Actor(nn.Module):
    def __init__(self, state_dim, action_num):
        super(Actor, self).__init__()

        self.fc1 = nn.Linear(state_dim, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, action_num)

    def forward(self, state, action=None):
        a = t.relu(self.fc1(state))
        a = t.relu(self.fc2(a))
        probs = t.softmax(self.fc3(a), dim=1)
        dist = Categorical(probs=probs)
        act1 = (action
                if action is not None
                else dist.sample())
        act2 = (action
                if action is not None
                else dist.sample())
        act_entropy = dist.entropy()
        act1_log_prob = dist.log_prob(act1.flatten())
        act2_log_prob = dist.log_prob(act2.flatten())
        act = t.tensor([act1, act2])
        return act, act1_log_prob + act2_log_prob, act_entropy
iffiX commented 3 years ago

act should be act = t.tensor([[act1, act2]]): the first dimension is always the batch dimension. Also note that when action is not None, act1 and act2 must be taken from that passed-in action instead of being re-sampled, so your code should look like:

def forward(self, state, action=None):
    a = t.relu(self.fc1(state))
    a = t.relu(self.fc2(a))
    probs = t.softmax(self.fc3(a), dim=1)
    dist = Categorical(probs=probs)
    act1 = (action[:, 0]
            if action is not None
            else dist.sample())
    act2 = (action[:, 1]
            if action is not None
            else dist.sample())
    act_entropy = dist.entropy()
    act1_log_prob = dist.log_prob(act1.flatten())
    act2_log_prob = dist.log_prob(act2.flatten())
    act = t.tensor([[act1, act2]])
    return act, act1_log_prob + act2_log_prob, act_entropy
joaomatoscf commented 3 years ago

I added those changes and it solved the issue of storing the episode.

Now the issue is on ppo.update(): [screenshot of the error traceback omitted]

This is because, when updating, act1 and act2 are no longer scalars but 1D arrays. How do you propose to solve this?

iffiX commented 3 years ago

Oh, sorry, I forgot about that behavior. Just use torch.cat in that case:

act = t.cat((act1.view(1, -1), act2.view(1, -1)), dim=0)
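
For reference, a sketch of the full forward with this fix folded in might look like the following. It uses view(-1, 1) with dim=1 (a variant of the line above, chosen here as an assumption) so that act keeps the batch dimension first, shape [batch, 2], both when sampling (batch size 1) and inside ppo.update():

def forward(self, state, action=None):
    a = t.relu(self.fc1(state))
    a = t.relu(self.fc2(a))
    probs = t.softmax(self.fc3(a), dim=1)
    dist = Categorical(probs=probs)
    # when an action batch is passed in, split it into the two sub-actions
    act1 = action[:, 0] if action is not None else dist.sample()
    act2 = action[:, 1] if action is not None else dist.sample()
    act_entropy = dist.entropy()
    act1_log_prob = dist.log_prob(act1.flatten())
    act2_log_prob = dist.log_prob(act2.flatten())
    # stack the two sub-actions column-wise: shape [batch, 2]
    act = t.cat((act1.view(-1, 1), act2.view(-1, 1)), dim=1)
    return act, act1_log_prob + act2_log_prob, act_entropy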