tkelestemur opened this issue 3 years ago
After some digging into the A2C code, I realized that the log probabilities of the policy need to have the shape (update_steps, num_processes) so that they can be properly multiplied with the advantages. As a quick workaround, we can sum the log probabilities across the dimensions of the action space by changing this line to action_log_probs = pout.log_prob(actions).sum(dim=1)
as explained in this paper.
This should fix A2C, but a more general approach for supporting MultiDiscrete action spaces should be considered.
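For concreteness, here is a minimal sketch (shapes and variable names are illustrative, not taken from the repo) of why summing over the action dimensions restores the expected shape:

```python
import torch

# Illustrative shapes: update_steps=5, num_processes=2, and a 4-dimensional
# MultiDiscrete action space with 11 choices per dimension.
logits = torch.randn(5 * 2, 4, 11)
actions = torch.randint(0, 11, (5 * 2, 4))

pout = torch.distributions.Categorical(logits=logits)
# Per-dimension log probs have shape (10, 4); summing over dim=1 gives (10,),
# which can then be reshaped to (update_steps, num_processes) and multiplied
# elementwise with the advantages.
action_log_probs = pout.log_prob(actions).sum(dim=1)
print(action_log_probs.shape)  # torch.Size([10])
```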
I guess you can wrap the output of SoftmaxCategoricalHead with torch.distributions.Independent so that your resulting distribution's batch_shape is (N,) and event_shape is (4,).
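Something along these lines should work as a rough sketch; MultiSoftmaxCategoricalHead is just an illustrative name, not an existing class:

```python
import torch.nn as nn
from torch.distributions import Categorical, Independent


class MultiSoftmaxCategoricalHead(nn.Module):
    """Illustrative head: expects logits of shape (N, 4, 11)."""

    def forward(self, logits):
        # Categorical over the last dim gives batch_shape (N, 4) and event_shape ().
        # Independent reinterprets the action dimension, so the joint distribution
        # has batch_shape (N,) and event_shape (4,); log_prob of an (N, 4) action
        # tensor then returns a tensor of shape (N,).
        return Independent(Categorical(logits=logits), reinterpreted_batch_ndims=1)
```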
I'm also running into a similar issue with my environment, and even before getting to the update rule, I'm facing the problem of having a multi-discrete action space with a different number of actions along each dimension. For example, dimension 1 has 5 actions, dimension 2 has 3 actions, and dimension 3 has 10 actions.
How would I code up the final layer of the policy in that case? In the issue above, the author could nicely unflatten the tensor into uniform shapes along each dimension, but I'm not aware of any way to do that for multi-discrete action spaces with different sizes.
Also, please let me know if you would rather have me open a new issue for this topic. Thanks!
I'm facing the problem of having a multi-discrete action space with a different number of actions along each dimension. For example, dimension 1 has 5 actions, dimension 2 has 3 actions, and dimension 3 has 10 actions.
I think this requires a new subclass of torch.distributions.Distribution that models a joint distribution of multiple categorical distributions of different sizes.
Yes, I think you're right. I've managed to get something simple working by modeling individual Categorical torch distributions and then combining them. Thanks a lot, although please do consider including agents that support MultiDiscrete action spaces in the future. I think it would be really helpful.
A perhaps easier but less clean workaround is to model it as a joint distribution of same-sized categorical distributions using Independent and Categorical, but set the logits for unused categories to very low values so that they are never sampled.
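A rough sketch of that workaround (the dimension sizes and the padding value are illustrative):

```python
import torch
from torch.distributions import Categorical, Independent

# Example space with 5, 3, and 10 actions per dimension, padded to the largest size.
action_dims = [5, 3, 10]
max_dim = max(action_dims)
batch_size = 8

# Suppose the policy outputs logits of shape (N, 3, 10).
logits = torch.randn(batch_size, len(action_dims), max_dim)

# Push the logits of unused categories to a very low value so they are never sampled.
mask = torch.zeros(len(action_dims), max_dim)
for i, n in enumerate(action_dims):
    mask[i, n:] = -1e8
dist = Independent(Categorical(logits=logits + mask), 1)

actions = dist.sample()             # shape (N, 3), each column stays within its own range
log_probs = dist.log_prob(actions)  # shape (N,)
```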
@muupan Thanks, wrapping the output of Categorical with Independent worked fine for multi-discrete action spaces where every dimension has the same size.
@xylee95, can you share how you managed to get it working with action dimensions of different sizes?
I'm currently writing a class based on the MultiCategoricalDistribution from stable_baselines3 and will hopefully open a PR soon.
@tkelestemur Yes, that is exactly what I did. I wrote a class based on the MultiCategoricalDistribution from stable_baselines3 and changed some of the function names to fit the log_prob calls in the agent. It works fine, but I've only tested it with PPO so far and not with other agents. If you need more details, I'll be happy to share.
@xylee95 can you share your implementation? I've tried to write a subclass of torch.distributions.Distribution but didn't have much success.
@tkelestemur This is my implementation. It is almost a copy-and-paste of the stable_baselines3 code, and I did not write it as a subclass of torch.distributions.Distribution; instead, I created a new class that returns a list of torch distributions. It would definitely be much cleaner if written as a subclass of torch.distributions.Distribution.
```python
import torch
import torch.nn as nn


class MultiCategoricalDistribution():
    def __init__(self, action_dims):
        """Initialization.

        action_dims: list with the number of discrete actions per dimension.
        """
        super(MultiCategoricalDistribution, self).__init__()
        self.action_dims = action_dims

    def proba_distribution_net(self, latent_dim):
        """Create the layer that represents the distribution.

        It will be the logits (flattened) of the MultiCategorical distribution.
        You can then get probabilities using a softmax on each sub-space.
        """
        action_logits = nn.Linear(latent_dim, sum(self.action_dims))
        return action_logits

    def proba_distribution(self, action_logits):
        """Create a list of categorical distributions, one per action dimension."""
        self.distribution = [
            torch.distributions.Categorical(logits=split)
            for split in torch.split(action_logits, tuple(self.action_dims), dim=1)
        ]
        return self

    def log_prob(self, actions):
        """Extract each discrete action and compute its log prob under the respective distribution."""
        return torch.stack(
            [dist.log_prob(action) for dist, action in zip(self.distribution, torch.unbind(actions, dim=1))],
            dim=1,
        ).sum(dim=1)

    def entropy(self):
        """Compute the sum of the entropies of the individual categorical distributions."""
        return torch.stack([dist.entropy() for dist in self.distribution], dim=1).sum(dim=1)

    def sample(self):
        """Sample an action from each individual categorical distribution."""
        return torch.stack([dist.sample() for dist in self.distribution], dim=1)

    def mode(self):
        """Compute the mode of each categorical distribution."""
        return torch.stack([torch.argmax(dist.probs, dim=1) for dist in self.distribution], dim=1)

    def get_actions(self, deterministic=False):
        """Return actions according to the probability distribution."""
        if deterministic:
            return self.mode()
        return self.sample()

    def actions_from_params(self, action_logits, deterministic=False):
        """Update the probability distribution and return actions."""
        self.proba_distribution(action_logits)
        return self.get_actions(deterministic=deterministic)

    def log_prob_from_params(self, action_logits):
        """Compute actions and their log-probabilities from the logits."""
        actions = self.actions_from_params(action_logits)
        log_prob = self.log_prob(actions)
        return actions, log_prob
```
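For reference, a quick usage sketch of the class above (the dimension sizes and latent size are made up for illustration):

```python
import torch

# Action space with 5, 3, and 10 actions per dimension; 64-dimensional policy features.
dist = MultiCategoricalDistribution(action_dims=[5, 3, 10])
logits_net = dist.proba_distribution_net(latent_dim=64)

latent = torch.randn(8, 64)
actions, log_prob = dist.log_prob_from_params(logits_net(latent))
print(actions.shape, log_prob.shape)  # torch.Size([8, 3]) torch.Size([8])
entropy = dist.entropy()              # shape (8,)
```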
I have a custom environment with a MultiDiscrete action space. The MultiDiscrete action space allows controlling an agent with an n-dimensional discrete action space.
In my environment, I have 4 dimensions, where each dimension has 11 actions. I'm trying to use A2C with a softmax policy. Below is the implementation of the policy and value networks. The output of the policy gives me an [N, 4, 11] tensor, where N is the batch size. The softmax is applied to the last dimension of this tensor, so I basically have 4 action distributions. I thought this would work, but I'm getting the following error:
Do I need to make changes to the A2C or am I doing something wrong?