iffiX / machin

Reinforcement learning library (framework) designed for PyTorch; implements DQN, DDPG, A2C, PPO, SAC, MADDPG, A3C, APEX, IMPALA, ...
MIT License

[Question] Hybrid action space #16

Closed · ilyalasy closed this 3 years ago

ilyalasy commented 3 years ago

Hey, I'm trying to implement a hybrid action space with an A2C agent, maybe you have some advice. My expected output is two actions: one discrete, one continuous. The network predicts three things: the logits of the discrete action, plus the mean and standard deviation of the continuous one.

The net outputs the sum of the log probabilities of the actions from both distributions (and likewise for the entropy). The network successfully learns the mean and std, but the weights of the logits layer are not updated at all. What can be the reason?

iffiX commented 3 years ago

The actor network gets its gradient through the log probability, and since you output the sum of the two log probs, that part should be fine.

From your description, I guess it might be a problem in your network implementation. Could you please show the code of your forward call?
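
For context, a minimal sketch of this point (plain PyTorch with toy shapes; the two heads below are illustrative, not the user's actual layers) shows that a summed log probability does propagate gradients to both the discrete and the continuous head:

import torch as t
from torch.distributions import Categorical, Normal

state = t.randn(4, 8)                    # toy batch of 4 states
logits_head = t.nn.Linear(8, 3)          # discrete head: 3 candidate actions
mu_head = t.nn.Linear(8, 1)              # continuous head: mean
log_sigma = t.nn.Parameter(t.zeros(1))   # learnable log std

dist_d = Categorical(logits=logits_head(state))
dist_c = Normal(mu_head(state), log_sigma.exp())

a_d = dist_d.sample()
a_c = dist_c.sample()

# sum of both log probs, as described in the question
log_prob = dist_d.log_prob(a_d) + dist_c.log_prob(a_c).squeeze(-1)
log_prob.sum().backward()

print(logits_head.weight.grad.norm())    # non-zero: gradient reaches the logits
print(mu_head.weight.grad.norm())        # non-zero: gradient reaches the mean

If the logits head receives a non-zero gradient here but not in the real model, the break is somewhere between the features and the Categorical.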

ilyalasy commented 3 years ago

My code is somewhat complicated, so I cut some unnecessary parts and changed others to pseudo-code, but the main logic remains.

def forward(self, state, valid_actions, intervals, action=None, continuous_action=None):
    ### extract features
    features = F.relu(self.conv1(state))

    ### predict discrete action
    predicted = self.action_head(features)  # Linear to max available actions

    masked_features = self._mask(predicted, valid_actions)  # mask out actions invalid in the current state

    probs = t.softmax(masked_features, dim=1)

    dist = Categorical(probs)
    action = action if action is not None else dist.sample()
    discrete_log_prob = dist.log_prob(action)
    entropy = dist.entropy()

    ### predict continuous action
    pooled_features = global_mean_pool(features)
    mu = self.mu_act(self.mu_head(pooled_features))            # Linear to 1 + sigmoid
    sigma = self.sigma_act(self.sigma_head(pooled_features))   # Linear to 1 + sigmoid

    # in each state the continuous action can belong to a different interval
    indexes = action.repeat(1, 1, 2)
    intervals = t.gather(intervals, 1, indexes)
    a, b = intervals[..., 0].view(-1, 1), intervals[..., 1].view(-1, 1)
    mean = mu * (b - a) + a
    std = sigma * (b - a) + a

    dist = TruncatedNormal(mean, std, a, b)
    continuous_action = (continuous_action
                         if continuous_action is not None
                         else dist.sample())
    continuous_action_log_prob = dist.log_prob(continuous_action)
    continuous_entropy = dist.entropy()

    # variables starting with need_ are calculated earlier and equal either 0 or 1:
    # sometimes it is necessary to generate only one of the actions, not both

    continuous_action = need_continuous * continuous_action
    action_log_prob = need_continuous * continuous_action_log_prob + need_discrete * discrete_log_prob
    entropy = need_continuous * continuous_entropy + need_discrete * entropy

    return (action.long(), (continuous_action.float(), mean, std)), action_log_prob.float(), entropy
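
A quick way to narrow this down (a sketch only; net, state, valid_actions and intervals stand in for the real model and inputs) is to run one backward pass through the summed log prob and print every parameter's gradient norm:

# `net`, `state`, `valid_actions`, `intervals` are placeholders for the
# real model and batch; the unpacking matches the forward above.
(action, (cont_action, mean, std)), log_prob, entropy = net(
    state, valid_actions, intervals
)
net.zero_grad()
log_prob.sum().backward()

for name, p in net.named_parameters():
    grad_norm = 0.0 if p.grad is None else p.grad.norm().item()
    print(f"{name}: grad norm = {grad_norm:.3e}")

If the action_head parameters report a zero (or missing) gradient while mu_head / sigma_head do not, the break is on the discrete branch before the Categorical.
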
iffiX commented 3 years ago

I suspect self._mask may output something detached from your action head output, or, if you are multiplying your action head output by an all-zero binary mask, it is also possible that no gradient reaches your discrete network.

You can set visualize=True when you initialize your A2C agent; a PDF file of your network's gradient flow will be generated. Can you post that visualization here?
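
As a rough sketch of where the flag goes (the argument order follows machin's published A2C examples; treat the exact signature as an assumption to check against the docs):

from machin.frame.algorithms import A2C
import torch as t

# `actor` and `critic` are assumed to be the user's own modules;
# visualize=True makes the framework generate the gradient-flow PDF
# mentioned above.
a2c = A2C(actor, critic, t.optim.Adam, t.nn.MSELoss(reduction="sum"),
          visualize=True)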

ilyalasy commented 3 years ago

My self._mask is masked_fill_ with fill_value=float('-inf'), so the output of softmax would be 0 for those values. Attaching the visualization for only the first part of the code, with the categorical distribution alone (I temporarily deleted the continuous part just for testing): actor.pdf. P.S. The flow graph is complicated because conv1 is actually a graph convolution.
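
As a side check (a self-contained sketch, not taken from the thread), masking logits with -inf before the softmax does keep the gradient flowing to the unmasked entries, unlike multiplying the probabilities by an all-zero mask:

import torch as t

logits = t.randn(1, 4, requires_grad=True)
invalid = t.tensor([[True, False, True, False]])      # True = invalid action

masked = logits.masked_fill(invalid, float("-inf"))   # out-of-place variant of the same idea
probs = t.softmax(masked, dim=1)                      # masked entries become exactly 0

probs[0, 1].backward()   # differentiate one valid action's probability
print(logits.grad)       # non-zero only in the valid (unmasked) columns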

iffiX commented 3 years ago

The flow seems fine to me, at least after the ReLUBackward0 node, which should correspond to your features = F.relu(self.conv1(state)), so there is probably no problem in the framework.

I would recommend taking a look at the machin.utils.checker.check_model function and setting up a custom checker to find where the gradient becomes zero.

Another possible cause is that the A2C algorithm may not optimize the discrete part, if calling the .parameters() method on your actor model does not return your discrete net's parameters. But this case is rare.
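
For the first suggestion, a rough hook-based alternative in plain PyTorch (a sketch; this is not machin's check_model API) would be:

import torch as t

def report_zero_grads(model: t.nn.Module):
    # print the name of every parameter whose incoming gradient is all zeros
    def make_hook(name):
        def hook(grad):
            if grad is not None and grad.abs().sum() == 0:
                print(f"all-zero gradient for parameter: {name}")
        return hook
    for name, p in model.named_parameters():
        if p.requires_grad:
            p.register_hook(make_hook(name))

# usage: call report_zero_grads(actor) once before training, then run a
# training step and watch which parameter names get printed.

The second point can be ruled out by checking that list(actor.parameters()) contains the discrete head's weights.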

ilyalasy commented 3 years ago

Yep, I think I found the problem. It's not connected to machin. Sorry for the disturbance, and thank you for your help.

iffiX commented 3 years ago

No problem, happy to help you. :)