Closed. ilyalasy closed this issue 3 years ago.
The actor network gets its gradients through the log probability, and since you output the sum of these two log probs, it should be fine.
From your description, I guess it might be a problem in your network implementation. Could you please show the code of your forward call?
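For example, here is a rough, self-contained sketch (toy layer sizes and fake advantages, not your actual model) showing that a loss built from the summed log probabilities sends gradients to all three heads:

```python
import torch
from torch.distributions import Categorical, Normal

torch.manual_seed(0)
features = torch.randn(4, 16)                  # stand-in for extracted features
discrete_head = torch.nn.Linear(16, 3)         # logits of the discrete action
mu_head = torch.nn.Linear(16, 1)               # mean of the continuous action
sigma_head = torch.nn.Linear(16, 1)            # std of the continuous action (via softplus)

d_dist = Categorical(logits=discrete_head(features))
c_dist = Normal(mu_head(features), torch.nn.functional.softplus(sigma_head(features)))

# summed log probability of the sampled hybrid action
log_prob = d_dist.log_prob(d_dist.sample()) + c_dist.log_prob(c_dist.sample()).squeeze(-1)
advantage = torch.randn(4)                     # fake advantages, just for the gradient check
(-(log_prob * advantage).mean()).backward()

for head in (discrete_head, mu_head, sigma_head):
    print(head.weight.grad.abs().sum())        # all three should be non-zero
```

If any of the three printed gradient sums is zero, the corresponding head is cut off from the loss graph somewhere.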
My code is rather complicated, so I cut some unnecessary parts and changed some of them to pseudo-code, but the main logic remains.
```python
def forward(self, state, valid_actions, intervals, action=None, continuous_action=None):
    ### extract features
    features = F.relu(self.conv1(state))

    ### predict discrete action
    predicted = self.action_head(features)                    # Linear to max available actions
    masked_features = self._mask(predicted, valid_actions)    # masking with valid actions for the current state
    probs = t.softmax(masked_features, dim=1)
    dist = Categorical(probs)
    action = action if action is not None else dist.sample()
    discrete_log_prob = dist.log_prob(action)
    entropy = dist.entropy()

    ### predict continuous action
    pooled_features = global_mean_pool(features)               # graph-level pooling (conv1 is a graph convolution)
    mu = self.mu_act(self.mu_head(pooled_features))            # Linear to 1 + sigmoid
    sigma = self.sigma_act(self.sigma_head(pooled_features))   # Linear to 1 + sigmoid

    # in each state the continuous action can belong to a different interval,
    # so gather the [a, b] interval that corresponds to the chosen discrete action
    indexes = action.repeat(1, 1, 2)
    intervals = t.gather(intervals, 1, indexes)
    a, b = intervals[..., 0].view(-1, 1), intervals[..., 1].view(-1, 1)
    mean = mu * (b - a) + a
    std = sigma * (b - a) + a
    dist = TruncatedNormal(mean, std, a, b)                    # custom/third-party truncated normal distribution
    continuous_action = (continuous_action
                         if continuous_action is not None
                         else dist.sample())
    continuous_action_log_prob = dist.log_prob(continuous_action)
    continuous_entropy = dist.entropy()

    # variables starting with need_ are calculated earlier and equal either 0 or 1:
    # sometimes it is necessary to generate only one of the actions, not both
    continuous_action = need_continuous * continuous_action
    action_log_prob = need_continuous * continuous_action_log_prob + need_discrete * discrete_log_prob
    entropy = need_continuous * continuous_entropy + need_discrete * entropy

    return (action.long(), (continuous_action.float(), mean, std)), action_log_prob.float(), entropy
```
I suspect `self._mask` may output something that is detached from your action head, or, if you are multiplying your action head output by a binary mask of all zeros, it is also possible that no gradient flows to your discrete network.
You can set `visualize=True` when you initialize your A2C agent; a PDF file of your network's gradient flow will be generated. Can you post that visualization here?
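For illustration, here is a rough sketch (hypothetical sizes, not your code) of the two behaviours mentioned above: an all-zero multiplicative mask produces a constant output and therefore a zero gradient for the head, while filling invalid entries with -inf only zeroes out their probabilities and keeps the valid logits differentiable:

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
head = torch.nn.Linear(8, 5)                   # stand-in for action_head
features = torch.randn(3, 8)

# Case 1: multiply the logits by an all-zero binary mask.
# The result is a constant zero tensor, so the head receives a zero gradient.
logits = head(features)
probs = torch.softmax(logits * torch.zeros_like(logits), dim=1)
dist = Categorical(probs)
dist.log_prob(dist.sample()).sum().backward()
print(head.weight.grad.abs().sum())            # tensor(0.) -> the head cannot learn
head.weight.grad = None

# Case 2: fill invalid entries with -inf before the softmax.
# Invalid actions get probability 0, but gradients still flow through the valid logits.
logits = head(features)
invalid = torch.tensor([[False, False, True, True, False]]).expand(3, -1)
probs = torch.softmax(logits.masked_fill(invalid, float('-inf')), dim=1)
dist = Categorical(probs)
dist.log_prob(dist.sample()).sum().backward()
print(head.weight.grad.abs().sum())            # non-zero -> gradient reaches the head
```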
My `self._mask` is `masked_fill_` with `fill_value=float('-inf')`, so the output of the softmax is 0 for these values.
Attaching the visualization for the first part of the code only, with just the categorical distribution (I temporarily deleted the continuous part for testing):
actor.pdf
P.S. The flow graph is complicated because conv1 is actually a graph convolution.
The flow seems fine to me, at least after the `ReLUBackward0` part, which should correspond to your `features = F.relu(self.conv1(state))`, so there is probably no problem in the framework.
I would recommend taking a look at the `machin.utils.checker.check_model` function and setting up a custom checker to find where the gradient becomes zero.
Another possible cause is that the algorithm may not optimize the discrete part if calling the `.parameters()` method on your actor model does not return your discrete net's parameters, but this case is rare.
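If setting up the checker is inconvenient, a plain-PyTorch alternative is to simply inspect the gradients after a backward pass, for example with a rough helper like the one below (generic names, not machin's API); iterating over `named_parameters()` also shows whether `.parameters()` actually includes your discrete head:

```python
import torch

def report_zero_grads(model: torch.nn.Module) -> None:
    # After loss.backward(), list parameters whose gradient is missing or exactly zero.
    for name, param in model.named_parameters():
        if param.grad is None:
            print(f"{name}: no gradient at all (detached from the loss?)")
        elif param.grad.abs().max() == 0:
            print(f"{name}: gradient is all zeros")

# usage inside your training loop, right after the backward pass:
# report_zero_grads(actor)
```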
Yep, I think I found the problem. It's not connected to machin. Sorry for the disturbance, and thank you for your help.
No problem, happy to help you. :)
Hey, I'm trying to implement a hybrid action space with an A2C agent; maybe you have some advice. My expected output is two actions: one discrete, one continuous. The network predicts three things: the logits of the discrete action, and the mean and std of the continuous action's distribution.
The net outputs the sum of the log probabilities of the actions from both distributions (same for entropy). The network successfully learns the mean and std, but the weights of the logits layer are not updated at all. What could be the reason?