hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

MlpPolicy network output layer softmax activation for continuous action space problem? #1190

Open wbzhang233 opened 10 months ago

wbzhang233 commented 10 months ago

In a continuous action space problem, we can use the PPO/A2C algorithms to predict continuous actions, but I want to use softmax as my output activation function with net_arch=[256, 256]. I have read and tested the tutorial post. When I run the code below, the action does not sum to one, so the softmax does not take effect. I found the action_net in model.policy, but I could not set softmax as its activation function.

import torch
from stable_baselines3 import PPO  # the torch-based policy_kwargs imply SB3

policy_kwargs = {
    "activation_fn": torch.nn.Softmax,  # only inserted between hidden layers, not on the output
    "net_arch": [256, 256],
}
# env is the user's continuous-action-space environment
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=1)

How can I use softmax as a custom activation function on the action output layer?
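
For context, a minimal inspection sketch (assuming stable-baselines3's PPO; not part of the original report): activation_fn from policy_kwargs is only inserted between the hidden layers built from net_arch, while the final action_net remains a plain linear layer, which is why the output is not softmax-normalized.

# Sketch (assumes an SB3 PPO model): see where activation_fn actually ends up.
print(model.policy.mlp_extractor)  # hidden layers use the configured activation
print(model.policy.action_net)     # bare Linear(256 -> act_dim), no softmax applied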

wbzhang233 commented 10 months ago

I want the action of PPO to represent a probability distribution, so I need to use softmax as the activation function. It is entirely a continuous action space problem.
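
One possible workaround, sketched here under the assumption of a gym environment and an SB3 PPO model (the wrapper name is hypothetical, not from this thread): keep the policy's Gaussian output unconstrained and apply the softmax inside a gym.ActionWrapper, so the environment always receives a vector that sums to one.

import numpy as np
import gym

class SoftmaxActionWrapper(gym.ActionWrapper):
    """Map the raw continuous action through a softmax before the env sees it."""

    def action(self, action):
        # numerically stable softmax over the action vector
        z = action - np.max(action)
        e = np.exp(z)
        return e / e.sum()

# usage: wrap the base environment before passing it to PPO
# env = SoftmaxActionWrapper(base_env)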

rambo1111 commented 9 months ago

https://github.com/hill-a/stable-baselines/issues/1192