hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

MlpPolicy network output layer softmax activation for continuous action space problem? #1190

Open wbzhang233 opened 10 months ago

wbzhang233 commented 10 months ago

In a continuous action space problem, we can use the PPO/A2C algorithms to predict continuous actions, but I want to use softmax as my output activation function with net_arch=[256, 256]. I have read and tested the tutorial post. When I run the code below, the action does not sum to one, so the softmax does not take effect. I found the action_net in model.policy, but I could not set softmax as its activation function.

import torch
from stable_baselines3 import PPO  # the torch-based policy_kwargs imply SB3

policy_kwargs = {
    "activation_fn": torch.nn.Softmax,  # only inserted between hidden layers, not on the output
    "net_arch": [256, 256],
}
# env is the user's continuous-action-space environment
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=1)

How can I use softmax as a custom activation function on the action output layer?
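
For context, a minimal inspection sketch (assuming stable-baselines3's PPO; not part of the original report): activation_fn from policy_kwargs is only inserted between the hidden layers built from net_arch, while the final action_net remains a plain linear layer, which is why the output is not softmax-normalized.

# Sketch (assumes an SB3 PPO model): see where activation_fn actually ends up.
print(model.policy.mlp_extractor)  # hidden layers use the configured activation
print(model.policy.action_net)     # bare Linear(256 -> act_dim), no softmax applied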

wbzhang233 commented 10 months ago

I want the action of PPO to represent a probability distribution, so I need to use softmax as the activation function. It is entirely a continuous action space problem.
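
One possible workaround, sketched here under the assumption of a gym environment and an SB3 PPO model (the wrapper name is hypothetical, not from this thread): keep the policy's Gaussian output unconstrained and apply the softmax inside a gym.ActionWrapper, so the environment always receives a vector that sums to one.

import numpy as np
import gym

class SoftmaxActionWrapper(gym.ActionWrapper):
    """Map the raw continuous action through a softmax before the env sees it."""

    def action(self, action):
        # numerically stable softmax over the action vector
        z = action - np.max(action)
        e = np.exp(z)
        return e / e.sum()

# usage: wrap the base environment before passing it to PPO
# env = SoftmaxActionWrapper(base_env)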

rambo1111 commented 9 months ago

https://github.com/hill-a/stable-baselines/issues/1192