stefanbschneider closed this issue 5 years ago
I want to apply separate softmax activation functions to different parts of the output (e.g., softmax(first n actions), then softmax(next n), etc).
You mean you want to use a MultiDiscrete action space? This is handled automatically in SB for A2C.
SAC only supports continuous actions for now.
Thanks for the hint, but no, I don't want to use a MultiDiscrete action space. I do want continuous actions, just with the additional constraint that some actions should sum up to one, which is necessary for the specific use case.
E.g., let's say I have 4 continuous actions, and actions 1 and 2 should sum up to 1, and actions 3 and 4 should also sum up to 1. In a different framework (using DDPG), I just created separate softmax activation functions for a) actions 1 and 2 and b) actions 3 and 4 to make sure that each pair sums up to 1. That worked quite well. Now my idea was to do a comparison with other algorithms implemented in SB and do the same there.
In general, how do I change the activation function of the output layer in SB? Or is this really bad practice somehow?
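To make the idea concrete, here is a minimal sketch (not tied to any SB policy) of such a grouped softmax as a TensorFlow op; the function name and the group_size argument are just for illustration:

```python
import tensorflow as tf

def grouped_softmax(logits, group_size=2):
    """Apply a softmax separately to each consecutive block of `group_size`
    outputs, so that e.g. actions 1-2 and actions 3-4 each sum to 1."""
    grouped = tf.reshape(logits, (-1, group_size))          # one row per group
    return tf.reshape(tf.nn.softmax(grouped), tf.shape(logits))
```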
just created separate softmax activation functions for a) actions 1 and 2 and b) actions 3 and 4 to make sure that each pair sums up to 1
Can't you do the normalization inside the environment? Or use a gym wrapper (cf. the tutorial)?
You also have to know that by doing so, you change the probability distribution (for SAC or A2C; in the case of DDPG/TD3, there is none, so it is not a problem).
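To make the wrapper suggestion concrete, here is a rough sketch of a gym ActionWrapper that normalizes action groups before they reach the environment; the class name and the group boundaries (0-2 and 2-4, mirroring the 4-action example above) are just for illustration:

```python
import gym
import numpy as np

class GroupSoftmaxWrapper(gym.ActionWrapper):
    """Normalize each action group with a softmax so that it sums to 1
    before the action is passed to the wrapped environment."""

    def __init__(self, env, groups=((0, 2), (2, 4))):
        super(GroupSoftmaxWrapper, self).__init__(env)
        self.groups = groups  # (start, end) index pairs

    def action(self, action):
        action = np.array(action, dtype=np.float64)
        for start, end in self.groups:
            block = action[start:end]
            exp = np.exp(block - block.max())  # numerically stable softmax
            action[start:end] = exp / exp.sum()
        return action
```

Note that with this approach the agent keeps learning on the raw (unnormalized) actions it sampled, while the environment sees the normalized ones.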
Can't you do the normalization inside the environment?
Yes, good point, I can. For DDPG, it seemed that normalizing the actions in the environment led to worse results. I assumed this was because the actions put into the replay buffer did not match the normalized actions that were actually applied to the environment.
But maybe this works better for A2C and SAC - I'll test. Also thanks for the tutorial, I'll check that too.
You also have to know that by doing so, you change the probability distribution (for SAC or A2C; in the case of DDPG/TD3, there is none, so it is not a problem).
Sorry, could you elaborate a bit more on what you mean? I read the papers but didn't get all of it. How/where do SAC and A2C use a probability distribution? I saw the Gaussian distribution for continuous actions in the SB code, but didn't quite get the point of it.
Thanks for the prompt reply and help, btw!
I saw the Gaussian distribution for continuous actions in the SB code, but didn't quite get the point of it.
The best answer to that question is to read the Spinning Up guide. This is quite important to understand when using RL with continuous actions.
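For context, a tiny numpy illustration (not SB code) of that point: A2C and SAC sample continuous actions from a Gaussian whose log-probability enters the loss, so renormalizing the sampled action afterwards means the executed action no longer matches that distribution. The numbers are arbitrary.

```python
import numpy as np

# The policy outputs a mean and log_std for each action dimension ...
mean = np.array([0.2, -0.1, 0.4, 0.0])
log_std = np.array([-1.0, -1.0, -1.0, -1.0])

# ... and the action is sampled from the corresponding Gaussian.
action = mean + np.exp(log_std) * np.random.randn(mean.shape[0])

# The log-probability of this sample under the Gaussian is what enters the loss.
log_prob = -0.5 * np.sum(
    ((action - mean) / np.exp(log_std)) ** 2 + 2 * log_std + np.log(2 * np.pi)
)

# If `action` is renormalized afterwards (e.g. by a softmax in the env),
# the action actually executed no longer matches this log-probability.
```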
For DDPG it seemed that normalizing the actions in the environment
If you want to have normalized values in the replay buffer, then you just need to derive from the DDPG class and normalize the action right after the action prediction. You can find an example here of a custom class deriving from an SB model.
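A rough sketch of that idea is below; note that the `_policy` hook and its signature are an assumption about the stable-baselines DDPG internals and may differ between versions, so please check against the code you are running:

```python
import numpy as np
from stable_baselines import DDPG

class NormalizedDDPG(DDPG):
    """Normalize the predicted action so that the normalized values are what
    end up in the replay buffer and in the environment."""

    def _policy(self, obs, apply_noise=True, compute_q=True):
        # Assumed hook: the method DDPG uses to predict actions during rollouts.
        action, q_value = super(NormalizedDDPG, self)._policy(
            obs, apply_noise=apply_noise, compute_q=compute_q)
        action = np.array(action, dtype=np.float64)
        for start, end in ((0, 2), (2, 4)):  # example action groups as above
            block = action[start:end]
            exp = np.exp(block - block.max())
            action[start:end] = exp / exp.sum()
        return action, q_value
```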
Ok, thanks! I'll see how far I get with normalizing inside the environment.
Hey buddy, I have the same question as you. I want to use a custom softmax activation function for the action output layer, just to make the output sum up to one; it's a continuous action space problem. But it seems SB3 does not support custom activation functions for the action output layer. Did you solve your problem?
Hi @wbzhang233, I'm not sure if I ended up using SB2 or something else. This is what I did in a related project: https://github.com/RealVNF/DeepCoord/blob/ef18520abfed0c0f66a6723dbdb35e2d549d339a/src/rlsp/agents/rlsp_ddpg.py#L51 (without SB)
I really like stable_baselines and would like to use it for a custom environment with continuous actions. To match the specific needs of the environment, I need to apply a custom activation function to the output of the policy/actor network. In particular, I want to apply separate softmax activation functions to different parts of the output (e.g., softmax(first n actions), then softmax(next n), etc).

I know how to define such an activation in general, but don't know what the best and cleanest way is to implement such a policy in stable_baselines. I'd like to reuse the MlpPolicy and just change the activation of the output layer. I'm interested in using this with A2C and SAC.

In A2C, it seems like this is handled here or here. But I don't want to mess something up making changes there without being certain.

In SAC, I guess I would only have to adjust this part: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/sac/policies.py#L217 Or do I need to change the log_std below as well?

This seems related to this issue. Unfortunately, it didn't help me figure out my problem/question.
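For what it's worth, the standard custom-policy route documented for SB2 is sketched below, but as far as I can tell it only changes the hidden architecture; the output/squashing of the actor is built inside sac/policies.py itself, so changing the output activation still means overriding or editing that code as discussed above. The environment name and layer sizes are placeholders.

```python
from stable_baselines import SAC
from stable_baselines.sac.policies import FeedForwardPolicy

class CustomSACPolicy(FeedForwardPolicy):
    """Custom MLP policy: changes the hidden layers only, not the output activation."""
    def __init__(self, *args, **kwargs):
        kwargs.update(layers=[64, 64], feature_extraction="mlp")
        super(CustomSACPolicy, self).__init__(*args, **kwargs)

model = SAC(CustomSACPolicy, "Pendulum-v0", verbose=1)
```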