hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[Question] Default activation function for MLP Policy #616

Closed: matthew-hsr closed this issue 4 years ago

matthew-hsr commented 4 years ago

It seems that the default activation function for the MLP policy is set to tf.tanh (e.g. in class FeedForwardPolicy and class LstmPolicy in policies.py).

Correct me if I'm wrong, but isn't tanh well known to be more expensive to compute and to suffer from the vanishing gradient problem in deep networks? Is this default activation an informed choice for reinforcement learning algorithms, or was it picked arbitrarily? Is there any particular situation in which tanh is superior to, say, relu?

Thanks in advance!

(If you have time, can you answer this quick question too?)

charles-blouin commented 4 years ago

You can easily change the default activation function by passing policy_kwargs, for example:

import tensorflow as tf  # needed for tf.nn.tanh
policy_kwargs = dict(act_fun=tf.nn.tanh, net_arch=[32, 32])
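For context, here is a minimal, self-contained sketch of passing that dictionary to a model. PPO2, the 'MlpPolicy' string and the CartPole-v1 environment id are just illustrative choices, and relu is used here to show overriding the tanh default:

import tensorflow as tf
from stable_baselines import PPO2

# Two hidden layers of 32 units, with relu instead of the default tanh.
policy_kwargs = dict(act_fun=tf.nn.relu, net_arch=[32, 32])

# Any supported policy/environment combination works the same way.
model = PPO2('MlpPolicy', 'CartPole-v1', policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=10000)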

Some papers mention that relu causes more issues than tanh when used outside of simulation:

"We implemented the policy with an MLP with two hidden layers, with 256 and 128 units each and tanh nonlinearity (Fig. 5). We found that the nonlinearity has a strong effect on performance on the physical system. Performance of two trained policies with different activation functions can be very different in the real world even when they perform similarly in simulation. Our explanation is that unbounded activation functions, such as ReLU, can degrade performance on the real robot, since actions can have very high magnitude when the robot reaches states that were not visited during training. Bounded activation functions, such as tanh, yield less aggressive trajectories when subjected to disturbances"

Source: Learning Agile and Dynamic Motor Skills for Legged Robots, Hwangbo et al., about training their four-legged robot ANYmal.

Personally, I tried both activation functions in simulation and did not notice any practical difference in training time or performance. It might be because the networks used for robotics are small compared to those used for audio or text processing.
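As a toy illustration of the boundedness argument (my own sketch, not from the paper), evaluating both activations on pre-activation values far outside the training distribution shows why tanh yields bounded actions while relu does not:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Pre-activation values far larger than anything seen during training.
x = np.array([0.5, 5.0, 50.0])

print(np.tanh(x))  # approx [0.46, 1.0, 1.0] -> stays bounded in [-1, 1]
print(relu(x))     # [0.5, 5.0, 50.0]        -> grows without bound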

araffin commented 4 years ago

Is there any particular situation in which tanh is superior to, say, relu?

This comes from hyperparameter optimization; you can find a comparison here. As @charles-blouin mentioned, you can easily change the activation function yourself. Btw, tanh is the default for A2C, ACER, PPO and TRPO, but relu is the default for SAC, DDPG and TD3.
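For instance, switching SAC back to tanh goes through the same policy_kwargs mechanism (a hedged sketch; the Pendulum-v0 environment id is only an assumed example):

import tensorflow as tf
from stable_baselines import SAC

# SAC defaults to relu; act_fun overrides it with tanh.
model = SAC('MlpPolicy', 'Pendulum-v0', policy_kwargs=dict(act_fun=tf.nn.tanh), verbose=1)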

deep networks?

Most networks in RL are shallow (e.g. 2 fully connected layers in the continuous action setting), so it does not make much difference.

matthew-hsr commented 4 years ago

Thanks a lot!