hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] SAC Policy Implementation #173

Closed pvarin closed 5 years ago

pvarin commented 5 years ago

The original SAC paper describes the actor as

  1. Sample noise, eps, from some distribution (e.g. a unit Gaussian, as in soft Q-learning (SQL)): eps ~ N(0, I)
  2. Push the sample through the policy network: a = f(eps, s)

but it looks like the current implementation does something different

  1. Generate the mean from the policy network: mu = f(s)
  2. Sample the action from a normal centered at the mean: a ~ N(mu, sigma)

These are qualitatively different policies: the first can express bimodal distributions, whereas the second cannot. What's the motivation for implementing the second version?
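For illustration, the two sampling schemes above can be sketched in NumPy as follows. This is only a rough sketch, not the stable-baselines code: the layer sizes, the random weights, the function names `sample_implicit` and `sample_gaussian`, and the state-dependent sigma are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM, HIDDEN = 3, 2, 16

# Hypothetical random weights standing in for trained policy networks.
W1_implicit = rng.normal(size=(STATE_DIM + ACTION_DIM, HIDDEN))
W2_implicit = rng.normal(size=(HIDDEN, ACTION_DIM))
W1_gauss = rng.normal(size=(STATE_DIM, HIDDEN))
W_mu = rng.normal(size=(HIDDEN, ACTION_DIM))
W_log_std = rng.normal(size=(HIDDEN, ACTION_DIM)) * 0.1


def sample_implicit(state, rng):
    """Method 1: a = f(eps, s). The noise is an input to the network."""
    eps = rng.standard_normal(ACTION_DIM)
    hidden = np.tanh(np.concatenate([state, eps]) @ W1_implicit)
    return hidden @ W2_implicit


def sample_gaussian(state, rng):
    """Method 2: a ~ N(mu(s), sigma(s)). The noise is added after the last layer."""
    hidden = np.tanh(state @ W1_gauss)
    mu = hidden @ W_mu
    sigma = np.exp(hidden @ W_log_std)  # exp keeps sigma strictly positive
    eps = rng.standard_normal(ACTION_DIM)
    return mu + sigma * eps, mu, sigma


state = np.ones(STATE_DIM)
a_implicit = sample_implicit(state, rng)
a_gauss, mu, sigma = sample_gaussian(state, rng)
```

In method 1 the noise passes through the nonlinearity, so the resulting action distribution can take shapes other than a Gaussian. In method 2 the action is an affine function of the noise, so for a fixed state it is always a single Gaussian.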

pvarin commented 5 years ago

Okay, I think I answered this one for myself. Appendix C of the paper says they actually use the second method described above. I guess technically it's a special case of the first method, where noise is only injected after the last layer of the network. This also makes it easier to compute the action probability, which is needed to evaluate the policy loss function.
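To make the "easier to compute the action probability" point concrete: with the second method the action is mu + sigma * eps, so its log-probability is just the closed-form diagonal Gaussian log-density. The sketch below assumes a diagonal covariance and ignores any further transformation of the action; `gaussian_log_prob` is a hypothetical helper name, not a function from the library.

```python
import numpy as np


def gaussian_log_prob(action, mu, sigma):
    """Log-density of a diagonal Gaussian N(mu, diag(sigma^2)) at `action`."""
    z = (action - mu) / sigma  # recovers the noise eps that produced the action
    per_dim = -0.5 * z ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
    return float(np.sum(per_dim))


# Illustrative values only (assumed for the example).
mu = np.array([0.0, 1.0])
sigma = np.array([1.0, 2.0])
eps = np.array([0.5, -0.5])

action = mu + sigma * eps  # method 2: noise injected after the last layer
log_p = gaussian_log_prob(action, mu, sigma)
```

With method 1 there is no such formula in general, because the density of f(eps, s) depends on how the whole network transforms the noise.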

araffin commented 5 years ago

Hello, you raised an interesting point (I missed it when reading the paper; maybe it was added along with the two Q networks). The main reason is that this type of Gaussian policy was used in both the original and Spinning Up implementations, and it was the one that made more sense to me. However, the new softqlearning repo seems to use the first method. And yes, I was wondering: how do you compute the log-likelihood with method 1?

pvarin commented 5 years ago

Yeah, I think it would be difficult to get a log-likelihood with method 1. The SQL formulation uses "Stein variational gradient descent" and doesn't require computing the action probabilities.

araffin commented 5 years ago

Ok, thanks (I think I need to check the SQL paper then).