Hi! I'm really enjoying your team's implementation and using it in my research.
Recently, I started studying variational inference to delve into entropy-regularized policy algorithms,
by taking the CS294 course!
In the lecture, as far as I understand, the reparameterization trick can be summarized as
z = mu_theta(x) + sigma_theta(x) * epsilon, with epsilon ~ N(0, I).
In the RL context, 'x' would stand for the state 's' and 'z' for the action 'a'.
So, what I learned from the lecture is that the stochasticity is now attributed to epsilon, not to the whole
stochastic parameterized policy network.
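To make that concrete, here is a minimal TF1 sketch of how I understand the trick (the function name and layer shapes are just for illustration, not the actual Spinning Up code):

```python
import tensorflow as tf

def reparam_sample(net, act_dim):
    """Sample z = mu_theta(x) + sigma_theta(x) * epsilon (my understanding)."""
    mu = tf.layers.dense(net, act_dim)                           # deterministic in theta
    log_std = tf.layers.dense(net, act_dim, activation=tf.tanh)  # deterministic in theta
    std = tf.exp(log_std)
    epsilon = tf.random_normal(tf.shape(mu))  # the only source of randomness
    return mu + std * epsilon                 # gradients flow through mu and std
```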
My confusion arises from the description (the passage I highlighted) at
https://spinningup.openai.com/en/latest/algorithms/sac.html
In PPO, the stddev is parameterized as

```python
log_std = tf.get_variable(name='log_std', initializer=-0.5*np.ones(act_dim, dtype=np.float32))
```

and in SAC, it's

```python
log_std = tf.layers.dense(net, act_dim, activation=tf.tanh)
```
I think both policies' stddev is parameterized; the only difference is whether it depends on the
state or not, as described above.
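Side by side, the difference I mean looks like this (the obs/act dimensions and layer sizes here are hypothetical):

```python
import numpy as np
import tensorflow as tf

obs_dim, act_dim = 11, 6  # hypothetical dimensions

obs = tf.placeholder(tf.float32, shape=(None, obs_dim))
net = tf.layers.dense(obs, 64, activation=tf.nn.relu)

# PPO-style: a single trainable log_std vector, shared across all states.
log_std_ppo = tf.get_variable(name='log_std',
                              initializer=-0.5*np.ones(act_dim, dtype=np.float32))

# SAC-style: log_std is an output head of the policy network,
# so it changes with the state fed into `net`.
log_std_sac = tf.layers.dense(net, act_dim, activation=tf.tanh)
```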
Then the claim from the CS294 lecture, namely that
the REINFORCE PG method has high variance due to the stochasticity of the policy
while the reparameterized policy in SAC has lower variance,
is not clear to me, since I think both have a parameterized stddev (though PPO's is independent of the state).
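To pin down where my confusion sits, here is how I currently understand the two gradient estimators (a rough sketch; `a`, `mu`, `log_std`, `adv`, `obs`, and `q_value` are placeholder names, not the actual Spinning Up code):

```python
import numpy as np
import tensorflow as tf

def gaussian_likelihood(x, mu, log_std):
    # log pi(x | mu, std) for a diagonal Gaussian
    pre_sum = -0.5 * (((x - mu) / (tf.exp(log_std) + 1e-8))**2
                      + 2*log_std + np.log(2*np.pi))
    return tf.reduce_sum(pre_sum, axis=1)

# Score-function (REINFORCE-style) estimator: the sampled action `a` is
# treated as a constant; the gradient flows only through log pi(a|s).
logp = gaussian_likelihood(a, mu, log_std)
pg_loss = -tf.reduce_mean(tf.stop_gradient(adv) * logp)

# Reparameterized (pathwise) estimator, as I understand SAC: the gradient
# flows through the sampled action itself into the Q-function.
pi = mu + tf.exp(log_std) * tf.random_normal(tf.shape(mu))
sac_loss = -tf.reduce_mean(q_value(obs, pi))  # q_value: hypothetical critic
```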
Can you help clear up my confusion?
Thanks in advance, and best of luck with your team's research!
@mch5048 Hi, I also feel confused reading the description that you highlighted here. Did you solve the problem?