openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

Comparison between GaussianPolicy in PPO and SAC #107

Closed mch5048 closed 5 years ago

mch5048 commented 5 years ago

Hi! I'm really enjoying and making good use of your team's implementation for my research.

Recently, I started studying variational inference to dig into entropy-regularized policy algorithms by taking the CS294 course!

[image: CS294 lecture slide summarizing the reparameterization trick]

In the lecture, the reparameterization trick is recapitulated as above, as far as I understand.

In the RL context, 'x' would stand for the state 's' and 'z' for the action 'a'.

So what I learned from the lecture is that the stochasticity is now attributed to epsilon, not to the whole stochastic parameterized policy network.
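
Just to check my understanding, here is a minimal numpy sketch of the trick as I read the slide (`mu_net` and `log_std_net` are hypothetical stand-ins for parameterized networks, not code from the repo):

```python
import numpy as np

# Hypothetical stand-ins for parameterized networks mu_theta(s) and log_std_theta(s).
def mu_net(s):
    return 0.1 * s

def log_std_net(s):
    return -0.5 * np.ones_like(s)

s = np.array([1.0, 2.0])                        # state ('x' in the slide)
eps = np.random.randn(*s.shape)                 # all randomness lives here: eps ~ N(0, I)
a = mu_net(s) + np.exp(log_std_net(s)) * eps    # action ('z'): a deterministic function of (s, eps, theta)
```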

My confusion arises from the description in

https://spinningup.openai.com/en/latest/algorithms/sac.html

[image: screenshot of the relevant passage from the SAC page]

In PPO, the stddev is parameterized as

`log_std = tf.get_variable(name='log_std', initializer=-0.5*np.ones(act_dim, dtype=np.float32))`

and in SAC, it's

`log_std = tf.layers.dense(net, act_dim, activation=tf.tanh)`.

I think both policies' stddev is parameterized; the only difference is whether it depends on the state, as described above.
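
To make the comparison concrete, here is a rough TF1-style sketch of the two parameterizations side by side (the sizes, the hidden layer, and the variable names are just my own illustration, not copied from the repo); in both cases the sample itself has the form mu + std * eps:

```python
import numpy as np
import tensorflow as tf   # TF1-style, as in the snippets above

obs_dim, act_dim = 3, 2                                    # arbitrary sizes for illustration
x_ph = tf.placeholder(tf.float32, shape=(None, obs_dim))   # state input
net = tf.layers.dense(x_ph, 64, activation=tf.tanh)        # shared hidden layer
mu = tf.layers.dense(net, act_dim)                         # mean head

# PPO-style: log_std is a free variable, independent of the state
log_std_ppo = tf.get_variable(name='log_std',
                              initializer=-0.5*np.ones(act_dim, dtype=np.float32))

# SAC-style: log_std is a network output, i.e. a function of the state
log_std_sac = tf.layers.dense(net, act_dim, activation=tf.tanh)

# Either way, sampling an action looks the same: mu + std * eps
eps = tf.random_normal(tf.shape(mu))
pi_ppo = mu + tf.exp(log_std_ppo) * eps
pi_sac = mu + tf.exp(log_std_sac) * eps
```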

Then, as described in the CS294 lecture,

[image: CS294 lecture slide comparing the two policy gradient estimators]

the statement that

> the REINFORCE PG method is high-variance due to the stochasticity of the policy, while the reparameterized policy in SAC has lower variance

is not clear to me, since I think both have a parameterized stddev (though PPO's is independent of the state).
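
To make my confusion concrete: if I understand correctly, the variance claim is about the gradient estimator rather than about how the stddev is parameterized. Here is a toy numpy comparison of the two estimators of dJ/dmu for J = E_{a ~ N(mu, sigma)}[f(a)], with an arbitrary f of my own choosing (not from the docs or the lecture):

```python
import numpy as np

# Toy check: J(mu) = E_{a ~ N(mu, sigma)}[f(a)], estimate dJ/dmu two ways.
f  = lambda a: a**2     # arbitrary "return"
df = lambda a: 2*a      # its derivative w.r.t. the action

mu, sigma, n = 1.0, 0.5, 100000
eps = np.random.randn(n)
a = mu + sigma * eps

# Score-function (REINFORCE-style) estimator: f(a) * d log N(a; mu, sigma) / d mu
g_score = f(a) * (a - mu) / sigma**2

# Reparameterized estimator: d f(mu + sigma*eps) / d mu = f'(a)
g_reparam = df(a)

# Both are unbiased for dJ/dmu = 2*mu, but their sample variances differ a lot.
print("score-function : mean %.3f, var %.3f" % (g_score.mean(), g_score.var()))
print("reparameterized: mean %.3f, var %.3f" % (g_reparam.mean(), g_reparam.var()))
```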

Could you help me clear up my confusion?

Thanks in advance, and best of luck with your team's research!

xiaojingli commented 4 years ago

@mch5048 Hi, I was also confused by the description you highlighted here. Did you manage to solve it?