Hi! I'm really enjoying your team's implementation and using it in my research.
Recently, I started studying variational inference to delve into entropy-regularized policy algorithms,
by taking the CS294 course!
In the lecture, as far as I understand, the reparameterization trick can be summarized as
z = mu_theta(x) + sigma_theta(x) * epsilon, with epsilon ~ N(0, I).
In the RL context, 'x' would stand for the state 's' and 'z' for the action 'a'.
So, what I learned from the lecture is that the stochasticity is now attributed to epsilon, not to the whole
stochastic parameterized policy network.
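To make that concrete, here is a minimal TF1 sketch of how I understand the trick (the function name and layer shapes are just for illustration, not the actual Spinning Up code):

```python
import tensorflow as tf

def reparam_sample(net, act_dim):
    """Sample z = mu_theta(x) + sigma_theta(x) * epsilon (my understanding)."""
    mu = tf.layers.dense(net, act_dim)                           # deterministic in theta
    log_std = tf.layers.dense(net, act_dim, activation=tf.tanh)  # deterministic in theta
    std = tf.exp(log_std)
    epsilon = tf.random_normal(tf.shape(mu))  # the only source of randomness
    return mu + std * epsilon                 # gradients flow through mu and std
```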
My confusion arises from the description (the passage I highlighted) at
https://spinningup.openai.com/en/latest/algorithms/sac.html
In PPO, the stddev is parameterized as

```python
log_std = tf.get_variable(name='log_std', initializer=-0.5*np.ones(act_dim, dtype=np.float32))
```

and in SAC, it's

```python
log_std = tf.layers.dense(net, act_dim, activation=tf.tanh)
```
I think both policies' stddev is parameterized; the only difference is whether it depends on the
state or not, as described above.
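Side by side, the difference I mean looks like this (the obs/act dimensions and layer sizes here are hypothetical):

```python
import numpy as np
import tensorflow as tf

obs_dim, act_dim = 11, 6  # hypothetical dimensions

obs = tf.placeholder(tf.float32, shape=(None, obs_dim))
net = tf.layers.dense(obs, 64, activation=tf.nn.relu)

# PPO-style: a single trainable log_std vector, shared across all states.
log_std_ppo = tf.get_variable(name='log_std',
                              initializer=-0.5*np.ones(act_dim, dtype=np.float32))

# SAC-style: log_std is an output head of the policy network,
# so it changes with the state fed into `net`.
log_std_sac = tf.layers.dense(net, act_dim, activation=tf.tanh)
```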
Then the claim from the CS294 lecture, namely that
the REINFORCE PG method has high variance due to the stochasticity of the policy
while the reparameterized policy in SAC has lower variance,
is not clear to me, since I think both have a parameterized stddev (though PPO's is independent of the state).
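To pin down where my confusion sits, here is how I currently understand the two gradient estimators (a rough sketch; `a`, `mu`, `log_std`, `adv`, `obs`, and `q_value` are placeholder names, not the actual Spinning Up code):

```python
import numpy as np
import tensorflow as tf

def gaussian_likelihood(x, mu, log_std):
    # log pi(x | mu, std) for a diagonal Gaussian
    pre_sum = -0.5 * (((x - mu) / (tf.exp(log_std) + 1e-8))**2
                      + 2*log_std + np.log(2*np.pi))
    return tf.reduce_sum(pre_sum, axis=1)

# Score-function (REINFORCE-style) estimator: the sampled action `a` is
# treated as a constant; the gradient flows only through log pi(a|s).
logp = gaussian_likelihood(a, mu, log_std)
pg_loss = -tf.reduce_mean(tf.stop_gradient(adv) * logp)

# Reparameterized (pathwise) estimator, as I understand SAC: the gradient
# flows through the sampled action itself into the Q-function.
pi = mu + tf.exp(log_std) * tf.random_normal(tf.shape(mu))
sac_loss = -tf.reduce_mean(q_value(obs, pi))  # q_value: hypothetical critic
```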
Can you help clear up my confusion?
Thanks in advance, and best of luck with your team's research!
@mch5048 Hi, I also feel confused reading the description that you highlighted here. Did you solve the problem?