Open jaanli opened 6 years ago
I think this is intended (see the PPO paper); only the name of the parameter is suboptimal. The variance is fixed in the sense that it does not depend on the observations/states.
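To make the distinction concrete, here is a minimal NumPy sketch (illustrative only, not the baselines implementation) of a diagonal Gaussian policy whose log-std is a standalone parameter: it is state-independent ("fixed" in the sense above), yet nothing prevents a training step from updating it, which is what `trainable=True` in `tf.get_variable` amounts to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative diagonal Gaussian policy (names are made up for this sketch).
# The mean is a function of the observation; log_std is a free parameter
# vector that does NOT depend on the state -- that is the sense in which
# the variance is "fixed".
obs_dim, act_dim = 4, 2
W = rng.normal(size=(act_dim, obs_dim))   # mean network (one linear layer here)
log_std = np.zeros(act_dim)               # state-independent, but still a parameter

def sample_action(obs, log_std):
    mean = W @ obs
    std = np.exp(log_std)
    return mean + std * rng.normal(size=act_dim)

obs_a = rng.normal(size=obs_dim)
obs_b = rng.normal(size=obs_dim)

# The std used for sampling is identical for every state...
std_a = np.exp(log_std)
std_b = np.exp(log_std)
assert np.allclose(std_a, std_b)

# ...but a gradient step can still change it, because the parameter is
# trainable. (A real optimizer update is stood in for by a constant step.)
log_std = log_std - 0.01 * np.ones(act_dim)
```

Passing `trainable=False` to `tf.get_variable` would keep `log_std` at its initial value, which is what the name `gaussian_fixed_var` suggests.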
Interesting, thanks!
I looked at the paper and was getting confused.
The paper states that they use an entropy penalty, but this appears not to be the case for continuous control tasks (https://github.com/openai/baselines/blob/bb403781182c6e31d3bf5de16f42b0cb0d8421f7/baselines/ppo1/run_mujoco.py#L21).
To test the entropy penalty, we tried a nonzero coefficient with gaussian_fixed_var={True, False} in MuJoCo environments. Both settings led to unstable learning and NaNs in the gradients.
I didn't try to replicate the PPO paper myself (but see the "RL that matters" paper and its GitHub repo). I did use this code for a different project and virtually never encountered NaNs. Keep in mind also that this repo initially had some bugs that have been corrected in recent months. For entropy you need very small coefficients, like 0.001.
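One way to see why the coefficient must be small: the entropy of a diagonal Gaussian grows linearly with the log-stds and the action dimension, so even a moderate coefficient makes the bonus compete with the surrogate objective. A quick sketch (the 0.001 value is from the comment above; the action dimension and unit std are illustrative assumptions):

```python
import numpy as np

# Entropy of a diagonal Gaussian: H = sum_i (log_std_i + 0.5 * log(2*pi*e)).
def diag_gaussian_entropy(log_std):
    return float(np.sum(log_std + 0.5 * np.log(2.0 * np.pi * np.e)))

log_std = np.zeros(6)  # e.g. a 6-dimensional action space with unit std
H = diag_gaussian_entropy(log_std)      # about 1.42 nats per dimension here

# With entcoeff = 0.001 the bonus added to the loss is tiny relative to
# typical advantage magnitudes, nudging exploration without dominating.
bonus = 0.001 * H
```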
Thanks for the info @wjaskowski! Did you use the entropy coefficient of 0.001 in MuJoCo environments?
I didn't use MuJoCo at all. For the problem I was solving, entropy was not necessary (although I experimented with some values, so I know what a reasonable range for this coefficient is).
In ppo1/mlp_policy.py: when gaussian_fixed_var=True (supposedly to make the Gaussian policy have fixed variance), tf.get_variable is called to create the logstd of the Gaussian. However, tf.get_variable has a default argument, trainable=True, which means the variance is learned.
https://github.com/openai/baselines/blob/bb403781182c6e31d3bf5de16f42b0cb0d8421f7/baselines/ppo1/mlp_policy.py#L34