garymcintire opened this issue 5 years ago
Are you sure all the hyperparameters match?
@garymcintire Comparing your call with the table in the appendix of the Schulman et al. paper, I can see that the hyperparameters don't quite match. For instance, the number of actors in your call is 10, whereas in the paper it should be 32 (--num_env=32); the number of steps per actor (horizon) is 512 in the paper, so nsteps should be 512*32=16384; the minibatch size in the paper is 4096, so nminibatches should be 16384 / 4096 = 4 (but that is the default anyway); and the number of epochs in the paper is 15 (--noptepochs=15). In the current implementation of ppo2 it is not possible to linearly anneal the policy variance as in the paper, but I am not convinced that will decrease performance dramatically.
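Concretely, something like python -u -m baselines.run --alg=ppo2 --env=RoboschoolHumanoid-v1 --network=mlp --num_timesteps=5e7 --num_env=32 --noptepochs=15 would bring the actor count and epoch count in line with the paper (baselines.run forwards unrecognized --key=value arguments to ppo2.learn as keyword arguments); how nsteps should be set, and the variance annealing, are picked up further down in this thread.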
That being said, this is not the first time a running humanoid with PPO2 has been requested, so I'll work on it and post the results here.
Hi @pzhokhov, have you finished the Roboschool experiment? I have used exactly the hyperparameters from the paper and managed to get the mean reward to 3000. I modified the code to use an adaptive learning rate based on KL divergence. I also think that logstd=zeros is not a big problem. Do you have any suggestions? I also tried RoboschoolHumanoidFlagrunHarder-v1 for 100M timesteps; the reward is around 1200 with pretty high variance (the maximum reward is close to 2000). Learning starts out pretty fast but slows down from 30M timesteps onward. Let me know if you have successfully replicated the results. Thanks.
Hi @doviettung96! No, unfortunately it is still on my todo list :/ My understanding is that logstd=zeros may cripple learning, especially towards the end: the action range is [-1, 1] in the Roboschool humanoids, and logstd=0 means the action distribution has std=1 (which is why in the PPO paper it is annealed to -1.6). That being said, experiments take precedence over gut feelings - if yours shows that logstd=0 works just as well, then so be it :)
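Just to put numbers on that (std = exp(logstd)):

import numpy as np
# logstd = 0 gives std = 1.0, as wide as the whole [-1, 1] action range;
# the paper's final value of -1.6 gives std = exp(-1.6), roughly 0.20, i.e. much tighter actions.
print(np.exp(0.0), np.exp(-1.6))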
Hi @pzhokhov, from my experiments, excluding entropy from the cost function, using an adaptive learning rate, and using VecNormalize increase performance vastly. From that point on, however, I get stuck; there is still a gap of about 1000 in mean reward to close. I might try annealing the logstd even though logstd=0 sounds fine, since maybe that is what makes my learning slow from 30M timesteps onward. Do you have any idea how to change the logstd? Or should I just modify the algorithm? Thanks.
On the subject of logstd annealing - unless modified, logstd is not fixed at zeros in ppo2; it is a learnable parameter initialized at zeros. Moreover, ppo2 reports the policy entropy (which, for a Gaussian policy, is the sum of the logstds plus a constant term) - I think it could be handy to look at that and see how it changes during training; maybe it already goes down. If you decide to include annealing after all, I don't think there is a way to do that without code modifications, but those could be relatively minor - something along the lines of
with tf.variable_scope('ppo2_model/pi', reuse=True):  # reuse to fetch the existing variable
    self.logstd = tf.get_variable('logstd')
in the model constructor, and then
model.sess.run(tf.assign(model.logstd, <current_logstd_value>))
in the main PPO loop.
Let me know if you'd like me to elaborate more on that
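Putting those two pieces together, a minimal sketch of an annealing loop could look like the following (this assumes the TF1-based ppo2, the default 'ppo2_model/pi/logstd' variable name, a final logstd of -1.6 as in the paper, and that model and nupdates come from the existing learn loop):

import numpy as np
import tensorflow as tf

# Grab the existing logstd variable once, outside the loop
logstd_var = tf.trainable_variables('ppo2_model/pi/logstd')[0]
logstd_ph = tf.placeholder(tf.float32, shape=logstd_var.shape)
set_logstd = tf.assign(logstd_var, logstd_ph)  # build the assign op once, not on every update

final_logstd = -1.6  # value the paper anneals to
for update in range(1, nupdates + 1):
    frac = update / nupdates
    current_logstd = frac * final_logstd  # linear anneal from 0 down to -1.6
    model.sess.run(set_logstd, feed_dict={
        logstd_ph: np.full(logstd_var.shape.as_list(), current_logstd, dtype=np.float32)})
    # ... the usual ppo2 rollout collection and optimization for this update ...

Note that logstd stays in the trainable variables, so the optimizer will still nudge it between assignments; removing it from the trainable list would be a further modification.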
About entropy: because I have excluded it from the loss function, the entropy goes down, so in that sense the logstd must have gone down as well. I will try that approach. One more question: with the change of the model's logstd value in the main PPO loop, would that also change the one in the Distribution, which is exactly what we want to change? I guess the answer is yes; I am just not familiar with changing a variable this way. Thank you.
Yes, they should both be pointing at the same TensorFlow variable.
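If you want to convince yourself, here is a standalone TF1 sketch (17 is just a hypothetical action dimension):

import tensorflow as tf

with tf.variable_scope('ppo2_model/pi'):
    logstd = tf.get_variable('logstd', initializer=tf.zeros([1, 17]))  # as the policy creates it

with tf.variable_scope('ppo2_model/pi', reuse=True):
    logstd_again = tf.get_variable('logstd')  # a second handle to the same underlying variable

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.assign(logstd, tf.fill([1, 17], -1.6)))
    print(sess.run(logstd_again))  # prints -1.6 everywhere: both handles see the assignment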
@pzhokhov Actually, nsteps should be 512, because it is the number of timesteps each vectorized environment collects per iteration. For everyone else: DO stick to the paper and you will definitely get a high reward. For details not in the paper:
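Roughly, the batch bookkeeping works out like this (a sketch assuming ppo2's usual per-environment nsteps convention):

# Batch bookkeeping in ppo2: nsteps counts timesteps per environment, not in total.
num_env = 32                        # parallel actors, as in the paper
nsteps = 512                        # horizon per actor per update
batch_size = num_env * nsteps       # 16384 transitions collected per update
nminibatches = batch_size // 4096   # = 4, matching the paper's minibatch size of 4096
print(batch_size, nminibatches)     # 16384 4

With those values, the corresponding flags would be --num_env=32 --nsteps=512 --nminibatches=4 --noptepochs=15 on top of the usual run command.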
I'm new to baselines ppo2, but I cannot seem to get anything close to the paper's results. Can you help me?
I installed as described with pip install -e . with no problems.
As described, I execute: python -u -m baselines.run --alg=ppo2 --env=RoboschoolHumanoid-v1 --network=mlp --num_timesteps=2e9 --num_env 10 # no problems
Overnight I'm at 143 million timesteps and eprewmean is only around 140.
| approxkl           | 0.021461453  |
| clipfrac           | 0.29248047   |
| eplenmean          | 106          |
| eprewmean          | 140          |
| explained_variance | 0.942        |
| fps                | 2287         |
| nupdates           | 6990         |
| policy_entropy     | -26.85582    |
| policy_loss        | 0.0070732697 |
| serial_timesteps   | 14315520     |
| time_elapsed       | 6.3e+04      |
| total_timesteps    | 143155200    |
| value_loss         | 35.016712    |
The Schulman et al. paper (https://arxiv.org/pdf/1707.06347.pdf) claims over 4000 reward within 50M timesteps on RoboschoolHumanoid-v0 (see the chart toward the end) using PPO. Granted, I am using RoboschoolHumanoid-v1 and they used v0, but I'm at 143M timesteps and v1 isn't even taking steps yet.
I have used another PPO implementation to get v1 into the low 2000s of episode reward within this many timesteps.
Do others see this issue? What am I doing wrong? Any help is appreciated.