keiohta opened this issue 5 years ago
Reproduction commands (environments: tf2rl uses -v2, the paper uses -v1):

```
$ git checkout 55494fe38e8db2e2b9f68add4783495be241292d
$ python examples/run_ppo.py --enable-gae --normalize-adv --env-name HalfCheetah-v2 --dir-suffix HalfCheetah
$ python examples/run_ppo.py --enable-gae --normalize-adv --env-name Ant-v2 --dir-suffix Ant
$ python examples/run_ppo.py --enable-gae --normalize-adv --env-name Hopper-v2 --dir-suffix Hopper
$ python examples/run_ppo.py --enable-gae --normalize-adv --env-name InvertedPendulum-v2 --dir-suffix InvertedPendulum
$ python examples/run_ppo.py --enable-gae --normalize-adv --env-name InvertedDoublePendulum-v2 --dir-suffix InvertedDoublePendulum
$ python examples/run_ppo.py --enable-gae --normalize-adv --env-name Swimmer-v2 --dir-suffix Swimmer
$ python examples/run_ppo.py --enable-gae --normalize-adv --env-name Reacher-v2 --dir-suffix Reacher
$ python examples/run_ppo.py --enable-gae --normalize-adv --env-name Walker2d-v2 --dir-suffix Walker2d
```
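For context, `--enable-gae` enables generalized advantage estimation and `--normalize-adv` standardizes the advantages before the policy update. Below is a minimal NumPy sketch of what those two steps typically compute, using the gamma=0.99 and lam=0.95 values from the script; it is an illustration, not tf2rl's actual code:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout (illustrative sketch only)."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    next_value = last_value  # bootstrap value for the state after the rollout
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]  # cut bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

def normalize_advantages(adv, eps=1e-8):
    """Roughly what --normalize-adv corresponds to: zero-mean, unit-variance advantages per batch."""
    return (adv - adv.mean()) / (adv.std() + eps)
```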
Task | tf2rl score | paper score |
---|---|---|
HalfCheetah-v2 | 4000 | 2000 |
Hopper-v2 | 1500 | 2200 |
InvertedDoublePendulum-v2 | 9360 | 8000 |
InvertedPendulum-v2 | 1000 | 1000 |
Reacher-v2 | -5 | ~-5 |
Swimmer-v2 | 40 | 120 |
Walker2d-v2 | 2000 | 3000 |
Ant-v2 | 0 | - |
The implementation is done and PPO is supported in versions >0.1.2, but it has not been tested on Atari. So, this issue will be closed after checking the score on Atari.
I believe that running examples/run_ppo.py with this implementation doesn't converge. Or am I missing something?
Hi @benquick123, thanks for your comment. I checked the results, and yeah, you are right.
It seems the problem is the hyper-parameters. The hyper-parameters in run_ppo.py are tuned to reproduce the MuJoCo experiments of the original paper, and tweaking the discount factor from 0.99 to 0.9 makes the algorithm work.
You can see that the algorithm works in the figures below (the training and test returns converge to near zero), or you can reproduce the results with the following diff and commands.
```diff
$ git diff
diff --git a/examples/run_ppo.py b/examples/run_ppo.py
index f838f4d..27fe452 100644
--- a/examples/run_ppo.py
+++ b/examples/run_ppo.py
@@ -34,7 +34,7 @@ if __name__ == '__main__':
         n_epoch_critic=10,
         lr_actor=3e-4,
         lr_critic=3e-4,
-        discount=0.99,
+        discount=0.9,
         lam=0.95,
         horizon=args.horizon,
         normalize_adv=args.normalize_adv,
```
```
$ python examples/run_ppo.py
$ python examples/run_ppo.py --enable-gae --dir-suffix gae
$ python examples/run_ppo.py --normalize-adv --dir-suffix adv
$ python examples/run_ppo.py --enable-gae --normalize-adv --dir-suffix adv_gae
$ tensorboard --logdir results
```
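As a rule of thumb for why the discount change matters: the effective horizon of the discounted return is roughly 1/(1 - gamma), so gamma=0.99 credits rewards about 100 steps ahead while gamma=0.9 only about 10, which can be easier to fit on short tasks. A quick check of that rule of thumb (my own illustration, not part of run_ppo.py):

```python
# Effective horizon ~ 1 / (1 - gamma): roughly how many future steps the return still "sees".
for gamma in (0.99, 0.9):
    print(f"gamma={gamma}: effective horizon ~ {1.0 / (1.0 - gamma):.0f} steps")
# gamma=0.99 -> ~100 steps, gamma=0.9 -> ~10 steps
```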
Sorry, I did not show which line corresponds to which method. Please check the following figure to see the difference between the methods (actually, there is no big difference).
Thank you very much for this nice implementation of PPO!
Also, changing the following lines helps it to actually learn:

```
lr_actor=3e-4,
lr_critic=1e-3,
```
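For reference, those two values are the separate actor and critic learning rates shown in the diff above, so the change simply gives the value function a larger step size than the policy. A minimal sketch of such a two-optimizer setup (illustrative only, not tf2rl's internal code):

```python
import tensorflow as tf

# Separate step sizes: the critic can often use a larger learning rate than the
# actor, whose updates are already constrained by the PPO clipping objective.
actor_optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)   # lr_actor
critic_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # lr_critic
```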
Hi @janbolle, thank you for your suggestion!
Most hyperparameters of my implementation are based on the original paper, so sometimes you can get a higher score by searching for them on your own :)
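If you want to search them systematically, a small driver script around examples/run_ppo.py is enough. The sketch below only reuses the flags already shown in this thread (--enable-gae, --normalize-adv, --env-name, --dir-suffix); the loop itself is my own glue code under those assumptions, not a tf2rl utility:

```python
import itertools
import subprocess

# Sweep the two flags discussed in this thread over a few environments.
envs = ["HalfCheetah-v2", "Hopper-v2", "Walker2d-v2"]
for env, enable_gae, normalize_adv in itertools.product(envs, [False, True], [False, True]):
    cmd = ["python", "examples/run_ppo.py", "--env-name", env]
    suffix = env
    if enable_gae:
        cmd.append("--enable-gae")
        suffix += "_gae"
    if normalize_adv:
        cmd.append("--normalize-adv")
        suffix += "_adv"
    cmd += ["--dir-suffix", suffix]  # keep each run's results in its own directory
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```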
The reproduction results above are not directly comparable because the number of training steps is not the same as in the papers (Deep Reinforcement Learning that Matters; Proximal Policy Optimization Algorithms).