keiohta / tf2rl

TensorFlow2 Reinforcement Learning
MIT License

Implement PPO #17

Open keiohta opened 5 years ago

keiohta commented 5 years ago

Proximal Policy Optimization Algorithms
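
For context, the core of that paper is the clipped surrogate objective. A minimal TensorFlow 2 sketch of that loss (an illustration only, not the code this repository ships) looks roughly like:

    import tensorflow as tf

    def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
        # Illustrative sketch, not tf2rl's implementation.
        # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), from log-probs
        ratio = tf.exp(logp_new - logp_old)
        # Clip the ratio to [1 - eps, 1 + eps] and take the pessimistic bound
        clipped = tf.clip_by_value(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
        objective = tf.minimum(ratio * advantages, clipped * advantages)
        # Negate because the paper maximizes the objective while optimizers minimize
        return -tf.reduce_mean(objective)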

keiohta commented 5 years ago

Scores on MuJoCo

| Task | tf2rl score | paper score |
| --- | --- | --- |
| HalfCheetah-v2 | 4000 | 2000 |
| Hopper-v2 | 1500 | 2200 |
| InvertedDoublePendulum-v2 | 9360 | 8000 |
| InvertedPendulum-v2 | 1000 | 1000 |
| Reacher-v2 | -5 | ~-5 |
| Swimmer-v2 | 40 | 120 |
| Walker2d-v2 | 2000 | 3000 |
| Ant-v2 | 0 | - |
keiohta commented 5 years ago

Implementation is done and PPO is supported in versions >0.1.2, but it has not been tested on Atari yet. So, this issue will be closed after checking the score on Atari.

benquick123 commented 4 years ago

I believe that running examples/run_ppo.py doesn't converge with this implementation. Or am I missing something?

keiohta commented 4 years ago

Hi @benquick123, thanks for your comment. I checked the results, and yeah, you are right.

It seems the problem is the hyper-parameters. The hyper-parameters in run_ppo.py are tuned to reproduce the MuJoCo experiments from the original paper, and tweaking the discount factor from 0.99 to 0.9 makes the algorithm work here.
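
As a rough intuition for why the smaller discount can help on this task (my own note, not part of the original comment): the effective horizon of a discount factor gamma is roughly 1 / (1 - gamma), so 0.9 spreads credit over about 10 steps instead of about 100.

    # effective horizon is roughly 1 / (1 - gamma)
    for gamma in (0.99, 0.9):
        print(gamma, round(1.0 / (1.0 - gamma)))  # 0.99 -> ~100 steps, 0.9 -> ~10 steps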

You can see that the algorithm works in the figures below (the training and test returns converge to near zero), or you can reproduce the results with the following commands.

[Figure: 191127_ppo_results_tensorboard — TensorBoard training/test return curves]

$ git diff
diff --git a/examples/run_ppo.py b/examples/run_ppo.py
index f838f4d..27fe452 100644
--- a/examples/run_ppo.py
+++ b/examples/run_ppo.py
@@ -34,7 +34,7 @@ if __name__ == '__main__':
         n_epoch_critic=10,
         lr_actor=3e-4,
         lr_critic=3e-4,
-        discount=0.99,
+        discount=0.9,
         lam=0.95,
         horizon=args.horizon,
         normalize_adv=args.normalize_adv,

$ python examples/run_ppo.py
$ python examples/run_ppo.py --enable-gae --dir-suffix gae
$ python examples/run_ppo.py --normalize-adv --dir-suffix adv
$ python examples/run_ppo.py --enable-gae --normalize-adv --dir-suffix adv_gae
$ tensorboard --logdir results
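
For anyone wondering what the two options change: judging from the names and the lam=0.95 argument above, --enable-gae should switch advantage estimation to GAE(lambda) and --normalize-adv should standardize the advantages per batch. A minimal NumPy sketch of both ideas as I understand them (not the code tf2rl actually runs):

    import numpy as np

    def gae_advantages(rewards, values, last_value, dones, gamma=0.9, lam=0.95):
        # Generalized Advantage Estimation: exponentially weighted sum of TD errors
        values = np.append(values, last_value)
        adv = np.zeros_like(rewards, dtype=np.float64)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            nonterminal = 1.0 - dones[t]
            delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
            gae = delta + gamma * lam * nonterminal * gae
            adv[t] = gae
        return adv

    def normalize(adv, eps=1e-8):
        # Standardize advantages to zero mean and unit variance per batch
        return (adv - adv.mean()) / (adv.std() + eps)
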
keiohta commented 4 years ago

Sorry, I did not show which line corresponds to which method. Please check the following figure to see the difference between the methods (there is actually no big difference, though).

[Figure: 191127_ppo_label — same curves as above, with per-method labels]

janbolle commented 4 years ago

Thank you very much for this nice implementation of PPO!

Also, changing the following lines helps the agent to actually learn:

        lr_actor=3e-4,
        lr_critic=1e-3,
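
A plausible reason two learning rates exist at all: the actor and critic networks are usually updated by separate optimizers, roughly as in this generic TF2 sketch (my own illustration, not tf2rl's code):

    import tensorflow as tf

    # Hypothetical actor/critic setup; only the optimizer split matters here
    actor_optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)
    critic_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

    # During an update step, each optimizer touches only its own network's weights:
    #   actor_optimizer.apply_gradients(zip(actor_grads, actor.trainable_variables))
    #   critic_optimizer.apply_gradients(zip(critic_grads, critic.trainable_variables))
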
keiohta commented 4 years ago

Hi @janbolle, thank you for your suggestion!

Most hyperparameters in my implementation are based on the original paper, so you can sometimes get a higher score by searching for better ones on your own :)
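
If you do want to search systematically, one simple option is to sweep the options that examples/run_ppo.py already exposes in this thread and tag each run with --dir-suffix; anything else (the discount, the learning rates) still has to be edited inside the script, as in the diff above. A rough sketch:

    import subprocess

    # Sweep the command-line options mentioned in this thread, tagging each run
    # with --dir-suffix so the TensorBoard curves stay distinguishable.
    for enable_gae in (False, True):
        for normalize_adv in (False, True):
            cmd = ["python", "examples/run_ppo.py",
                   "--dir-suffix", "gae{}_adv{}".format(int(enable_gae), int(normalize_adv))]
            if enable_gae:
                cmd.append("--enable-gae")
            if normalize_adv:
                cmd.append("--normalize-adv")
            subprocess.run(cmd, check=True)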

keiohta commented 4 years ago

The reproduction results above are not correct, because the number of steps is not the same as in the paper (Deep Reinforcement Learning that Matters).