keiohta / tf2rl

TensorFlow2 Reinforcement Learning
MIT License

Implement PPO #17

Open keiohta opened 5 years ago

keiohta commented 5 years ago

Proximal Policy Optimization Algorithms
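
For context, the core of that paper is the clipped surrogate objective. A minimal TensorFlow 2 sketch of that loss (an illustration only, not the code this repository ships) looks roughly like:

    import tensorflow as tf

    def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
        # Illustrative sketch, not tf2rl's implementation.
        # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), from log-probs
        ratio = tf.exp(logp_new - logp_old)
        # Clip the ratio to [1 - eps, 1 + eps] and take the pessimistic bound
        clipped = tf.clip_by_value(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
        objective = tf.minimum(ratio * advantages, clipped * advantages)
        # Negate because the paper maximizes the objective while optimizers minimize
        return -tf.reduce_mean(objective)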

keiohta commented 5 years ago

Scores on MuJoCo

| Task | tf2rl score | paper score |
| --- | --- | --- |
| HalfCheetah-v2 | 4000 | 2000 |
| Hopper-v2 | 1500 | 2200 |
| InvertedDoublePendulum-v2 | 9360 | 8000 |
| InvertedPendulum-v2 | 1000 | 1000 |
| Reacher-v2 | -5 | ~-5 |
| Swimmer-v2 | 40 | 120 |
| Walker2d-v2 | 2000 | 3000 |
| Ant-v2 | 0 | - |
keiohta commented 5 years ago

Implementation is done and PPO is supported in versions >0.1.2, but it has not been tested on Atari yet. So, this issue will be closed after checking the score on Atari.

benquick123 commented 4 years ago

I believe that running examples/run_ppo.py doesn't converge with this implementation. Or am I missing something?

keiohta commented 4 years ago

Hi @benquick123, thanks for your comment. I checked the results, and yeah, you are right.

It seems the problem is the hyper-parameters. The hyper-parameters in run_ppo.py are tuned to reproduce the MuJoCo experiments from the original paper, and tweaking the discount factor from 0.99 to 0.9 makes the algorithm work here.
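
As a rough intuition for why the smaller discount can help on this task (my own note, not part of the original comment): the effective horizon of a discount factor gamma is roughly 1 / (1 - gamma), so 0.9 spreads credit over about 10 steps instead of about 100.

    # effective horizon is roughly 1 / (1 - gamma)
    for gamma in (0.99, 0.9):
        print(gamma, round(1.0 / (1.0 - gamma)))  # 0.99 -> ~100 steps, 0.9 -> ~10 steps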

You can see that the algorithm works in the figures below (the training and test returns converge to near zero), or you can reproduce the results with the following commands.

[Figure: 191127_ppo_results_tensorboard — TensorBoard training/test return curves]

$ git diff
diff --git a/examples/run_ppo.py b/examples/run_ppo.py
index f838f4d..27fe452 100644
--- a/examples/run_ppo.py
+++ b/examples/run_ppo.py
@@ -34,7 +34,7 @@ if __name__ == '__main__':
         n_epoch_critic=10,
         lr_actor=3e-4,
         lr_critic=3e-4,
-        discount=0.99,
+        discount=0.9,
         lam=0.95,
         horizon=args.horizon,
         normalize_adv=args.normalize_adv,

$ python examples/run_ppo.py
$ python examples/run_ppo.py --enable-gae --dir-suffix gae
$ python examples/run_ppo.py --normalize-adv --dir-suffix adv
$ python examples/run_ppo.py --enable-gae --normalize-adv --dir-suffix adv_gae
$ tensorboard --logdir results
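
For anyone wondering what the two options change: judging from the names and the lam=0.95 argument above, --enable-gae should switch advantage estimation to GAE(lambda) and --normalize-adv should standardize the advantages per batch. A minimal NumPy sketch of both ideas as I understand them (not the code tf2rl actually runs):

    import numpy as np

    def gae_advantages(rewards, values, last_value, dones, gamma=0.9, lam=0.95):
        # Generalized Advantage Estimation: exponentially weighted sum of TD errors
        values = np.append(values, last_value)
        adv = np.zeros_like(rewards, dtype=np.float64)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            nonterminal = 1.0 - dones[t]
            delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
            gae = delta + gamma * lam * nonterminal * gae
            adv[t] = gae
        return adv

    def normalize(adv, eps=1e-8):
        # Standardize advantages to zero mean and unit variance per batch
        return (adv - adv.mean()) / (adv.std() + eps)
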
keiohta commented 4 years ago

Sorry, I did not show which line corresponds to which method. Please check the following figure to see the difference between the methods (there is actually no big difference, though).

[Figure: 191127_ppo_label — same curves as above, with per-method labels]

janbolle commented 4 years ago

Thank you very much for this nice implementation of PPO!

Also, changing the following lines helps the agent to actually learn:

        lr_actor=3e-4,
        lr_critic=1e-3,
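
A plausible reason two learning rates exist at all: the actor and critic networks are usually updated by separate optimizers, roughly as in this generic TF2 sketch (my own illustration, not tf2rl's code):

    import tensorflow as tf

    # Hypothetical actor/critic setup; only the optimizer split matters here
    actor_optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)
    critic_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

    # During an update step, each optimizer touches only its own network's weights:
    #   actor_optimizer.apply_gradients(zip(actor_grads, actor.trainable_variables))
    #   critic_optimizer.apply_gradients(zip(critic_grads, critic.trainable_variables))
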
keiohta commented 4 years ago

Hi @janbolle, thank you for your suggestion!

Most hyperparameters in my implementation are based on the original paper, so you can sometimes get a higher score by searching for better ones on your own :)
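
If you do want to search systematically, one simple option is to sweep the options that examples/run_ppo.py already exposes in this thread and tag each run with --dir-suffix; anything else (the discount, the learning rates) still has to be edited inside the script, as in the diff above. A rough sketch:

    import subprocess

    # Sweep the command-line options mentioned in this thread, tagging each run
    # with --dir-suffix so the TensorBoard curves stay distinguishable.
    for enable_gae in (False, True):
        for normalize_adv in (False, True):
            cmd = ["python", "examples/run_ppo.py",
                   "--dir-suffix", "gae{}_adv{}".format(int(enable_gae), int(normalize_adv))]
            if enable_gae:
                cmd.append("--enable-gae")
            if normalize_adv:
                cmd.append("--normalize-adv")
            subprocess.run(cmd, check=True)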

keiohta commented 4 years ago

The reproduction results above are not correct, because the number of steps is not the same as in the paper (Deep Reinforcement Learning that Matters).