openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
MIT License

PPO2 does not seem to work on continuous env Pendulum-v0 #696

Closed: wonchul-kim closed this issue 5 years ago

wonchul-kim commented 5 years ago

I ran the ppo2 baseline on "Pendulum-v0", but it does not learn well.

I also tried changing the number of hidden nodes in the layers to [16, 16], but that did not help either.

Has anyone gotten this to work on Pendulum-v0?

pzhokhov commented 5 years ago

are you using default hyperparameters? I have tried it in the following way:

python -m baselines.run --alg=ppo2 --env=Pendulum-v0 --nminibatches=32 --noptepochs=10 --num_env=4 --num_timesteps=3e6 --play

and got a mean reward per episode (eprewmean) of -170 and fairly decent behaviour. 3M timesteps seems like rather a lot for Pendulum, so I am pretty sure one can make it work much faster (with fewer steps) by tuning the hyperparameters.

wonchul-kim commented 5 years ago

Thanks for sharing your experience.

However, if nminibatches is 32, an assertion error appears because nsteps (200) % nminibatches must be 0. And it really seems like too many episodes are needed to reach a reward of -170...

pzhokhov commented 5 years ago

Assertion error - yes, because nbatch = nsteps * nenvs has to be a multiple of nminibatches (otherwise the training batch cannot be divided evenly into minibatches, and the last minibatch gets a noisy gradient, which is usually undesirable). As for the number of steps being too large - I agree; so if you find a set of hyperparameters that works faster, feel free to share it here :)
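
A quick sanity check of that constraint (an illustrative sketch, not the literal ppo2 source; the variable names just follow its conventions):

```python
# Batch-splitting arithmetic behind the assertion.
nenvs = 4          # --num_env
nsteps = 2048      # ppo2's default rollout length per environment
nminibatches = 32  # --nminibatches

nbatch = nenvs * nsteps            # 8192 transitions collected per update
assert nbatch % nminibatches == 0  # the check that fails otherwise
nbatch_train = nbatch // nminibatches
print(nbatch_train)                # 256 transitions per minibatch

# With nsteps=200 and a single env, nbatch = 200 is not divisible by 32,
# which is why the assertion error above is raised.
```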

luanyun commented 5 years ago

This ppo2 seems like an MC-based method, so its slow learning is understandable; I think a TD version would be faster. If you want nsteps to be smaller, then gamma should be smaller too: set gamma=0.9 and you can use nsteps=256. Compare: 0.99^2048 = 1.15e-9, 0.99^256 = 0.0763, 0.9^256 = 1.96e-12. num_timesteps=1e6 is enough.
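
The figures quoted above are just powers of the discount factor; a short check (illustrative only):

```python
# gamma ** horizon: how much weight a reward that far in the future carries.
for gamma, horizon in [(0.99, 2048), (0.99, 256), (0.9, 256)]:
    print(f"{gamma}^{horizon} = {gamma ** horizon:.3g}")
# 0.99^2048 ~ 1.15e-09  -> rewards 2048 steps ahead are essentially invisible
# 0.99^256  ~ 0.0763    -> with gamma=0.99, step 256 still carries ~7.6% weight
# 0.9^256   ~ 1.93e-12  -> with gamma=0.9, step 256 carries essentially no weight
```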

wonchul-kim commented 5 years ago

Thank you for sharing.

Hmm... could you explain in more detail? I don't understand the calculation 0.99^2048 = 1.15e-9, 0.99^256 = 0.0763, 0.9^256 = 1.96e-12. What does it have to do with nsteps and num_timesteps? How did you determine gamma and nsteps as 0.9 and 256 respectively, and num_timesteps as 1e6?

luanyun commented 5 years ago

In ppo2, if you set nsteps=256, then each episode will only run 256 steps before the environment is reset.
0.99^256 = 0.0763 means the value estimate of the last step contributes more than 7% (7%~99%) to the update, but that estimate is not accurate.

pzhokhov commented 5 years ago

@luanyun @wonchul-kim baselines ppo2 uses generalized advantage estimation (GAE), so it is technically neither MC nor 1-step TD; it can be varied smoothly between the two via the lam parameter (the GAE lambda) and nsteps. You can experiment and effectively make it 1-step TD by setting lam=0.0, or make it exactly n-step TD by setting lam=1.0 (and varying nsteps). (Please refer to the paper https://arxiv.org/abs/1707.06347 or the code in class ppo2.Runner for specifics.)
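
For concreteness, here is a minimal numpy sketch of the GAE recursion (a paraphrase of what ppo2.Runner computes, not the literal source; the done/bootstrap conventions are simplified):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout of length nsteps.

    rewards, values, dones: arrays of shape (nsteps,);
    last_value: value estimate for the state after the final step (bootstrap).
    """
    nsteps = len(rewards)
    advs = np.zeros(nsteps)
    lastgaelam = 0.0
    for t in reversed(range(nsteps)):
        next_value = last_value if t == nsteps - 1 else values[t + 1]
        nonterminal = 1.0 - dones[t]
        # 1-step TD error
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        # lam=0.0 -> advs[t] = delta (pure 1-step TD);
        # lam=1.0 -> discounted sum of deltas, i.e. n-step return minus baseline.
        lastgaelam = delta + gamma * lam * nonterminal * lastgaelam
        advs[t] = lastgaelam
    returns = advs + values  # value-function targets
    return advs, returns
```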

The gamma parameter controls how much the learning prefers rewards now as opposed to later (this includes both the real rewards obtained from the environment and the remaining return-to-go estimated at the end of the rollout via the value function approximation). So I would say @luanyun's estimates show the following: if gamma=0.9, then at any step the agent does not care at all about what happens 256 steps in the future (since 0.9^256 ~ 1.9e-12, even if the agent were to get a giant reward / value function estimate of 1e12 at 256 steps in the future, it would mean the same as a reward of 1.9 at the current timestep). I suspect that would lead to the agent trying to swing the pendulum towards the top with maximum force (as that maximizes short-term reward), not caring whether it falls back afterwards.

Also worth noting: while the nsteps parameter controls the number of steps in a rollout, the environment is not reset after nsteps; it is reset only when "done" is returned (i.e. if nsteps=256 and the environment did not return "done" by step 255, the next rollout picks up from wherever it left off).
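
To make that concrete, here is a toy loop against plain gym (not baselines code) showing that the rollout length and the reset boundary are independent:

```python
import gym

env = gym.make("Pendulum-v0")
obs = env.reset()
nsteps = 256  # rollout length, chosen independently of the episode length

for rollout in range(2):
    for _ in range(nsteps):
        obs, reward, done, info = env.step(env.action_space.sample())
        if done:
            # Pendulum-v0 is time-limited to 200 steps, so "done" fires here,
            # regardless of the 256-step rollout length.
            obs = env.reset()
    # No reset at the rollout boundary: the next rollout simply continues
    # from whatever state the env is currently in.
```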

In terms of making it work with a minimal number of timesteps - worth experimenting (to which end, feel free to post hyperparameters that work here, but please post complete sets so the results are reproducible). I found that aggressive optimization at each step seems to help. For instance,

python -m baselines.run --alg=ppo2 --env=Pendulum-v0 --num_env=4 --nminibatches=1 --noptepochs=256 --num_timesteps=5e5 --play

(in other words, compute gradients on the entire batch of experience and take 256 optimizer steps per rollout) gets a reward of about -230, looks fairly decent visually, and uses a total of 0.5M steps. Closing this, as I don't think this is an issue with the ppo2 implementation; if you find better HPs, please post them here and we can include them in the docs.