I see the same behavior with the PPO algorithm; here's the log output at ~5M steps using this tuned config file:
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 33/103 CPUs, 1/3 GPUs, 0.0/351.22 GiB heap, 0.0/77.25 GiB objects
Memory usage on this node: 14.1/60.0 GiB
Result logdir: /root/ray_results/pong-ppo
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
- PPO_PongNoFrameskip-v4_0: RUNNING, [33 CPUs, 1 GPUs], [pid=33], 1253 s, 1042 iter, 5210000 ts, -20.3 rew
Hmm, I tried a couple more of the tuned configs and can't get them to train, either.
atari-ppo.yaml at 7M timesteps (edited to play PongNoFrameskip-v4):
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 11/103 CPUs, 1/3 GPUs, 0.0/351.22 GiB heap, 0.0/77.25 GiB objects
Memory usage on this node: 31.4/60.0 GiB
Result logdir: /root/ray_results/atari-ppo
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
- PPO_PongNoFrameskip-v4_0_env=PongNoFrameskip-v4: RUNNING, [11 CPUs, 1 GPUs], [pid=31], 5848 s, 1463 iter, 7315000 ts, -20.6 rew
atari-apex.yaml at 10M timesteps (only Qbert):
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 17/103 CPUs, 1/3 GPUs, 0.0/351.22 GiB heap, 0.0/77.25 GiB objects
Memory usage on this node: 48.8/60.0 GiB
Result logdir: /root/ray_results/apex
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
- APEX_QbertNoFrameskip-v4_0_env=QbertNoFrameskip-v4: RUNNING, [17 CPUs, 1 GPUs], [pid=34], 6208 s, 192 iter, 10721120 ts, 286 rew
Here's the Dockerfile I'm using to run these experiments (in an AWS Kubernetes cluster with p3.2xlarge and c5n.9xlarge instances): https://gist.github.com/zplizzi/17e49ffabff848d16973e83277dac425 It was built a couple of days ago, so it should have roughly the most recent stable versions of all the installed packages.
I can reproduce this. Trying to figure out when this first broke.
It seems it works in Ray 0.7.4.
RUNNING trials:
- PPO_BreakoutNoFrameskip-v4_0_env=BreakoutNoFrameskip-v4: RUNNING, [11 CPUs, 1 GPUs], [pid=84978], 274 s, 73 iter, 365000 ts, 7.45 rew
- PPO_BreakoutNoFrameskip-v4_1_env=BreakoutNoFrameskip-v4: RUNNING, [11 CPUs, 1 GPUs], [pid=84992], 275 s, 74 iter, 370000 ts, 9.37 rew
- PPO_BreakoutNoFrameskip-v4_2_env=BreakoutNoFrameskip-v4: RUNNING, [11 CPUs, 1 GPUs], [pid=85027], 274 s, 74 iter, 370000 ts, 3.6 rew
- PPO_BreakoutNoFrameskip-v4_3_env=BreakoutNoFrameskip-v4: RUNNING, [11 CPUs, 1 GPUs], [pid=85018], 277 s, 75 iter, 375000 ts, 12.1 rew
But not in 0.7.5+ (<= 2 reward for Breakout no matter how long it trains).
Cool, 0.7.4 seems to work for me also.
@michaelzhiluo any updates?
I tested PPO on Pong, SpaceInvaders, and Breakout on 0.8.0, and it works :).
Pong reached 18-19 reward in ~2.5-3.0 million timesteps. I got it to work by setting kl_coeff: 0; that is the only change needed in tuned_examples/pong-ppo.yaml. It looks like the trust region is hurting learning, since it prevents the agent from taking big steps; the parameter-to-performance landscape most likely requires a big jump to reach higher rewards in Pong (as in 18-20 rew).
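For reference, a minimal sketch of that one-line override using the Python/tune API instead of the yaml runner (assuming Ray 0.8.0's tune.run interface; the rest of the tuned settings are elided):

import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"timesteps_total": 5000000},
    config={
        "env": "PongNoFrameskip-v4",
        "kl_coeff": 0.0,  # disable the KL penalty (trust-region) term
        # ...remaining settings as in tuned_examples/pong-ppo.yaml...
    },
)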
For other environments: SpaceInvaders reaches 900 reward in 14 million timesteps. Breakout PPO does not work as well as other agents, but can attain 200 reward in 15-20 million timesteps, as in the original PPO paper.
@michaelzhiluo interesting. We haven't really touched the KL penalty though so I don't know why you would suddenly need to set the coeff to zero. Were you able to identify the commit(s) that broke / fixed the issue?
Not sure; the configuration parameters in pong-ppo.yaml are subtly different between 0.7.4 and newer versions of Ray, including gamma and the learning rate. Most likely it is just a hyperparameter difference.
Can you figure out which commit caused it with git bisect? We need to root-cause the issue, otherwise there could be other problems.
By the way, pong-ppo.yaml has not changed for the entire history it has been checked in: https://github.com/ray-project/ray/commits/5d7afe8092f5521a8faa8cdfc916dcd9a5848023/rllib/tuned_examples/pong-ppo.yaml
Try this commit: https://github.com/ray-project/ray/commit/d8205976e8e5c98e177c0d1d03fefef21a69d5d9#diff-6921e691057d93f6a4d6507ad9012352. What is different is the LR and gamma, which are not included in this old version of the yaml file. Looking at agents/ppo/ppo.py, we see that lr is 0.00005, a lot smaller than in the current yaml file.
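One quick way to surface that kind of drift is to print the defaults each Ray version actually ships and diff them against the yaml; a sketch, assuming agents.ppo exports DEFAULT_CONFIG as it does in 0.7.x/0.8.x:

# Run this under each Ray version and compare the output.
from ray.rllib.agents.ppo import DEFAULT_CONFIG

print(DEFAULT_CONFIG["lr"])     # the old defaults cited above had 0.00005
print(DEFAULT_CONFIG["gamma"])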
It looks like all prior commits are hard to find since RLlib was moved to the repository root. You have to go back to the commit right before that move to find the earlier history: https://github.com/ray-project/ray/commits/384cbfb21140aad820b3c72c4624edc3cf08beb2/python/ray/rllib/tuned_examples/pong-ppo.yaml
The problem probably also affects other algos. To replicate results, I'm using settings from rl-experiments. For the command line below, run on code from master, I get a chart for Breakout that seems wrong: the reward appears stuck at 2.0.
rllib train -f atari-apex/atari-apex.yaml
In some other testing I found that worker.get_policy().model returns a fully-connected model type in 0.7.6, and a vision-type model in 0.7.4. So probably that's what's causing these issues?
This is the worker that I was testing with:
import gym
import ray
# Import path for the APPO policy as of Ray 0.7.x/0.8.0 (assumption):
from ray.rllib.agents.ppo.appo_policy import AsyncPPOTFPolicy

worker = ray.rllib.evaluation.RolloutWorker(
    env_creator=lambda _: gym.make("PongNoFrameskip-v4"),
    policy=AsyncPPOTFPolicy,
    model_config=model_config,    # standard Atari model config, defined elsewhere
    policy_config=policy_config)  # standard Atari PPO policy config, defined elsewhere
with the standard model and policy config for the Atari-PPO examples.
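For anyone reproducing this, the probe itself is just a type check on the built model (a sketch; the class behavior described is from the reports above, not guaranteed across versions):

# Inspect which model class the catalog actually built for this policy.
model = worker.get_policy().model
print(type(model).__name__)  # a vision (conv) net is expected for Atari on 0.7.4,
                             # but a fully-connected net shows up on 0.7.6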
Thanks @zplizzi, that was the issue: https://github.com/ray-project/ray/pull/6087
@michaelzhiluo , it's quite miraculous you got Atari working with fcnet then...
@ericl I think only Pong works. I did fresh runs on code from master (blue line) and the releases/0.7.4 branch (red line) for Breakout with DQN. As can be seen, current master doesn't progress beyond a score of 2.0.
@zplizzi - this would explain why Apex and DQN are also broken.
Command:
# from rl-experiments repo
rllib train -f atari-dqn/dueling-ddqn.yaml
@sytelus can you try the fix above (https://github.com/ray-project/ray/pull/6087)?
@ericl - trying this out. I ran this just to make sure all bits are fresh after the pull, but got an error:
bazel build //:ray_pkg
Error:
make[1]: Leaving directory '/data/home/shitals/.cache/bazel/_bazel_shitals/db2d1aec9f739be4812e721840071a44/sandbox/linux-sandbox/2500/execroot/com_github_ray_project_ray/src'
ERROR: /data/home/shitals/GitHubSrc/ray/BUILD.bazel:750:1: Executing genrule //:python/ray/_raylet.pyx_cython_translation failed (Exit 1) bash failed: error executing command /bin/bash -c ... (remaining 1 argument(s) skipped)
Use --sandbox_debug to see verbose messages from the sandbox
/bin/bash: PYTHON_BIN_PATH: unbound variable
Target //:ray_pkg failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 132.736s, Critical Path: 86.41s
INFO: 887 processes: 887 linux-sandbox.
FAILED: Build did NOT complete successfully
Looks like I need to set PYTHON_BIN_PATH to the bin directory in Anaconda? Is there anything else I need to do to run the build command successfully? Also, if the change is only in Python files, do we still need to run the build command? I see files getting replicated into the ray/python directory, so I thought I needed to run the build. Let me know; I would be happy to send a PR for the docs with this info.
Hmm, not sure, but you can avoid building Ray by using the setup-dev script: https://ray.readthedocs.io/en/latest/rllib-dev.html#development-install
I eventually found ray/build.sh, which had statements to set these variables. It worked! Right now I'm running DQN from this pull request on Breakout and the results are looking good, but I have to wait a few hours for full results.
Update: DQN/Breakout results are in line with 0.7.4 after applying https://github.com/ray-project/ray/pull/6087 on current master. Reward 40.54 after 860K steps.
@ericl - It would be great to merge this PR although I can't comment on its impact outside of Atari environments.
@edoakes could we make sure this ends up in the latest release?
I am still testing this in https://github.com/ray-project/ray/pull/6093, it looks like APEX might still have some issues not completely solved by the patch.
System information
python3 train.py -f pong-appo.yaml
using the rllib train.py and the tuned APPO pong yaml file.
Describe the problem
Upon finishing training (termination at 5M steps as in the config), the reward is still around -20, which is the initial reward of a random agent. The comments in the tuned example claim a result that I cannot reproduce.
Training seemed to go smoothly and I didn't see any errors, except RuntimeWarning: Mean of empty slice. and RuntimeWarning: invalid value encountered in double_scalars at the beginning of training, as mentioned in #5520.
Source code / logs
The final training step logs: