ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Atari broken in 0.7.5+ since RLlib chooses wrong neural net model by default #6059

Closed. zplizzi closed this 4 years ago

zplizzi commented 4 years ago

System information

Describe the problem

Upon finishing training (termination at 5M steps as in the config), the reward is still around -20, which is the initial reward of a random agent. The comments in the tuned example say

# This can reach 18-19 reward in ~5-7 minutes on a Titan XP GPU
# with 32 workers and 8 envs per worker. IMPALA, when ran with 
# similar configurations, solved Pong in 10-12 minutes.
# APPO can also solve Pong in 2.5 million timesteps, which is
# 2x more efficient than that of IMPALA.

which I cannot reproduce.

Training seemed to go smoothly and I didn't see any errors, except for RuntimeWarning: Mean of empty slice. and RuntimeWarning: invalid value encountered in double_scalars at the beginning of training, as mentioned in #5520.

Source code / logs

The final training step logs:

Result for APPO_PongNoFrameskip-v4_0:
  custom_metrics: {}
  date: 2019-10-31_18-56-51
  done: true
  episode_len_mean: 3710.01
  episode_reward_max: -18.0
  episode_reward_mean: -20.35
  episode_reward_min: -21.0
  episodes_this_iter: 88
  episodes_total: 5366
  experiment_id: e9ccd551521a44e287451f8d87dd7dbe
  hostname: test03-vgqp8
  info:
    learner:
      cur_lr: 0.0005000000237487257
      entropy: 1.7659618854522705
      mean_IS: 1.1852530241012573
      model: {}
      policy_loss: -0.003545303363353014
      var_IS: 0.21974682807922363
      var_gnorm: 23.188478469848633
      vf_explained_var: 0.0
      vf_loss: 0.01947147212922573
    learner_queue:
      size_count: 12504
      size_mean: 14.46
      size_quantiles:
      - 12.0
      - 13.0
      - 15.0
      - 16.0
      - 16.0
      size_std: 1.0432641084595982
    num_steps_replayed: 0
    num_steps_sampled: 5012800
    num_steps_trained: 9999200
    num_weight_syncs: 12532
    sample_throughput: 6554.589
    timing_breakdown:
      learner_dequeue_time_ms: 0.018
      learner_grad_time_ms: 137.841
      learner_load_time_ms: .nan
      learner_load_wait_time_ms: .nan
      optimizer_step_time_ms: 672.661
    train_throughput: 11854.045
  iterations_since_restore: 59
  node_ip: 192.168.2.40
  num_healthy_workers: 32
  off_policy_estimator: {}
  pid: 34
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_env_wait_ms: 10.214430495025196
    mean_inference_ms: 1.736408154661836
    mean_processing_ms: 0.5789328915422826
  time_since_restore: 632.1431384086609
  time_this_iter_s: 11.452256441116333
  time_total_s: 632.1431384086609
  timestamp: 1572548211
  timesteps_since_restore: 5012800
  timesteps_this_iter: 75200
  timesteps_total: 5012800
  training_iteration: 59
  trial_id: b183a16a
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/65 CPUs, 0/1 GPUs, 0.0/193.7 GiB heap, 0.0/39.6 GiB objects
Memory usage on this node: 24.5/60.0 GiB
Result logdir: /root/ray_results/pong-appo
Number of trials: 1 ({'TERMINATED': 1})
TERMINATED trials:
 - APPO_PongNoFrameskip-v4_0:   TERMINATED, [33 CPUs, 1 GPUs], [pid=34], 632 s, 59 iter, 5012800 ts, -20.4 rew
zplizzi commented 4 years ago

I see the same behavior with the PPO algorithm; here's the log output at ~5M steps using this tuned config file:

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 33/103 CPUs, 1/3 GPUs, 0.0/351.22 GiB heap, 0.0/77.25 GiB objects
Memory usage on this node: 14.1/60.0 GiB
Result logdir: /root/ray_results/pong-ppo
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_PongNoFrameskip-v4_0:    RUNNING, [33 CPUs, 1 GPUs], [pid=33], 1253 s, 1042 iter, 5210000 ts, -20.3 rew
zplizzi commented 4 years ago

Hmm, I tried a couple more of the tuned configs and can't get them to train, either.

atari-ppo.yaml at 7M timesteps (edited to play PongNoFrameskip-v4):

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 11/103 CPUs, 1/3 GPUs, 0.0/351.22 GiB heap, 0.0/77.25 GiB objects
Memory usage on this node: 31.4/60.0 GiB
Result logdir: /root/ray_results/atari-ppo
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_PongNoFrameskip-v4_0_env=PongNoFrameskip-v4:     RUNNING, [11 CPUs, 1 GPUs], [pid=31], 5848 s, 1463 iter, 7315000 ts, -20.6 rew

atari-apex.yaml at 10M timesteps (only Qbert):

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 17/103 CPUs, 1/3 GPUs, 0.0/351.22 GiB heap, 0.0/77.25 GiB objects
Memory usage on this node: 48.8/60.0 GiB
Result logdir: /root/ray_results/apex
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - APEX_QbertNoFrameskip-v4_0_env=QbertNoFrameskip-v4:  RUNNING, [17 CPUs, 1 GPUs], [pid=34], 6208 s, 192 iter, 10721120 ts, 286 rew

Here's the Dockerfile I'm using to run these experiments (in an AWS Kubernetes cluster with p3.2xlarge and c5n.9xlarge instances): https://gist.github.com/zplizzi/17e49ffabff848d16973e83277dac425 It was built a couple of days ago, so it should have roughly the most recent stable versions of all the installed packages.

ericl commented 4 years ago

I can reproduce this. Trying to figure out when this first broke.

ericl commented 4 years ago

It seems it works in Ray 0.7.4.

RUNNING trials:
 - PPO_BreakoutNoFrameskip-v4_0_env=BreakoutNoFrameskip-v4:     RUNNING, [11 CPUs, 1 GPUs], [pid=84978], 274 s, 73 iter, 365000 ts, 7.45 rew
 - PPO_BreakoutNoFrameskip-v4_1_env=BreakoutNoFrameskip-v4:     RUNNING, [11 CPUs, 1 GPUs], [pid=84992], 275 s, 74 iter, 370000 ts, 9.37 rew
 - PPO_BreakoutNoFrameskip-v4_2_env=BreakoutNoFrameskip-v4:     RUNNING, [11 CPUs, 1 GPUs], [pid=85027], 274 s, 74 iter, 370000 ts, 3.6 rew
 - PPO_BreakoutNoFrameskip-v4_3_env=BreakoutNoFrameskip-v4:     RUNNING, [11 CPUs, 1 GPUs], [pid=85018], 277 s, 75 iter, 375000 ts, 12.1 rew

But not in 0.7.5+ (<= 2 reward for Breakout no matter how long it runs).

zplizzi commented 4 years ago

Cool, 0.7.4 seems to work for me also.

ericl commented 4 years ago

@michaelzhiluo any updates?

michaelzhiluo commented 4 years ago

I tested PPO on Pong, SpaceInvaders, and Breakout on 0.8.0. It works on 0.8.0 :).

Pong reached 18-19 reward in ~2.5-3.0 million timesteps. I got it to work by setting kl_coeff: 0. That is the only change needed to the file in tuned_examples/pong-ppo.yaml. It looks like the trust region is hurting learning, since it prevents the agent from taking big steps. The parameter-to-performance landscape most likely requires a big jump to reach higher rewards in Pong (as in 18-20 rew).
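
For reference, the same override can be expressed as a plain Python Tune run instead of editing the yaml. The snippet below is only a minimal sketch of the kl_coeff change described above; the stopping criterion and every omitted hyperparameter are assumed to come from tuned_examples/pong-ppo.yaml.

import ray
from ray import tune

ray.init()

# Minimal sketch of the workaround described above: zero out the adaptive
# KL penalty coefficient. Every other setting is assumed to match
# tuned_examples/pong-ppo.yaml and is omitted here.
tune.run(
    "PPO",
    stop={"timesteps_total": 5000000},
    config={
        "env": "PongNoFrameskip-v4",
        "kl_coeff": 0.0,
        # ... remaining hyperparameters from pong-ppo.yaml ...
    },
)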

For other environments: SpaceInvaders reaches 900 reward in 14 million timesteps. Breakout PPO does not work as well as other agents, but can attain 200 reward in 15-20 million timesteps, as in the original PPO paper.

ericl commented 4 years ago

@michaelzhiluo interesting. We haven't really touched the KL penalty though so I don't know why you would suddenly need to set the coeff to zero. Were you able to identify the commit(s) that broke / fixed the issue?

michaelzhiluo commented 4 years ago

Not sure; the configuration parameters for pong-ppo.yaml are subtly different between 0.7.4 and newer versions of Ray, including gamma and the learning rate. Most likely it is just a hyperparameter difference.

ericl commented 4 years ago

Can you figure out which commit caused it with git bisect? We need to root-cause the issue, otherwise there could be other problems.

ericl commented 4 years ago

By the way, pong-ppo.yaml has not changed for its entire history since it was checked in: https://github.com/ray-project/ray/commits/5d7afe8092f5521a8faa8cdfc916dcd9a5848023/rllib/tuned_examples/pong-ppo.yaml

michaelzhiluo commented 4 years ago

Try this commit: https://github.com/ray-project/ray/commit/d8205976e8e5c98e177c0d1d03fefef21a69d5d9#diff-6921e691057d93f6a4d6507ad9012352. What is different is the LR and gamma, which are not included in this old version of the yaml file. Looking at agents/ppo/ppo.py, we see that lr is 0.00005, a lot smaller than in the current yaml file.
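
One hedged way to spot that kind of drift is to print the trainer defaults and compare them against the yaml overrides. The sketch below assumes the newer module layout, where the defaults live in rllib/agents/ppo/ppo.py.

# Compare the built-in PPO defaults against what pong-ppo.yaml overrides.
from ray.rllib.agents.ppo.ppo import DEFAULT_CONFIG

for key in ("lr", "gamma", "kl_coeff"):
    print(key, DEFAULT_CONFIG[key])
# Any key not set in the yaml silently falls back to these defaults, which
# is how an lr or gamma difference between releases can go unnoticed.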

It looks like all prior commits are hard to find since RLlib was moved to the top-level directory. You have to go back to the commit right before that move to find them: https://github.com/ray-project/ray/commits/384cbfb21140aad820b3c72c4624edc3cf08beb2/python/ray/rllib/tuned_examples/pong-ppo.yaml

sytelus commented 4 years ago

The problem probably also affects other algos. To replicate results, I'm using settings from rl-experiments. With the command line below, on code from master, I get the chart below for Breakout, which seems wrong, as the reward appears to be stuck at 2.0.

rllib train -f atari-apex/atari-apex.yaml

[chart: Breakout episode reward flat at ~2.0]

zplizzi commented 4 years ago

In some other testing I found that worker.get_policy().model returns a fully-connected model type in 0.7.6, and a vision-type model in 0.7.4. So probably that's what's causing these issues?

This is the worker that I was testing with:

worker = ray.rllib.evaluation.RolloutWorker(
    env_creator=lambda _: gym.make("PongNoFrameskip-v4"),
    policy=AsyncPPOTFPolicy,
    model_config=model_config,
    policy_config=policy_config)

with the standard model and policy config for the Atari-PPO examples.
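
Building on the worker above, a quick way to see the symptom is to print the class of the model the policy actually built. This is only a hedged diagnostic sketch; the exact class names vary between releases.

# Inspect which network the model catalog chose for the Atari observations.
model = worker.get_policy().model
print(type(model).__module__, type(model).__name__)
# In 0.7.4 this reports a convolutional vision network for
# PongNoFrameskip-v4; in 0.7.5/0.7.6 it falls back to the fully connected
# model, which cannot learn from raw Atari frames.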

ericl commented 4 years ago

Thanks @zplizzi , that was the issue: https://github.com/ray-project/ray/pull/6087

@michaelzhiluo , it's quite miraculous you got Atari working with fcnet then...

sytelus commented 4 years ago

@ericl I think only Pong works. I did fresh runs on code from master (blue line) and the releases/0.7.4 branch (red line) for Breakout with DQN. As can be seen, current master doesn't progress beyond a score of 2.0.

@zplizzi - this would explain why Apex and DQN are also broken.

[chart: Breakout reward, master (blue) vs. releases/0.7.4 (red)]

Command:

# from rl-experiments repo
rllib train -f atari-dqn/dueling-ddqn.yaml
ericl commented 4 years ago

@sytelus can you try the fix above (https://github.com/ray-project/ray/pull/6087)?

sytelus commented 4 years ago

@ericl - trying this out. I ran this just to make sure all bits are new after the pull, but got an error:

bazel build //:ray_pkg

Error:

make[1]: Leaving directory '/data/home/shitals/.cache/bazel/_bazel_shitals/db2d1aec9f739be4812e721840071a44/sandbox/linux-sandbox/2500/execroot/com_github_ray_project_ray/src'
ERROR: /data/home/shitals/GitHubSrc/ray/BUILD.bazel:750:1: Executing genrule //:python/ray/_raylet.pyx_cython_translation failed (Exit 1) bash failed: error executing command /bin/bash -c ... (remaining 1 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
/bin/bash: PYTHON_BIN_PATH: unbound variable
Target //:ray_pkg failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 132.736s, Critical Path: 86.41s
INFO: 887 processes: 887 linux-sandbox.
FAILED: Build did NOT complete successfully

Looks like I need to set PYTHON_BIN_PATH to the bin directory in Anaconda? Is there anything else I need to do to run the build command successfully? Also, if the change is only in Python files, do we still need to run the build command? I see files getting replicated into the ray/python directory, so I thought I needed to run the build. Let me know; I would be happy to send a PR for the docs with this info.

ericl commented 4 years ago

Hmm not sure, but you can avoid building Ray by using the setup-dev script: https://ray.readthedocs.io/en/latest/rllib-dev.html#development-install

sytelus commented 4 years ago

I eventually found ray/build.sh, which has statements to set these variables. It worked! Right now I'm running the DQN from this pull request on Breakout and the results are looking good, but I have to wait a few hours to get full results.

sytelus commented 4 years ago

Update: DQN/Breakout results are in line with 0.7.4 after applying https://github.com/ray-project/ray/pull/6087 on current master. Reward 40.54 after 860K steps.

@ericl - It would be great to merge this PR although I can't comment on its impact outside of Atari environments.

ericl commented 4 years ago

@edoakes could we make sure this ends up in the latest release?

I am still testing this in https://github.com/ray-project/ray/pull/6093; it looks like APEX might still have some issues not completely solved by the patch.