ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

APE-X dies unexpectedly #3201

Closed: nilsjohanbjorck closed this issue 4 years ago

nilsjohanbjorck commented 6 years ago

I've been trying to run APE-X on 4 nodes similar to p3.16xlarge on AWS running CentOS. Ray is installed via pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.5.3-cp36-cp36m-manylinux1_x86_64.whl. The settings are as follows:

pong-apex:
    env: PongNoFrameskip-v4
    run: APEX
    config:
        target_network_update_freq: 50000
        num_workers: 128
        lr: .0001
        gamma: 0.99
        num_envs_per_worker: 8
        sample_batch_size: 20
        optimizer:
            debug: True
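
For context, the experiment is launched against the existing cluster roughly like this; the YAML file name and the head node address are placeholders, and the flags reflect my reading of the checked-out rllib train.py:

# run the experiment file on an already-started cluster (file name and address are placeholders)
python ray/python/ray/rllib/train.py -f pong-apex.yaml --redis-address=<head-node-ip>:6379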

For some reason, APE-X dies unexpectedly after roughly 5-10 minutes. I've attached the error messages and the debug information from just before it dies. Any idea what's going on here?

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 129/288 CPUs, 1/14 GPUs
RUNNING trials:
 - APEX_PongNoFrameskip-v4_0:   RUNNING [pid=44230], 599 s, 19 iter, 10587360 ts, -15.6 rew

Result for APEX_PongNoFrameskip-v4_0:
  done: false
  episode_len_mean: 6349.090909090909
  episode_reward_max: -6.0
  episode_reward_mean: -14.44055944055944
  episode_reward_min: -21.0
  episodes_this_iter: 143
  episodes_total: 3465
  experiment_id: 561dceb3666740c9bcd6c74e4e8c0098
  info:
    learner_queue:
      size_count: 21812
      size_mean: 14.86
      size_quantiles:
      - 7.0
      - 11.9
      - 16.0
      - 16.0
      - 16.0
      size_std: 2.2271955459725574
    max_exploration: 0.4
    min_exploration: 0.0006553600000000003
    num_steps_sampled: 11170880
    num_steps_trained: 3614720
    num_target_updates: 65
    num_weight_syncs: 23228
    pending_replay_tasks: 16
    pending_sample_tasks: 256
    replay_shard_0:
      add_batch_time_ms: 4.58
      policy_default:
        added_count: 2790240
        est_size_bytes: 1547952468
        evicted_hit_count: 2290241
        evicted_hit_mean: 0.332
        evicted_hit_quantiles:
        - 0.0
        - 0.0
        - 0.0
        - 1.0
        - 3.0
        evicted_hit_std: 0.5829030794222998
        num_entries: 500000
        reprio_count: 905216
        reprio_mean: -0.1095462676479631
        reprio_quantiles:
        - -0.9093356702710702
        - -0.39495892867517784
        - -0.11384899339971868
        - 0.17165180335333372
        - 0.9925115270599136
        reprio_std: 0.23512783704988674
        sampled_count: 909312
      replay_time_ms: 97.383
      update_priorities_time_ms: 19.527
    sample_throughput: 19093.236
    timing_breakdown:
      enqueue_time_ms: 40.437
      get_samples_time_ms: 0.427
      learner_dequeue_time_ms: 0.014
      learner_grad_time_ms: 72.931
      put_weights_time_ms: 25.253
      replay_processing_time_ms: 755.598
      sample_processing_time_ms: 452.464
      sample_time_ms: 1210.9
      train_time_ms: 1210.9
      update_priorities_time_ms: 2.73
    train_throughput: 6511.52
  iterations_since_restore: 20
  node_ip: 192.168.28.18
  num_metric_batches_dropped: 0
  pid: 44230
  policy_reward_mean: {}
  time_since_restore: 632.0204825401306
  time_this_iter_s: 32.19131374359131
  time_total_s: 632.0204825401306
  timestamp: 1541174238
  timesteps_since_restore: 11170880
  timesteps_this_iter: 583520
  timesteps_total: 11170880
  training_iteration: 20

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 129/288 CPUs, 1/14 GPUs
Result logdir: /home/ray_results/pong-apex
RUNNING trials:
 - APEX_PongNoFrameskip-v4_0:   RUNNING [pid=44230], 632 s, 20 iter, 11170880 ts, -14.4 rew

Error processing event.
Traceback (most recent call last):
  File "/home/miniconda3/envs/rl_cpu/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 243, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/miniconda3/envs/rl_cpu/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 200, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/miniconda3/envs/rl_cpu/lib/python3.6/site-packages/ray/worker.py", line 2267, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(010000000173fd6840182d1a04a165956ea7eb80). It was created by remote function <unknown> which failed with:

Remote function <unknown> failed with:

Invalid return value: likely worker died or was killed while executing the task.
Log sync requires cluster to be setup with `ray create_or_update`.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/288 CPUs, 0/14 GPUs
Result logdir: /home/ray_results/pong-apex
ERROR trials:
 - APEX_PongNoFrameskip-v4_0:   ERROR, 1 failures: /home/ray_results/pong-apex/APEX_PongNoFrameskip-v4_0_2018-11-02_11-44-51wflpmsbg/error_2018-11-02_11-57-19.txt [c0018 pid=44230], 632 s, 20 iter, 11170880 ts, -14.4 rew

Traceback (most recent call last):
  File "ray/python/ray/rllib/train.py", line 118, in <module>
    run(args, parser)
  File "ray/python/ray/rllib/train.py", line 112, in run
    queue_trials=args.queue_trials)
  File "/home/miniconda3/envs/rl_cpu/lib/python3.6/site-packages/ray/tune/tune.py", line 124, in run_experiments
    raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [APEX_PongNoFrameskip-v4_0])
ericl commented 6 years ago

I think this is expected, since the latest wheels are currently using an unstable backend.

Can you try using 0.5.3 instead? FYI @stephanie-wang @robertnishihara
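
For reference, moving off the latest dev wheel onto the released package would presumably look something like the following (the exact pin is my reading of the suggestion, not a verified fix):

# replace the dev wheel with the released 0.5.3 from PyPI
pip install -U ray==0.5.3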

stephanie-wang commented 6 years ago

Looks like this is a memory corruption issue in the ray backend, caused by #2959. We'll push a fix in the 0.6 release.

pcmoritz commented 6 years ago

Great catch 👍

nilsjohanbjorck commented 6 years ago

Sorry for the delay; I haven't been able to look into this until now. After installing the latest wheel via pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.5.3-cp36-cp36m-manylinux1_x86_64.whl and pulling the code via git clone https://github.com/ray-project/ray.git, the issue seems to persist. I've been trying to run IMPALA on two nodes and get a similar error:

Remote function train failed with:

Traceback (most recent call last):
  File "/home//miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/worker.py", line 848, in _process_task
    *arguments)
  File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 319, in train
    return Trainable.train(self)
  File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/rllib/agents/impala/impala.py", line 103, in _train
    self.config["collect_metrics_timeout"])
  File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/rllib/optimizers/policy_optimizer.py", line 102, in collect_metrics
    timeout_seconds=timeout_seconds)
  File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/rllib/evaluation/metrics.py", line 38, in collect_episodes
    metric_lists = ray.get(collected)
  File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/worker.py", line 2353, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(0100000046def8798cbcf87b647e8bedb82f93b5). It was created by remote function <unknown> which failed with:

Remote function <unknown> failed with:

Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.

A worker died or was killed while executing task 0000000079065e620b1b20ea1f90e64592cbddbb.
A worker died or was killed while executing task 00000000da5aa0b1d59e98aa26474ef5cd995f8f.
A worker died or was killed while executing task 000000005a21e4ff1f383c81461f875892acd5ed.

I'd imagine that this is related; should I open a new issue?

Thanks!!!

nilsjohanbjorck commented 6 years ago

Also, I should mention that I'm unable to get version 0.5.3 to work due to issue https://github.com/ray-project/ray/issues/3334. It might be possible to fix it in the old source code, but I'd prefer working on master.

robertnishihara commented 6 years ago

Issue #3334 is being fixed in https://github.com/ray-project/ray/pull/3379. Note that you should be able to just downgrade your version of Redis in the meantime.
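
Assuming the downgrade refers to the redis Python client rather than the Redis server Ray bundles, a pin along these lines should tide things over until that PR lands (the exact version bound is my assumption, not something stated above):

# pin the redis-py client back to a 2.x release
pip install -U "redis<3.0"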

nilsjohanbjorck commented 6 years ago

Got it! The latest "Remote function <unknown> failed with: Invalid return value: ..." issue I'm seeing is orthogonal to https://github.com/ray-project/ray/issues/3334, as I ran it with the latest source code, in which https://github.com/ray-project/ray/issues/3334 is already fixed.