cc @rkooo567 @stephanie-wang any ideas here? I can't think of why this might be.
@soundway, are there logs that say why the worker is dying? Perhaps run dmesg
to see if it's some resource exhaustion error or segfault. The full logs in /tmp/ray
may also have something.
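For anyone following along, here is a minimal sketch of what that dmesg check could look like from Python (it assumes dmesg output is readable without root; purely illustrative, not part of Ray):

import subprocess

def recent_kernel_errors():
    # dump the kernel ring buffer and keep lines that point at OOM kills or segfaults
    out = subprocess.run(["dmesg", "--ctime"], stdout=subprocess.PIPE,
                         universal_newlines=True).stdout
    return [line for line in out.splitlines()
            if any(m in line for m in ("Out of memory", "oom-killer", "segfault"))]

for line in recent_kernel_errors():
    print(line)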
I am not aware of any known issues related to this... Seems pretty bad though. I would also like to see the full logs (if it crashed within 4-5 minutes, full logs for that cluster would be ideal).
Btw, this could be the same issue? https://github.com/ray-project/ray/issues/9293
Past logs are lost along with the instances so I will have to launch new ones. It will take a long time (and potentially many costly runs) for me to reproduce the error with QBert, but I will try. Unfortunately, I don’t think I can provide the full logs when running with our custom environments. That said, I went through several different types of *.err on the different machines, and I can show some findings here for our custom setup.
We use 1x p3.8xlarge instance as trainer, and 4x r5.24xlarge (768GB RAM) for rollouts. Each trainer spawns some number of workers that roughly occupy one r5.24xlarge instance. We run with the default PPO implementation provided by RLLib 1.0 without any modification to the policy/trainer class.
In this failed job, it took roughly 1 hour after launching. Below is the memory usage graph:
Note that for cpuworker0, memory usage goes up significantly at around 03:13, which is also the time training stalls. Not all jobs that failed in the past exhibit this behavior in memory (but it could be a failure on our Grafana side to capture it). This is raylet.err on cpuworker0:
F1008 10:12:38.038853 490 521 malloc.cc:56] failed to find entry in mmap_records for fd 0
*** Check failure stack trace: ***
@ 0x5592b35fc9cd google::LogMessage::Fail()
@ 0x5592b35fdb2c google::LogMessage::SendToLog()
@ 0x5592b35fc6a9 google::LogMessage::Flush()
@ 0x5592b35fc8c1 google::LogMessage::~LogMessage()
@ 0x5592b35b4ad9 ray::RayLog::~RayLog()
@ 0x5592b3364676 plasma::GetMmapSize()
@ 0x5592b3345964 plasma::PlasmaStore::ReturnFromGet()
@ 0x5592b3345331 boost::asio::detail::wait_handler<>::do_complete()
@ 0x5592b38fa46f boost::asio::detail::scheduler::do_run_one()
@ 0x5592b38fb971 boost::asio::detail::scheduler::run()
@ 0x5592b38fc9a2 boost::asio::io_context::run()
@ 0x5592b334b1ee plasma::PlasmaStoreRunner::Start()
@ 0x7f8eac0449e0 (unknown)
@ 0x7f8eab9d16db start_thread
@ 0x7f8eab2daa0f clone
Note that 10:12:38 is the UTC time so it’s the same as the 3:12 AM time shown in the graph. It seems like the memory usage on cpuworker0 begins to rise after this error message. In its raylet.out, we see over 11K messages that are similar to this throughout the whole hour of training:
W1008 10:12:11.417868 490 533 object_manager.cc:777] ReceiveObjectChunk index 3 of object 4b55b794aacfcfb629ffe3bd0300000003000000 failed: Chunk already referenced by another thread.
W1008 10:12:11.418118 490 537 object_manager.cc:777] ReceiveObjectChunk index 7 of object 4b55b794aacfcfb629ffe3bd0300000003000000 failed: Chunk already referenced by another thread.
W1008 10:12:11.418407 490 531 object_manager.cc:777] ReceiveObjectChunk index 9 of object 4b55b794aacfcfb629ffe3bd0300000003000000 failed: object already exists in the plasma store
W1008 10:12:11.418475 490 534 object_manager.cc:777] ReceiveObjectChunk index 8 of object 4b55b794aacfcfb629ffe3bd0300000003000000 failed: object already exists in the plasma store
But I also see similar messages in other workers' logs (though with fewer of them, around 2K-8K).
Around this time, we see the following messages across all workers:
2020-10-08 10:13:08,003 WARNING worker.py:1072 The node with node id 3ddf99132c312ccc8af79c66f826b2dc622d5ee4 has been marked dead because the detector has missed too many heartbeats from it.
I verified that that node ID is indeed associated with cpuworker0. The python-core-driver logs across all machines are filled with messages like the one below for the whole hour of training:
W1008 09:16:25.774077 548 2218 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
On driver node (trainer0), dashboard.err is filled with these errors — though we never used the dashboard:
2020-10-08 09:15:24,315 WARNING node_stats.py:317 -- Unexpected channel data received, channel: b'ERROR_INFO:*', data: Failed to load data because of 'ascii' codec can't decode byte 0xec in position 11: ordinal not in range(128)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/dashboard/dashboard.py", line 697, in run
timeout=2)
File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 826, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1602148527.471726244","description":"Error received from peer ipv4:172.28.95.82:62498","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Deadline Exceeded","grpc_status":4}"
And its monitor.err is filled with these:
Monitor: could not find ip for client 03fc8b37f77539bd022caa10fbd19aa23fbf18d0
Monitor: could not find ip for client dd1aa84a12ed78a2f57c548f93210a1848c136d3
Monitor: could not find ip for client 03fc8b37f77539bd022caa10fbd19aa23fbf18d0
Monitor: could not find ip for client dd1aa84a12ed78a2f57c548f93210a1848c136d3
Monitor: could not find ip for client 03fc8b37f77539bd022caa10fbd19aa23fbf18d0
For 4 of the python-core-worker log files on the driver (I believe these are the 4 trials?), they are much larger than the others and contain these messages after cpuworker0's raylet error (I sampled a few messages in chronological order):
I1008 10:13:08.228957 771 1473 direct_actor_transport.cc:187] Failing pending tasks for actor 6120b21703000000
I1008 10:13:08.229079 771 1473 actor_manager.cc:127] received notification on actor, state: DEAD, actor_id: 5c3e836b03000000, ip address: 172.27.134.161, port: 10001, worker_id: 788ca7ba903276d5382c515217a58828ce64bd8b, raylet_id: 3ddf99132c312ccc8af79c66f826b2dc622d5ee4, num_restarts: 0
I1008 10:13:08.229198 771 1473 direct_actor_transport.cc:187] Failing pending tasks for actor 5c3e836b03000000
I1008 10:13:08.229315 771 1473 actor_manager.cc:127] received notification on actor, state: DEAD, actor_id: 89cd273a03000000, ip address: 172.27.134.161, port: 10074, worker_id: 740db7a6fa98d329b9930fa8b5c4bedbfa879fc5, raylet_id: 3ddf99132c312ccc8af79c66f826b2dc622d5ee4, num_restarts: 0
I1008 10:13:08.229434 771 1473 direct_actor_transport.cc:187] Failing pending tasks for actor 89cd273a03000000
I1008 10:13:08.229483 771 1473 service_based_accessor.cc:791] Received notification for node id = 3ddf99132c312ccc8af79c66f826b2dc622d5ee4, IsAlive = 0
I1008 10:13:08.229501 771 1473 core_worker.cc:611] Node failure 3ddf99132c312ccc8af79c66f826b2dc622d5ee4
I1008 10:13:08.229929 771 1473 core_worker.cc:619] Object 25f56eccb8469e216120b2170300000001000000 lost due to node failure 3ddf99132c312ccc8af79c66f826b2dc622d5ee4
I1008 10:13:08.231173 771 1473 object_recovery_manager.cc:42] Starting recovery for object c2dfc437c284c9611b114c0f0300000001000000
I1008 10:13:08.231225 771 1473 core_worker.cc:619] Object 65425a548e61bc9f7c13d2ef0300000001000000 lost due to node failure 3ddf99132c312ccc8af79c66f826b2dc622d5ee4
I1008 10:13:08.231236 771 1473 object_recovery_manager.cc:42] Starting recovery for object 65425a548e61bc9f7c13d2ef0300000001000000
I1008 10:13:08.231266 771 1473 object_recovery_manager.cc:63] Lost object 25f56eccb8469e216120b2170300000001000000 has 1 locations
I1008 10:13:08.231350 771 1473 object_recovery_manager.cc:63] Lost object af78fa8b791e72e9fab5fd080300000001000000 has 0 locations
W1008 10:13:09.222667 771 1473 plasma_store_provider.cc:131] Trying to put an object that already existed in plasma: af78fa8b791e72e9fab5fd080300000001000000.
I1008 10:13:09.222728 771 1473 object_recovery_manager.cc:47] Recovery complete for object af78fa8b791e72e9fab5fd080300000001000000
I1008 10:13:09.222817 771 1473 object_recovery_manager.cc:63] Lost object a14243d4a233d5462a56d9370300000001000000 has 1 locations
W1008 10:13:15.225054 771 1473 plasma_store_provider.cc:131] Trying to put an object that already existed in plasma: c2dfc437c284c9611b114c0f0300000001000000.
I1008 10:13:15.225116 771 1473 object_recovery_manager.cc:47] Recovery complete for object c2dfc437c284c9611b114c0f0300000001000000
I1008 10:13:15.225193 771 1473 object_recovery_manager.cc:63] Lost object 65425a548e61bc9f7c13d2ef0300000001000000 has 0 locations
W1008 10:13:16.225339 771 1473 plasma_store_provider.cc:131] Trying to put an object that already existed in plasma: 65425a548e61bc9f7c13d2ef0300000001000000.
I1008 10:13:16.225394 771 1473 object_recovery_manager.cc:47] Recovery complete for object 65425a548e61bc9f7c13d2ef0300000001000000
I1008 10:13:16.225814 771 1473 object_recovery_manager.cc:47] Recovery complete for object 25f56eccb8469e216120b2170300000001000000
I1008 10:13:16.230865 771 1473 object_recovery_manager.cc:47] Recovery complete for object 86b4165d0ea97a2375623aae0300000001000000
I1008 10:13:16.230893 771 1473 object_recovery_manager.cc:47] Recovery complete for object bc30d377993870e56d86ca0d0300000001000000
W1008 10:13:26.228111 771 771 plasma_store_provider.cc:415] Attempted 50 times to reconstruct objects, but some objects are still unavailable. If this message continues to print, it may indicate that object's creating task is hanging, or something wrong happened in raylet backend. 9 object(s) pending: af78fa8b791e72e9fab5fd0803000000010000000e3f1cc1fb95e6965c3e836b030000000100000065425a548e61bc9f7c13d2ef0300000001000000c2dfc437c284c9611b114c0f03000000010000005b9b655a2e23f00d1d1d4f360300000001000000ca2eb887c575027089cd273a03000000010000001329ca24d0247e20e2196c2b030000000100000005dc1fa5de4fc0dc64947d35030000000100000070215c046e5ef261a14cb5b10300000001000000, etc.
W1008 10:14:16.252666 771 771 plasma_store_provider.cc:415] Attempted 100 times to reconstruct objects, but some objects are still unavailable. If this message continues to print, it may indicate that object's creating task is hanging, or something wrong happened in raylet backend. 9 object(s) pending: af78fa8b791e72e9fab5fd0803000000010000000e3f1cc1fb95e6965c3e836b030000000100000065425a548e61bc9f7c13d2ef0300000001000000c2dfc437c284c9611b114c0f03000000010000005b9b655a2e23f00d1d1d4f360300000001000000ca2eb887c575027089cd273a03000000010000001329ca24d0247e20e2196c2b030000000100000005dc1fa5de4fc0dc64947d35030000000100000070215c046e5ef261a14cb5b10300000001000000, etc.
…
W1008 10:33:28.116824 771 771 plasma_store_provider.cc:415] Attempted 1250 times to reconstruct objects, but some objects are still unavailable. If this message continues to print, it may indicate that object's creating task is hanging, or something wrong happened in raylet backend. 9 object(s) pending: af78fa8b791e72e9fab5fd0803000000010000000e3f1cc1fb95e6965c3e836b030000000100000065425a548e61bc9f7c13d2ef0300000001000000c2dfc437c284c9611b114c0f03000000010000005b9b655a2e23f00d1d1d4f360300000001000000ca2eb887c575027089cd273a03000000010000001329ca24d0247e20e2196c2b030000000100000005dc1fa5de4fc0dc64947d35030000000100000070215c046e5ef261a14cb5b10300000001000000, etc
Another thing that's worth noting is that since 0.8.7, we frequently see these messages in our console:
(pid=1064, ip=172.28.10.161) 2020-10-08 00:58:26,460 INFO (unknown file):0 -- gc.collect() freed 7962 refs in 0.5344230659998175 seconds
(pid=1302, ip=172.28.10.161) 2020-10-08 00:58:26,465 INFO (unknown file):0 -- gc.collect() freed 7890 refs in 0.5290356460000112 seconds
(pid=1424, ip=172.28.10.161) 2020-10-08 00:58:26,465 INFO (unknown file):0 -- gc.collect() freed 7962 refs in 0.5253929880000214 seconds
(pid=703, ip=172.28.10.161) 2020-10-08 00:58:26,485 INFO (unknown file):0 -- gc.collect() freed 8102 refs in 0.5600109100000736 seconds
(pid=1096, ip=172.28.10.161) 2020-10-08 00:58:26,466 INFO (unknown file):0 -- gc.collect() freed 352 refs in 0.5303756779999276 seconds
Given that some of the gc.collect() calls can take several seconds as training progresses, we aren't sure whether this is expected behavior after the upgrade; we just want to check if this is now the norm. According to the raylet.out log on the driver, this seems to happen every 10 seconds.
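For context, those log lines time a full garbage collection. A minimal sketch of the same style of measurement, runnable inside any worker (nothing Ray-specific assumed):

import gc
import time

start = time.perf_counter()
freed = gc.collect()  # full collection; reports the number of unreachable objects found
print(f"gc.collect() freed {freed} refs in {time.perf_counter() - start:.4f} seconds")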
Other than the logs shown here, most other .err files are either empty or uneventful, though I could’ve easily overlooked things. Please let me know if there’s any particular log file I should be looking at or provide.
Our assumption has been that these issues came after 0.8.5 since we’ve run many very large jobs in 0.8.5 and haven’t encountered any problem.
Can you also check if you have any meaningful logs in gcs_server.err.
And yeah, we've changed how actors are managed since version 0.8.6 (in a more centralized way). I guess that change could be related to this issue.
And do you have any info about which process uses the memory the most when it starts exploding?
gcs_server.err is empty.
I see messages in gcs_server.out after the plasma store error (and nothing else for 1 hour), and here are some sampled messages in chronological order:
I1008 10:13:07.989955 452 452 gcs_node_manager.cc:389] Removing node, node id = 3ddf99132c312ccc8af79c66f826b2dc622d5ee4
W1008 10:13:07.990010 452 452 gcs_placement_group_manager.cc:328] Node 3ddf99132c312ccc8af79c66f826b2dc622d5ee4 failed, rescheduling the placement groups on the dead node.
W1008 10:13:07.990022 452 452 gcs_actor_manager.cc:731] Node 3ddf99132c312ccc8af79c66f826b2dc622d5ee4 failed, reconstructing actors.
W1008 10:13:07.990080 452 452 gcs_actor_manager.cc:800] Actor is failed 2ab78bd103000000 on worker fada8da402dfd2f16d559a2f6c61278c4615cb6b at node 3ddf99132c312ccc8af79c66f826b2dc622d5ee4, need_reschedule = 1, remaining_restarts = 0
I1008 10:13:09.561029 452 452 gcs_worker_manager.cc:31] Reporting worker failure, worker id = 0b4efae1b623ef4c58fa10a39f07fbd06332fcff, node id = a471284266f5a2831a321480a5f5713c780770cf, address = 172.30.242.220
I1008 10:13:09.561241 452 452 gcs_worker_manager.cc:31] Reporting worker failure, worker id = 61d3289626bf566577c55e20db746f7ff89703bc, node id = a471284266f5a2831a321480a5f5713c780770cf, address = 172.30.242.220
I1008 10:13:09.561323 452 452 gcs_worker_manager.cc:31] Reporting worker failure, worker id = 659362b6c0464da8fc94a2e960bc04beb427a4c3, node id = 03fc8b37f77539bd022caa10fbd19aa23fbf18d0, address = 172.28.95.82
I1008 10:13:09.561383 452 452 gcs_worker_manager.cc:31] Reporting worker failure, worker id = 264a99e504c7d20652ca66ed3f0fe8f91a4af445, node id = a471284266f5a2831a321480a5f5713c780770cf, address = 172.30.242.220
I1008 10:13:09.571002 452 452 gcs_actor_manager.cc:680] Worker 128052d1c62b55829a01c5914df3a5840f49f7fe on node 03fc8b37f77539bd022caa10fbd19aa23fbf18d0 intentional exit.
W1008 10:13:09.571030 452 452 gcs_actor_manager.cc:800] Actor is failed 3ad7973003000000 on worker 128052d1c62b55829a01c5914df3a5840f49f7fe at node 03fc8b37f77539bd022caa10fbd19aa23fbf18d0, need_reschedule = 0, remaining_restarts = 0
I1008 10:13:09.571097 452 452 gcs_actor_manager.cc:680] Worker cd56eaaea46fa38d650ef69abde2b30a4c0c7471 on node a471284266f5a2831a321480a5f5713c780770cf intentional exit.
I1008 10:13:09.587844 452 452 gcs_actor_manager.cc:1065] Erasing actor 4f4801cc03000000 owned by c853f6e1f94d565152fc28c613ed85e9019342f0
I1008 10:13:09.587901 452 452 gcs_actor_manager.cc:558] Destroying actor, actor id = db2b89c703000000
I1008 10:13:09.587916 452 452 gcs_actor_manager.cc:1065] Erasing actor db2b89c703000000 owned by c853f6e1f94d565152fc28c613ed85e9019342f0
I1008 10:13:09.587986 452 452 gcs_actor_manager.cc:558] Destroying actor, actor id = 2da0d27603000000
I1008 10:13:09.588001 452 452 gcs_actor_manager.cc:1065] Erasing actor 2da0d27603000000 owned by c853f6e1f94d565152fc28c613ed85e9019342f0
I1008 10:13:09.588011 452 452 gcs_actor_manager.cc:558] Destroying actor, actor id = 5295519a03000000
W1008 10:13:09.597388 452 452 gcs_actor_manager.cc:542] Worker c853f6e1f94d565152fc28c613ed85e9019342f0 failed, destroying actor child.
W1008 10:13:09.597404 452 452 gcs_actor_manager.cc:542] Worker c853f6e1f94d565152fc28c613ed85e9019342f0 failed, destroying actor child.
W1008 10:13:09.597421 452 452 gcs_actor_manager.cc:542] Worker c853f6e1f94d565152fc28c613ed85e9019342f0 failed, destroying actor child.
I1008 10:13:19.293956 452 452 gcs_actor_manager.cc:104] Registering actor, job id = 03000000, actor id = 26b50a7003000000
I1008 10:13:19.295037 452 452 gcs_actor_manager.cc:109] Registered actor, job id = 03000000, actor id = 26b50a7003000000
I1008 10:13:19.295799 452 452 gcs_actor_manager.cc:127] Creating actor, job id = 03000000, actor id = 26b50a7003000000
I1008 10:13:19.296147 452 452 gcs_actor_scheduler.cc:196] Start leasing worker from node a471284266f5a2831a321480a5f5713c780770cf for actor 26b50a7003000000
I1008 10:13:19.320003 452 452 gcs_actor_manager.cc:104] Registering actor, job id = 03000000, actor id = 9f34da8403000000
I1008 10:13:19.320806 452 452 gcs_actor_manager.cc:109] Registered actor, job id = 03000000, actor id = 9f34da8403000000
I1008 10:13:19.321512 452 452 gcs_actor_manager.cc:127] Creating actor, job id = 03000000, actor id = 9f34da8403000000
I1008 10:13:19.321856 452 452 gcs_actor_scheduler.cc:196] Start leasing worker from node 09bc5ba21e2bb67134f86367b9cf8ad53fd503a4 for actor 9f34da8403000000
I1008 10:13:19.325047 452 452 gcs_actor_scheduler.cc:240] Finished leasing worker from 09bc5ba21e2bb67134f86367b9cf8ad53fd503a4 for actor 9f34da8403000000
I1008 10:13:19.325140 452 452 core_worker_client_pool.cc:28] Connected to 172.27.122.67:10022
I1008 10:13:19.326241 452 452 gcs_actor_scheduler.cc:322] Start creating actor 9f34da8403000000 on worker a27ce1ebec1c1efce3eeef588e2b1ed7e69d6204 at node 09bc5ba21e2bb67134f86367b9cf8ad53fd503a4
I don't know which process uses the most memory. But it seems like the plasma error happened before the drastic increase in memory, not as a result of it (memory usage peaked about 10 minutes after everything went haywire). This assumes the timestamps between the workers and Grafana are synced correctly.
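For next time, here is a rough sketch of how one might rank processes on a node by resident memory (assumes psutil is installed there; not part of Ray):

import psutil

# keep only processes whose memory info we are allowed to read
procs = [p for p in psutil.process_iter(["name", "memory_info"]) if p.info["memory_info"]]
procs.sort(key=lambda p: p.info["memory_info"].rss, reverse=True)
for p in procs[:10]:
    print(p.pid, p.info["name"], f"{p.info['memory_info'].rss / 2**20:.0f} MiB")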
I see. So I think what happened is: one of your nodes ran out of memory for some reason, the actors on that node died, and that made your script fail (because you called ray.get on a dead actor's ref).
There were drastic changes in how the actor lifecycle is managed from 0.8.6 onward, but based on your logs that doesn't seem to be related (the basic idea is that all actors are now managed by gcs_server, and it looks like that worked correctly).
@ericl I don't know the details of the object broadcasting issues, but is there any way that's related to this failure?
Also, @soundway, when you ran with 0.8.5, did you see the same memory explosion issue?
also cc @stephanie-wang @barakmich
With 0.8.5 all memory usages are stable for long running jobs with many actors/nodes.
What would cause the plasma error I showed earlier (i.e. failed to find entry in mmap_records for fd 0)? I didn't find anything too useful on Google. That seems like the root cause, as it has the earliest timestamp of all events. OOM seems to be the aftermath, and I have jobs that failed without the spikes in memory.
I've been trying to get logs for the vanilla setup of QBert+PPO with the reproduction script I provided at the beginning of this issue (1 x p3.2xlarge + 1 x c5.18xlarge). It takes about 20-60+ hours to reproduce. The error in raylet.err does look slightly different from when we were running our custom environment, but it's again coming from the plasma store. I removed most of the duplicated IOError: 14 messages. The memory usage also began to increase around the unix timestamp when the error was thrown (though it didn't reach its peak until a few minutes later).
E1013 08:51:42.808821 490 490 object_manager.cc:173] b5a2fc76f2fb8fc0afaed5c60200000003000000 attempted to pull an object that's already local.
E1013 09:08:40.582950 490 490 object_manager.cc:233] The object manager with ID fc03b82c14383ce31d349f62b1dd4c4d820fa70c is trying to pull object db00ef4a5318cc44afaed5c60200000003000000 but the object table suggests that this object manager already has the object. The object may have been evicted.
E1013 09:09:50.811242 490 490 object_manager.cc:233] The object manager with ID fc03b82c14383ce31d349f62b1dd4c4d820fa70c is trying to pull object 011a69285e40df7dafaed5c60200000003000000 but the object table suggests that this object manager already has the object. The object may have been evicted.
E1013 09:24:41.045271 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.045377 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.045403 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.045436 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.045487 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405635 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405663 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405692 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405714 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405815 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405865 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.405961 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.406013 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.406117 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 09:24:41.406142 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
*** Aborted at 1602581362 (unix time) try "date -d @1602581362" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x59) received by PID 490 (TID 0x7f7c297fa700) from PID 89; stack trace: ***
@ 0x7f7c323de8a0 (unknown)
@ 0x5590123261cd std::__detail::_Map_base<>::operator[]()
@ 0x55901232b84b plasma::PlasmaStore::ReturnFromGet()
@ 0x55901232b331 boost::asio::detail::wait_handler<>::do_complete()
@ 0x5590128e046f boost::asio::detail::scheduler::do_run_one()
@ 0x5590128e1971 boost::asio::detail::scheduler::run()
@ 0x5590128e29a2 boost::asio::io_context::run()
@ 0x5590123311ee plasma::PlasmaStoreRunner::Start()
@ 0x7f7c32a469e0 (unknown)
@ 0x7f7c323d36db start_thread
@ 0x7f7c31cdca0f clone
During this process I was tweaking some of the settings, and I might've found a potentially separate issue given the new error I'm getting. The only things I changed are max_t and grace_period in the scheduler, to make each trial much shorter (20 and 10 respectively). This results in Check failed: actor_registry_.count(id.ActorId()) > 0 after running for several hours, as opposed to days. Pasting the script again with the changes:
import copy
import gym
import numpy as np
import ray
import ray.rllib.agents.ppo as ppo

if __name__ == '__main__':
    ray.init(address="auto")
    config = copy.deepcopy(ppo.DEFAULT_CONFIG)
    config.update({
        "rollout_fragment_length": 32,
        "train_batch_size": 8192,
        "sgd_minibatch_size": 512,
        "num_sgd_iter": 1,
        "num_workers": 256,
        "num_gpus": 1,
        "num_cpus_per_worker": 0.25,
        "num_cpus_for_driver": 1,
        "model": {"fcnet_hiddens": [1024, 1024]},
        "framework": "torch",
        "lr": ray.tune.sample_from(lambda s: np.random.random()),
    })
    trainer_cls = ppo.PPOTrainer
    config["env"] = "QbertNoFrameskip-v4"
    ray.tune.run(trainer_cls,
                 config=config,
                 fail_fast=True,
                 reuse_actors=False,
                 queue_trials=True,
                 num_samples=10000,
                 scheduler=ray.tune.schedulers.ASHAScheduler(
                     time_attr='training_iteration',
                     metric='episode_reward_mean',
                     mode='max',
                     max_t=5,
                     grace_period=5,
                     reduction_factor=3,
                     brackets=3),
                 )
Here's a sampled copy of raylet.err. I removed most of the duplicates (all IOError: 14). It's unclear if this new error is related to the one we found when running our own custom environment, or when running QBert with long trial durations. The timing of this error is also right before we see the increase in memory usage.
E1013 23:35:50.200567 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.200614 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.200753 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 1: Cancelled
E1013 23:35:50.200881 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.201339 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1013 23:35:50.201450 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 1: Cancelled
E1013 23:35:50.201604 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.201673 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1013 23:35:50.204852 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.204876 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.204927 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:35:50.204946 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:38:41.597056 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1013 23:38:41.597164 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1013 23:38:41.597545 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1013 23:41:24.079257 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:41:24.079306 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:41:24.079330 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:41:24.079609 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1013 23:41:24.370667 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
E1014 00:19:38.115996 490 490 object_manager.cc:173] 6e6502fb8f6647db10a4406c0100000003000000 attempted to pull an object that's already local.
E1014 00:20:40.974277 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 00:53:53.745927 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 00:53:53.746177 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 1: Cancelled
E1014 00:53:53.746242 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 00:53:53.746289 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 00:53:53.746348 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1014 00:53:53.746407 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 00:57:30.838287 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 01:06:51.857200 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1014 01:06:51.857285 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: HandleServiceClosed
E1014 01:06:51.858800 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 01:06:51.859259 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 01:06:51.859380 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 01:06:51.859437 490 490 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1014 01:14:52.450027 490 490 object_manager.cc:233] The object manager with ID a9f645669d1cbcc66f116857fc64d00cfbf9f245 is trying to pull object 5d4202bb609c5c6e25e7fd4c0100000003000000 but the object table suggests that this object manager already has the object. The object may have been evicted.
F1014 01:17:02.833871 490 490 node_manager.cc:2833] Check failed: actor_registry_.count(id.ActorId()) > 0
*** Check failure stack trace: ***
@ 0x56294c1749cd google::LogMessage::Fail()
@ 0x56294c175b2c google::LogMessage::SendToLog()
@ 0x56294c1746a9 google::LogMessage::Flush()
@ 0x56294c1748c1 google::LogMessage::~LogMessage()
@ 0x56294c12cad9 ray::RayLog::~RayLog()
@ 0x56294be4d51d ray::raylet::NodeManager::HandleObjectLocal()
@ 0x56294be4d6c5 _ZNSt17_Function_handlerIFvRKN3ray14object_manager8protocol11ObjectInfoTEEZNS0_6raylet11NodeManagerC1ERN5boost4asio10io_contextERKNS0_8ClientIDERKNS7_17NodeManagerConfigERNS0_13ObjectManagerESt10shared_ptrINS0_3gcs9GcsClientEESL_INS0_24ObjectDirectoryInterfaceEEEUlS5_E1_E9_M_invokeERKSt9_Any_dataS5_
@ 0x56294be820df _ZN5boost4asio6detail18completion_handlerIZN3ray30ObjectStoreNotificationManager15ProcessStoreAddERKNS3_14object_manager8protocol11ObjectInfoTEEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x56294c47246f boost::asio::detail::scheduler::do_run_one()
@ 0x56294c473971 boost::asio::detail::scheduler::run()
@ 0x56294c4749a2 boost::asio::io_context::run()
@ 0x56294bd97fde main
@ 0x7ffb5ef03b97 __libc_start_main
@ 0x56294bdaebb1 (unknown)
@waldroje @soundway I was investigating this and realized there is a high likelihood the issue is the same as https://github.com/ray-project/ray/issues/11309, which is triggered by creating and destroying many actors.
To validate this, could you try the following?
If this doesn't crash, then it is likely the same issue as 11309. If not, then it would be great if you could include the simplified script that triggers the crash without ASHAScheduler (ideally, simplified to just an rllib train command instead of a script).
Oh forgot to mention, you can also try the ulimit workaround in that issue: https://github.com/ray-project/ray/issues/11309
I’ll run some stuff tomorrow morning and revert
@soundway @waldroje after experimenting with 1.0 vs 0.8.5 more, I think the main difference is we use 2-3x more file descriptors due to the change in the way actors are managed with the GCS service--- it's not really a leak. I'll put more details in the linked bug.
I believe that increasing the file descriptor limit (the ulimit -n value) will resolve the problem; can you try increasing the limit to 10000 or more? The number of fds opened seems to stabilize at just a few thousand.
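For anyone who would rather check and bump the limit from the driver process itself, a minimal sketch using the standard resource module (it can only raise the soft limit up to the existing hard limit):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")
if soft < 10000 <= hard:
    # raise the soft limit toward the hard limit before calling ray.init()
    resource.setrlimit(resource.RLIMIT_NOFILE, (10000, hard))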
I upped my file descriptors to ~16k… but it's still crashing, though I'm getting a more specific error now… not something I've seen before:
Ray worker pid: 24628
WARNING:tensorflow:From /home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-10-21 00:25:36,762 ERROR worker.py:372 -- SystemExit was raised from the worker
Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2328, in get_attr
    pywrap_tf_session.TF_OperationGetAttrValueProto(self._c_op, name, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 331, in _MaybeCompile
    xla_compile = op.get_attr("_XlaCompile")
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2332, in get_attr
    raise ValueError(str(e))
ValueError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.
That's the gist of it… It's late here, and I'll post a more complete set of logs and make sure I'm not making any mistakes… but figured I'd pass that along...
@ericl I did some tests and I have doubts that this is the root cause of our problem, for several reasons (though there might be alternative explanations for my observations).
The ulimit for file descriptors per process on our system is already at 65K according to ulimit -a (and the hard and soft limits are both 65K). Unfortunately, I couldn't raise it any higher due to our infra setup, but this should be more than sufficient. I'm running a few Q-bert jobs with long training iterations and the FIFO scheduler, but it will take a while to know whether they will crash. Will keep you updated.
That said, I looked at the number of file descriptors for two other types of jobs that eventually tend to crash after a long time.
Unfortunately we can't use lsof on our system, so I'm using some approximations.
For setup 1, the largest observed fd number per process doesn't exceed 1700. So the 65K limit per process in theory should be far more than enough. I got it via:
ls /proc/*/task/*/fd /proc/*/fd | sort -nr | head -n 10
The overall number of opened file descriptors according to the command below never exceeds 200K, but I do observe very large fluctuations between trial resets.
cat /proc/sys/fs/file-nr
For setup 2, the max fds per process and the total number of fds (using the commands above) don't exceed 600 and 140K, respectively.
In either case, the max number of fds per process is far below the 65K threshold. The max number of fds across the system is also far below the 78M allowed. Maybe something eventually triggers a sudden increase in fd usage.
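For reference, here is a rough /proc-based substitute for lsof that ranks processes by open descriptor count (just a sketch; it needs permission to read other processes' fd directories):

import os

def fd_counts():
    counts = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            counts[int(pid)] = len(os.listdir(f"/proc/{pid}/fd"))
        except (FileNotFoundError, PermissionError):
            continue  # process exited or not readable by this user
    return counts

for pid, n in sorted(fd_counts().items(), key=lambda kv: -kv[1])[:10]:
    print(pid, n)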
In the past, we observed jobs that crash somewhat early during training (with relatively long min trial durations), and it didn't appear that it was during a trial reset.
Based on these preliminary data, it appears that our problem won't be solved just by increasing ulimit (and we couldn't easily increase it any further anyway). Please let me know if you think there's any issue with the way I approached this.
Thanks for the investigations @soundway. It does indeed sound like a separate issue.
I'm trying to think of how to best reproduce this. Based on what you said, it sounds like it's potentially something triggered by the characteristics of a custom environment, and potentially TensorFlow related. To help me isolate the cause and hopefully get a fast repro, could you try the following?
Probably (1) and (2) are the most critical, since if there's a standalone repro script that runs in a few minutes, I can always try the other steps myself pretty easily.
For (1) I have tried something like that (with varying obs size, model size, environment memory occupancy), but unfortunately it still takes roughly the same amount of time to crash, and I could not get it to crash within minutes. Though I haven't done this too rigorously and I can revisit this.
I haven't tried running long jobs in the cloud with just one instance yet, but I can try that.
I never use TF -- everything I've done here is with Torch (1.6 specifically, but we also see the problem with 1.4). I could try TF as well.
I started (4) with the current setup with Q-bert and those jobs haven't crashed yet, but I will give you updates on this.
I see, minutes would be ideal but even a couple hours isn't too bad. 40 hours is a bit hard to deal with though :/
I tried various configurations today, and I increased my soft/hard ulimit -n to ~63k on head & workers…
Probably the only thing I can say is that for the most part, the same Trial, using fairly limited resources (32 workers), ran fine on a small cluster of roughly 32-150 CPUs… but as the number of nodes increased the Trial would begin breaking, mostly right after starting and before completing a single iteration… Yet a couple of times I was able to get the Trial to run for a long time on a larger cluster…
At no time can I get large resource usage on a large cluster… everything breaks… yet I can't seem to come up with an example that replicates it, as my MockEnv runs fine… so clearly something about the custom env/model, relative to the basic MockEnv, creates problems…
@stephanie-wang Looking through the above posted logs more carefully, there are a couple things that look odd to me:
We seem to be attempting to reconstruct objects, but for RLlib all objects are either ray.put or actor returns. Shouldn't this code path never be triggered?
I'm wondering if the assert here may be incorrect due to race conditions, triggered by GCS actor management. Maybe we should remove it? https://github.com/ray-project/ray/blame/5d7f271e7d275d74e322c146ecc80cc9346238dd/src/ray/raylet/node_manager.cc#L2787
Note that (2) would go away with the new scheduler.
@ericl For 1., in our custom training/environment, we have some code in the training loop that dispatches remote calls and fetches the results via ray.get. Removing that code still leads to crashes (but keeping it does seem to make them happen faster in general). The version of the log I provided was, I believe, using code that included the customization.
The Qbert reproduction script doesn't contain any of such customization. I didn't find any reconstruction related messages in the failed Qbert logs other than the GCS attempting to reconstructing actors after a node has failed.
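For concreteness, the kind of pattern being described is roughly the following (hypothetical names; this is a sketch, not the actual custom code):

import ray

@ray.remote
def side_computation(batch_id):
    # stand-in for the real remote work done alongside training
    return batch_id * 2

def custom_training_step(iteration):
    refs = [side_computation.remote(i) for i in range(4)]
    return ray.get(refs)  # block on the results inside the training loop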
Ok, it could be the reconstruction is just a symptom of the node failure initially caused by the check failure.
@soundway could you try again with the assert removed? (you can use the nightly wheels)
Sure, I can try that. I suppose that would enable me to be more aggressive about resetting trials.
A bit of an update on some of the jobs I've run lately. For Q-Bert, I launched several jobs without trial resets, and after almost a week none has crashed so far. I don't think any of my past Q-bert jobs survived this long when training with the ASHA scheduler.
On the other hand, I also launched some jobs with our environment with long trial durations, and they still crashed after hours to days. On average they seem to take longer to crash, but it's a small sample size.
So I guess the conclusion here is that while resetting trials seems to accelerate the crashing, it's not a necessary condition.
long trial duration, and they still crashed after hours
To clarify, is this infinite duration or just long? The error messages you got earlier should only be occurring at the boundary of trial termination / startup, so I'd be surprised to see them if the trials are running forever.
(I'd be interested in the error logs for crashed infinite duration trials if available).
@ericl Just long. I set a really large stopping criterion for training_iteration (1M in this case), and according to the Tune output when it crashed, the job that got the error was at around ~7K steps. The only interesting error message I found is again related to mmap_records for fd 0 in raylet.err. Unless the training_iteration criterion does something different than what I expect, none of the trials should've been terminated.
Also, maybe you are referring to the error @waldroje posted? I think things may have gotten a little bit mixed up here in this thread with reports from the other issue.
Lemme just change my avatar to make it easier to parse this thread. Our avatars happen to look the same...
I see, so it seems we still don't have a good root cause for the mmap crash (which is not related to the actor check failure?)
Btw
For 1., in our custom training/environment, we have some code in the training loop that dispatches remote calls and fetches the results via ray.get. Removing that code will still lead to crashes (but keeping it does seem to make it fail faster in general). The version of the log I provided I believe was using code that included the customization.
Back on the subject of a sharable repro, how about making a fake env that still dispatches the remote calls (with similarly sized zeroed-out objects, etc.)? You mentioned before that a mock env was not able to reproduce it. I suspect the remote calls are critical to reproducing, if the error is mmap/object-store related.
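Along those lines, a minimal sketch of such a fake env (sizes and names are placeholders; old-style gym API as used by RLlib at the time):

import gym
import numpy as np
import ray

@ray.remote
def fake_work(payload):
    # stand-in remote call that pushes a zero-filled array through the object store
    return float(payload.sum())

class MockRemoteCallEnv(gym.Env):
    observation_space = gym.spaces.Box(-1.0, 1.0, shape=(84,), dtype=np.float32)
    action_space = gym.spaces.Discrete(4)

    def reset(self):
        return np.zeros(84, dtype=np.float32)

    def step(self, action):
        # dispatch a remote call with a similarly sized zeroed-out object each step
        ray.get(fake_work.remote(np.zeros((256, 256), dtype=np.float32)))
        return np.zeros(84, dtype=np.float32), 0.0, False, {}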
Apologies, I've been offline for a couple of days... I'm not 100% sure @soundway's problem and mine are the same, although my error traces look exactly the same... I tried a bunch of iterations today with different combinations of nodes, to make sure it wasn't a specific node causing problems. I was able to run trials under many scenarios with 4-6 nodes, but when I go to 18 nodes my trials die immediately. The point being, I was able to run both locally and remotely, yet it fails at some sort of resource requirement... Since it fails immediately, I'm attaching a copy of the log directory... let me know if anything there is helpful... I'm still trying to produce a mock env that reproduces this problem... logs.zip
@ericl Didn't get to spend too much time on this in the last week but thought I should give an update on my part.
Regarding the dispatches of remote calls -- that was one of my earlier suspicions as well, so I wrote some similar versions for the Q-bert/mocked environments quite early on. Unfortunately, its inclusion doesn't seem to have much impact (and I just recently tested them again and observed the same thing). In fact, if anything the data suggests that adding these extra calls seems to lengthen the time for things to crash. Our own environment does a lot of things with processes/threads/files/network, and have many dependencies. So from the looks of it so far, those are more likely to be related to the crash than these remote calls.
I've been unsuccessful at creating a script that reproduces the issue within minutes on a single instance with a vanilla/mock environment. On a p2.16xlarge instance, I was able to quite consistently recreate the issue within a day if I reset the trials every 5 iterations. However, I sifted through the error logs and couldn't find anything interesting in there (i.e. no errors like mmap_records for fd 0). So this could be a separate issue.
Here's the slightly modified code from earlier:
import copy
import gym
import numpy as np
import ray
import ray.rllib.agents.ppo as ppo

if __name__ == '__main__':
    ray.init(address="auto")
    config = copy.deepcopy(ppo.DEFAULT_CONFIG)
    config.update({
        "rollout_fragment_length": 32,
        "train_batch_size": 2048,
        "sgd_minibatch_size": 512,
        "num_sgd_iter": 1,
        "num_workers": 64,
        "num_gpus": 2,
        "num_cpus_per_worker": 0.20,
        "num_cpus_for_driver": 1,
        "model": {"fcnet_hiddens": [1024, 1024]},
        "framework": "torch",
        "lr": ray.tune.sample_from(lambda s: np.random.random()),
    })
    trainer_cls = ppo.PPOTrainer
    config["env"] = "QbertNoFrameskip-v4"
    ray.tune.run(trainer_cls,
                 config=config,
                 fail_fast=True,
                 reuse_actors=False,
                 queue_trials=True,
                 num_samples=10000,
                 scheduler=ray.tune.schedulers.ASHAScheduler(
                     time_attr='training_iteration',
                     metric='episode_reward_mean',
                     mode='max',
                     max_t=5,
                     grace_period=5,
                     reduction_factor=3,
                     brackets=3),
                 )
Hi,
We observed very similar behavior on our end after updating to 0.8.7 and 1.0.0. We mainly use PPO with a custom-built cluster consisting of four machines, and we also use a custom environment. The problem gets more frequent when using PPO in a multi-agent configuration. We also observed some correlation with train_batch_size: in the current multi-agent case, a value of 1024 * 256 works fine and stays stable for at least 48h+. For twice that batch size (2 * 1024 * 256) we observe a slow degradation of the learning process: the number of healthy workers decreases over time, as in the following graph:
Not sure if this is obvious or helpful, but with each decrease in the healthy actor count the 'step' counter is reset to zero. Below is the same graph of the number of healthy workers, but as a function of step rather than time:
When we set train_batch_size to 4 * 1024 * 256, the training process fails after the first optimization. An important note is that we successfully used this last train_batch_size in the past, without stability issues, for single-agent training. We are also below the physical limits of the cluster by a far margin; however, increasing the environment's demands definitely has a negative effect on stability. As already pointed out, the issue is hard to debug, but if you would like, we can conduct some guided experiments in our configuration.
Hi there, I'm experiencing similar actor exited unexpectedly errors but with RaySGD. I was new to ray and got started straight from version 1.0.0 with Python3.7. I mainly use RaySGD and Tune to run data-parallel training on CPUs in a cluster of beefy ubuntu nodes. I'm dealing with tabular data so I preload them into the memory store and let actors read from there. The entire training process is expected to be long (dozens of hours at least), but I would always get some actor exited errors after first 1~2 hours causing the entire workload to shut down. The RAM usage on each node was never above 50%. I've also set ulimit -n to 2^20 (1 million).
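For reference, the preload-then-read pattern described above looks roughly like this (hypothetical names and shapes, not the poster's actual code):

import numpy as np
import ray

ray.init(address="auto")

# put the tabular data into the object store once, up front
table_ref = ray.put(np.zeros((1_000_000, 32), dtype=np.float32))

@ray.remote
class Trainer:
    def __init__(self, data):
        # Ray resolves the ObjectRef argument, so `data` arrives as the shared
        # array (backed by plasma; zero-copy, read-only for numpy)
        self.data = data

    def train_epoch(self):
        return float(self.data.mean())

workers = [Trainer.remote(table_ref) for _ in range(4)]
print(ray.get([w.train_epoch.remote() for w in workers]))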
Hi there, I'm experiencing a similar (although it might not be exactly the same) issue, with the GCS reporting a stream of:
gcs_server.out:I1117 17:50:03.524953 59 59 gcs_worker_manager.cc:32] Reporting worker failure, worker id = 9e5bc2ff1515401c181db192ee0a0d3dfccd5a48, node id = ec63ba9921092c53a281629a004c547f7f8353bc, address = 10.120.253.12
and monitor reporting a stream of:
Monitor: could not find ip for client c77390ddb83f252b89c3d4d412d30b45c0e96f2a
Similar to @waldroje above, this doesn't happen with 3-4 nodes in a cluster, but starts when the number of nodes is increased. This didn't happen when we used v0.7.3, but it does happen with the latest master and v1.0.1.
Strangely, Tune reports 628 cores occupied over 8 trials, but only 6 trials are actually reporting stats, with a total of 434 occupied cores:
== Status ==
Memory usage on this node: 15.4/187.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 628/640 CPUs, 0/0 GPUs, 0.0/2615.67 GiB heap, 0.0/776.37 GiB objects
Result logdir: /home/kz430x/ray_results/stress_testing_fewer_nodes
Number of trials: 9/18 (1 PENDING, 8 RUNNING)
+-------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+
| Trial name | status | loc | num_envs_per_worker | num_workers | rollout_fragment_length | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|-------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------|
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00000 | RUNNING | 10.120.253.69:80 | 3 | 31 | 50 | 99 | 2178.26 | 998152 | 0.515539 | 1.48895 | 0.410766 | 32.9055 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00001 | RUNNING | 10.120.252.234:80 | 5 | 31 | 50 | 83 | 2185.02 | 838823 | 0.671696 | 0.671751 | 0.66081 | 50 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00002 | RUNNING | 10.120.252.177:79 | 3 | 93 | 50 | 136 | 2162.26 | 1367871 | 0.442033 | 0.485261 | 0.441982 | 6.00777 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00003 | RUNNING | 10.120.253.35:78 | 5 | 93 | 50 | 111 | 2175.79 | 1122076 | 0.671751 | 0.671751 | 0.671751 | 50 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00004 | RUNNING | | 3 | 155 | 50 | | | | | | | |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00005 | RUNNING | 10.120.253.51:81 | 5 | 155 | 50 | 109 | 2183.33 | 1102251 | 0.508532 | 1.35241 | 0.355423 | 18 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00006 | RUNNING | 10.120.252.203:80 | 3 | 31 | 100 | 73 | 2209.38 | 744252 | 0.54014 | 0.602843 | 0.517977 | 10.0118 |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00007 | RUNNING | | 5 | 31 | 100 | | | | | | | |
| A3C_NegotiationMergeHandoffFullLateralEnv_4e806_00008 | PENDING | | 3 | 93 | 100 | | | | | | | |
+-------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+
@ericl I'm not sure the ulimit solution is relevant in my case, since this is the output of ulimit -a inside my docker container:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 18446744073709551597
file size (blocks, -f) unlimited
pending signals (-i) 768188
max locked memory (kbytes, -l) 16384
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 40960
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
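For completeness, the limit a Python worker process actually sees can be checked with the standard library (a small illustrative snippet, not part of the original report), since a containerized process sometimes inherits a different limit than the host shell shows:

import resource

# RLIMIT_NOFILE is the per-process open-file-descriptor limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")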
I'm attaching my logs:
@OrlMat what's the last stable Ray version you managed to make it work at scale...? Did you say v0.8.6 doesn't work for you either??
We started to observe issues after upgrading to v0.8.7 - I have already edited the post.
I'll spend more time trying to reproduce this. Seems like I might just have to bite the bullet of waiting a couple days for a crash.
As usual, if anyone has a fast reproducible crash that would be very helpful.
Btw @soundway, there is some possibility that part of your issue is solved by https://github.com/ray-project/ray/pull/11869, which is in the latest wheels. However, that issue should manifest immediately and not after a long period of time.
@OrlMat @roireshef @soundway Also, can you give us logs from the crashed nodes? It seems like everyone's issue basically stems from random node failures, and we don't know why they happen.
Quick update, I am currently running the following on a 10-node cluster:
rllib train --run=PPO --env=CartPole-v0 --config='{"num_workers": 32}' --stop='{"timesteps_total": 1000}' --num-samples=10000 --ray-address=auto -vv
After a few hours (2000 / 10000 trials complete), the raylet is at 100% CPU, which is causing heartbeat delays. This seems likely to result in random node failures if the raylet CPU load continues to increase.
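One rough way to confirm whether the raylet is the process pegging the CPU on a node (an illustrative sketch; assumes psutil is installed):

import time
import psutil

# Find raylet processes and sample their CPU usage over a short window.
raylets = [p for p in psutil.process_iter(["name"]) if "raylet" in (p.info["name"] or "")]
for p in raylets:
    p.cpu_percent(None)  # prime the per-process counter
time.sleep(5)
for p in raylets:
    print(p.pid, p.cpu_percent(None), "% CPU")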
@OrlMat @roireshef @soundway do you also observe 100% CPU usage of raylet in your workloads after a while?
@ericl In my setup, I observe issues as soon as I start the Tune cluster, like trials not starting even though they are marked as RUNNING. After a while (can be ~1 day) they are marked as FAILED. Please see the following output (from ~1 day after the beginning of the Tune run). Just to give some context, this is a CPU-only cluster with 32 cores and ~128 GB RAM per node, and it has enough nodes/cores to run all trials simultaneously.
+------------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+
| Trial name | status | loc | num_envs_per_worker | num_workers | rollout_fragment_length | iter | total time (s) | ts | reward |
|------------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------|
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00000 | ERROR | | 3 | 31 | 50 | 570 | 4869.03 | 2897185 | 0.686032 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00001 | RUNNING | 10.120.253.10:69 | 5 | 31 | 50 | 480 | 6518.52 | 2470741 | 0.705774 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00002 | RUNNING | 10.120.253.5:81 | 3 | 93 | 50 | 714 | 4301.41 | 3628635 | 0.748065 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00003 | RUNNING | | 5 | 93 | 50 | | | | |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00004 | RUNNING | 10.120.252.231:75 | 3 | 155 | 50 | 1587 | 1951.24 | 8107295 | 0.671751 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00005 | RUNNING | 10.120.253.66:74 | 5 | 155 | 50 | 453 | 6440.03 | 2283493 | 0.540052 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00006 | RUNNING | 10.120.252.198:79 | 3 | 31 | 100 | 433 | 6805.06 | 2244494 | 0.679827 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00007 | RUNNING | 10.120.252.243:89 | 5 | 31 | 100 | 310 | 9105.32 | 1588171 | 0.507826 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00008 | RUNNING | 10.120.253.53:85 | 3 | 93 | 100 | 443 | 6774.8 | 2279045 | 0.50638 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00009 | RUNNING | 10.120.252.222:79 | 5 | 93 | 100 | 330 | 8890.16 | 1680894 | 0.513259 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00010 | RUNNING | 10.120.252.229:76 | 3 | 155 | 100 | 473 | 6002.71 | 2437666 | 0.542599 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00011 | RUNNING | 10.120.252.198:82 | 5 | 155 | 100 | 312 | 8455.52 | 1593173 | 0.417053 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00012 | RUNNING | 10.120.252.216:77 | 3 | 31 | 200 | 270 | 8964.67 | 1484697 | 0.488652 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00013 | RUNNING | 10.120.252.201:61 | 5 | 31 | 200 | 151 | 10504.3 | 762662 | 0.458328 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00014 | RUNNING | 10.120.252.202:78 | 3 | 93 | 200 | 261 | 8668.87 | 1437558 | 0.478272 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00015 | ERROR | | 5 | 93 | 200 | 127 | 9100.73 | 640695 | 0.432807 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00016 | RUNNING | 10.120.252.209:76 | 3 | 155 | 200 | 249 | 9255.21 | 1368668 | 0.437783 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00017 | RUNNING | 10.120.253.25:87 | 5 | 155 | 200 | 46 | 21294.9 | 233320 | 0.596508 |
+------------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+
Number of errored trials: 2
+------------------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|------------------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00000 | 1 | /home/kz430x/ray_results/stress_testing_54/new_base_NegotiationMergeHandoffFullLateralEnv_0_num_envs_per_worker=3,num_workers=31,rollout_fragment_length=50_2020-11-17_20-47-526qieqnu0/error.txt |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00015 | 1 | /home/kz430x/ray_results/stress_testing_54/new_base_NegotiationMergeHandoffFullLateralEnv_15_num_envs_per_worker=5,num_workers=93,rollout_fragment_length=200_2020-11-17_20-47-53s2c_xath/error.txt |
+------------------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Errors:
$kz430x@dcmixphhpc009:~> cat /home/kz430x/ray_results/stress_testing_54/new_base_NegotiationMergeHandoffFullLateralEnv_0_num_envs_per_worker=3,num_workers=31,rollout_fragment_length=50_2020-11-17_20-47-5qieqnu0/error.txt
Failure # 1 (occurred at 2020-11-18_03-14-29)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 468, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1474, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::new_base.train() (pid=68, ip=10.120.253.26)
File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 400, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 497, in train
raise e
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 486, in train
result = Trainable.train(self)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 261, in train
result = self._train()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 132, in _train
return self._train_exec_impl()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 170, in _train_exec_impl
res = next(self.train_exec_impl)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 731, in __next__
return next(self.built_iterator)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 744, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 814, in apply_filter
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 814, in apply_filter
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 744, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 799, in add_wait_hooks
item = next(it)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 744, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 525, in base_iterator
batch = ray.get(obj_id)
ray.exceptions.UnreconstructableError: Object 0d4ced372d8e5c5ffc7dfb5a010000c001000000 is lost (either LRU evicted or deleted by user) and cannot be reconstructed. Try increasing the object store memory available with ray.init(object_store_memory=<bytes>) or setting object store limits with ray.remote(object_store_memory=<bytes>). See also: https://docs.ray.io/en/latest/memory-management.html
@ericl - please also see the total time column. I've verified that the trials that ARE running started more or less (±20 sec) at the same time, yet the total time reported for some trials is completely off. This was actually sampled 6-7 hours after the beginning of the experiment, so the only trial whose reported total time is close to the actual elapsed time is the last one; all the other trials are way off. I hope this helps in any way.
@ericl @rkooo567 I'll make sure to provide you with a reproducible example soon, but note that this doesn't happen on a single machine, so you'll need to run Tune over a cluster of a few Ray nodes.
@ericl @rkooo567 To reproduce, please install the FLOW package using the instructions at https://flow.readthedocs.io/en/latest/flow_setup.html#local-installation-of-flow - NOTE: I skipped installing conda and installed it in my Docker container instead.
Next, clone the environment example from : https://github.com/roireshef/flow_rllib_public
Finally, set up your Ray cluster, add a ray.init() call to the tune script, and you should be ready to go.
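In other words, something along these lines (a minimal sketch; the environment and search space below are placeholders, not the actual flow_rllib_public experiment):

import ray
from ray import tune

# Attach to the already-running cluster instead of starting a local one.
# (_redis_password matches the password used with `ray start`; the parameter
# name may differ across Ray versions.)
ray.init(address="auto", _redis_password="my-password")

tune.run(
    "A3C",
    config={
        "env": "CartPole-v0",  # placeholder; the real runs use a FLOW environment
        "num_workers": tune.grid_search([31, 93, 155]),
        "rollout_fragment_length": tune.grid_search([50, 100, 200]),
    },
    stop={"timesteps_total": 10_000_000},
)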
Note that the issues occur when using relatively large clusters (20+ nodes, 32 cores each), so that multiple trials are running simultaneously. I use the following for the head node:
ray start --block --head \
--port=$redis_port \
--redis-password=$redis_password \
--node-ip-address=$head_node_ip \
--gcs-server-port=6005 \
--dashboard-port=6006 \
--node-manager-port=6007 \
--object-manager-port=6008 \
--redis-shard-ports=6400,6401,6402,6403,6404,6405,6406,6407,6408,6409 \
--min-worker-port=6100 --max-worker-port=6299 \
--include-dashboard=false
And for workers:
ray start --block \
--address=$redis_address \
--redis-password=$redis_password \
--node-ip-address=$worker_node_ip \
--node-manager-port=6007 \
--object-manager-port=6008 \
--min-worker-port=6100 --max-worker-port=6299
Thanks @roireshef, I'll try those next.
For now, I'm seeing issues with plain PPO on CartPole:
rllib train --run=PPO --env=CartPole-v0 --config='{"num_workers": 32}' --num-samples=100 --ray-address=auto -vv
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-30-0-248 is used (59.69 / 62.79 GB). The top 10 memory consumers are:
PID MEM COMMAND
73246 56.69GiB ray::PPO
Going to try upgrading tensorflow (realized this node was using 1.15 which is pretty old).
Update: this appears to be resolved with TF2.1.
It seems Ray nightly fixes the raylet 100% CPU bug now. I'm now running the following:
simple_tune.py
on a cluster of 20 m4.4xl workers (16 CPUs each), TensorFlow 2.1 / Ray nightly. Status: (1) ran OK up to 4k trials, eventually crashed with an unrelated Tune bug: https://github.com/ray-project/ray/issues/12227; (2) ?; (3) ?
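For reference, a minimal stand-in for a stress-test script of that shape (hypothetical; this is not the attached simple_tune.py):

import time
import ray
from ray import tune

ray.init(address="auto")

def trainable(config):
    # Trivial workload: report a few results and exit, so thousands of short
    # trials churn through the cluster.
    for step in range(3):
        time.sleep(1)
        tune.report(step=step, value=config["x"] * step)

tune.run(
    trainable,
    num_samples=10000,
    config={"x": tune.uniform(0, 1)},
    resources_per_trial={"cpu": 1},
)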
Here are the logs from our last run, which resulted in a crash of ~10 workers after the first train batch collection. I did not include logs from the workers because they contain a lot of lines logged by our environment. We used ray-1.1.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl, which btw seems to fail/freeze after the first optimization (the actors still collect data, but no further optimizations are triggered).
This is not a contribution.
Versions: Python 3.6.8, Ray 1.0, PyTorch 1.6, TensorFlow 1.15, OS: Ubuntu 18.04 (Docker)
Since upgrading to 0.8.7 and 1.0, we are experiencing multiple stability issues that result in jobs crashing with
The actor died unexpectedly before finishing this task
errors. Note that these issues are quite difficult to reproduce using the default environments provided by RLLib (it often takes over 40 hours with QBert), but with our custom environment they happen much earlier in the execution, sometimes as early as 4 minutes in, and they happen very consistently. We've never experienced anything like this with 0.8.5 or prior. Memory/resources shouldn't be the bottleneck. Even though our custom environments use more memory, we also use nodes with much larger memory capacity for their rollouts. We closely monitor them via Grafana to ensure that all usage falls well below what's available (i.e. overall memory usage is usually far below 50%). For every node, we assign 30% of the node's memory to the object store, which should be far more than enough given the experience/model size. Here's an example of the errors (produced by the script provided later):
Here's another variant of the error when running our own custom environment:
Here's the example script that produced the first error by training QBert with PPO. Note that it might take over 40 hours for the error to occur. The setup is a p3.2xlarge instance for the trainer, with the rollout workers on a c5.18xlarge instance; 30% of the memory on each instance is dedicated to the object store.
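A hypothetical stand-in for that script (the attached original is not reproduced here; the worker count and stopping condition are placeholders):

import ray
from ray import tune

# Object store sizing (30% of each node's memory) is applied at `ray start`
# time on every node, e.g. via --object-store-memory=<bytes>.
ray.init(address="auto")

tune.run(
    "PPO",
    config={
        "env": "QbertNoFrameskip-v4",
        "framework": "tf",
        "num_gpus": 1,        # trainer on the p3.2xlarge instance
        "num_workers": 64,    # rollout workers on the c5.18xlarge instance (placeholder count)
    },
    stop={"timesteps_total": 500_000_000},
)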
One of the things we tried while debugging the problem is storing references to all execution ops in memory, and somehow it helps. We discovered this mitigation almost accidentally while debugging our own execution plan. For instance, if we modify the PPO execution plan to also return all execution ops in a list that gets held in memory, then the time it takes for the job to crash increases significantly and we no longer get the same error. Instead, the error becomes
ray.exceptions.ObjectLostError: Object XXXXX is lost due to node failure
-- which seems to be caused by some node failing its heartbeat check. It's unclear whether our attempted mitigation is just a fluke, whether it points in the right direction for fixing the underlying problem, or whether these errors share the same underlying cause. Here's a modified script. Note that the new error is no longer guaranteed to be reproducible even when running for a long time, but with our environment it's quite consistent. In the worker logs, we would find the following message around the time we get the object lost error:
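A rough illustration of the keepalive idea described above (a hypothetical sketch, not the actual modified script and not the referenced worker-log message):

# Keep references to every intermediate op created while building an execution
# plan so that none of them can be garbage collected while training runs.
_KEEPALIVE = []

def keep(op):
    """Record a reference to an execution op and return it unchanged."""
    _KEEPALIVE.append(op)
    return op

# Inside a (hypothetical) execution plan, each stage would be wrapped, e.g.:
#   rollouts = keep(ParallelRollouts(workers, mode="bulk_sync"))
#   train_op = keep(rollouts.for_each(TrainOneStep(workers)))
#   return StandardMetricsReporting(train_op, workers, config)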
Further, sometimes (not always) the node that timed out shows a drastic, sharp increase (2-3x) in memory usage on our Grafana within several seconds near the end, which is far more than the amount of memory it should use. We attempted to mitigate this second error by increasing the num_heartbeats_timeout setting in --system_config, but it doesn't seem to make much difference.
None of these issues exist with the old optimizer scheme in 0.8.5 or earlier, and we can train with our custom environment for days without any issue. We also encounter a problem where, after a trial terminates, a new trial doesn't get started in certain cases (this can only be reproduced with our environments). It's unclear whether that's related to the issue above at all, and it's been hard to debug amid these other instability issues. We'll likely file a separate, more detailed bug report about it once this is addressed.
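For reference, one way such a heartbeat override can be passed on the head node (a minimal sketch; the exact parameter spelling has changed across Ray versions, so treat the names below as an assumption):

import ray

# Give slow nodes more time before they are marked dead. The value is
# illustrative; the equivalent can also be passed to `ray start` via the
# system-config JSON mentioned above.
ray.init(
    _system_config={
        "num_heartbeats_timeout": 300,
    }
)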