ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

The actor died unexpectedly before finishing this task (Ray 1.7.0, SageMaker) #19885

Closed amirreza-one closed 2 years ago

amirreza-one commented 3 years ago

Hello,

I am running RLlib on SageMaker with 8 cores and have set num_workers to 7. After a long run I hit "The actor died unexpectedly before finishing this task."

{
    "env": "RiveRL-v1",
    "run": "PPO",
    "config": {
        "ignore_worker_failures": True,
        "gamma": 0.6,
        "num_sgd_iter": 5,
        "lr": 0.0001,
        "sgd_minibatch_size": 32768,
        "train_batch_size": 100000,
        "use_gae": False,
        "num_workers": (self.num_cpus - 1),
        "num_gpus": self.num_gpus,
        "batch_mode": "complete_episodes",
        "env_config": {
            "window_size": 25,
            "max_allowed_loss": 0.2
        },
        "observation_filter": "MeanStdFilter",
        "entropy_coeff": 0.01,
    },
    "checkpoint_freq": 2,
}
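
For reference, here is a minimal sketch of how a spec like the one above is typically launched in Ray 1.x with Tune. This is an illustration under assumptions, not code from my project: it presumes "RiveRL-v1" has been registered via ray.tune.registry.register_env and that the instance exposes 8 CPUs and no GPU.

import ray
from ray import tune

# Sketch only: resources assumed from the issue description (8 cores, no GPU).
ray.init(num_cpus=8)

tune.run(
    "PPO",
    config={
        "env": "RiveRL-v1",            # assumed to be registered with register_env
        "ignore_worker_failures": True,
        "num_workers": 7,              # 8 cores - 1 for the trainer/driver process
        "num_gpus": 0,
        # ... remaining keys as in the config dict above ...
    },
    checkpoint_freq=2,
)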

Failure # 1 (occurred at 2021-10-20_18-35-15)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 467, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 431, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1517, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

but whenever I change num_workers to 1, the problem goes away. Any idea how I can fix this issue?

moamenibrahim commented 3 years ago

I get the same error while running Ray 1.8.0 on Python 3.8:

cat monitor.err:

Unable to use a TTY - input is not a terminal or the right kind of file
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Unable to use a TTY - input is not a terminal or the right kind of file
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
[2021-11-25 07:36:30,646 I 202 202] global_state_accessor.cc:394: This node has an IP address of 10.244.0.41, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container

cat worker-{..}-01000000-707.err:

2021-11-25 07:37:08,178 ERROR trial_runner.py:924 -- Trial trial_1: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 890, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 788, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 1627, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

These are the only errors I get; I'm not sure what the root cause is.

worldveil commented 2 years ago

@moamenibrahim do you have an example repro script you can provide?

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 2 years ago

Hi again! This issue will be closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!