I think this is expected, since the latest wheels are currently using an unstable backend. Can you try using 0.5.3 instead? FYI @stephanie-wang @robertnishihara
Looks like this is a memory corruption issue in the ray backend, caused by #2959. We'll push a fix in the 0.6 release.
Great catch 👍
Sorry for the delay; I haven't been able to look into this until now. After installing the latest wheel via pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.5.3-cp36-cp36m-manylinux1_x86_64.whl and pulling the code via git clone https://github.com/ray-project/ray.git, the issue seems to persist. I've been trying to run IMPALA on two nodes and get a similar error:
Remote function train failed with:
Traceback (most recent call last):
File "/home//miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/worker.py", line 848, in _process_task
*arguments)
File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
method_returns = method(actor, *args)
File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 319, in train
return Trainable.train(self)
File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
result = self._train()
File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/rllib/agents/impala/impala.py", line 103, in _train
self.config["collect_metrics_timeout"])
File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/rllib/optimizers/policy_optimizer.py", line 102, in collect_metrics
timeout_seconds=timeout_seconds)
File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/rllib/evaluation/metrics.py", line 38, in collect_episodes
metric_lists = ray.get(collected)
File "/home/miniconda3/envs/rayrl_4d49/lib/python3.6/site-packages/ray/worker.py", line 2353, in get
raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(0100000046def8798cbcf87b647e8bedb82f93b5). It was created by remote function <unknown> which failed with:
Remote function <unknown> failed with:
Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.
A worker died or was killed while executing task 0000000079065e620b1b20ea1f90e64592cbddbb.
A worker died or was killed while executing task 00000000da5aa0b1d59e98aa26474ef5cd995f8f.
A worker died or was killed while executing task 000000005a21e4ff1f383c81461f875892acd5ed.
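For reference, the run is launched roughly as follows (a minimal sketch against the 0.6-era RLlib Python API; the redis address, environment name, and worker count below are placeholders rather than the exact settings used):

```python
# Minimal sketch of a two-node IMPALA run (0.6-era API); values are placeholders.
import ray
from ray.rllib.agents.impala import ImpalaAgent

# Connect to the existing cluster (head node started with `ray start --head`).
ray.init(redis_address="HEAD_NODE_IP:6379")  # placeholder address

agent = ImpalaAgent(
    env="PongNoFrameskip-v4",   # placeholder environment
    config={
        "num_workers": 16,      # placeholder; workers are spread across both nodes
    },
)

for _ in range(1000):
    result = agent.train()      # this is the call that eventually raises RayGetError
    print(result)
```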
I'd imagine that this is related; should I open a new issue?
Thanks!!!
Also, I should mention that I'm unable to get version 0.5.3 to work due to issue https://github.com/ray-project/ray/issues/3334. It might be possible to fix it in the old source code, but I'd prefer working on master.
Issue #3334 is being fixed in https://github.com/ray-project/ray/pull/3379. Note that you should be able to just downgrade your version of Redis in the meantime.
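If it helps, a quick way to check which redis-py client version an environment is picking up (assuming https://github.com/ray-project/ray/issues/3334 is the redis-py 3.x incompatibility; this snippet is just illustrative, not part of Ray) is:

```python
# Print the installed redis-py client version; a 3.x version here would need to be
# pinned back to a 2.10.x release until the fix in the linked PR lands.
import redis
print(redis.__version__)
```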
Got it! The latest issue I'm seeing (Remote function <unknown> failed with: Invalid return value: ...) is orthogonal to https://github.com/ray-project/ray/issues/3334, since I ran with the latest source code, in which https://github.com/ray-project/ray/issues/3334 is fixed.
I've been trying to run APE-X on 4 nodes similar to p3.16xlarge on AWS running CentOS. Ray is installed via pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.5.3-cp36-cp36m-manylinux1_x86_64.whl. The settings are attached. For some reason, APE-X seems to die unexpectedly after roughly 5-10 minutes. I've attached the error messages and the debug information from just before it dies. Any idea what's going on here?
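Since the "worker died or was killed" message suggests checking dmesg, one thing that can help narrow this down is watching memory on each node while the run is going, to see whether the OOM killer is involved. A minimal sketch using psutil (not a Ray API; the polling interval is arbitrary):

```python
# Poll node memory every 10 seconds while APE-X runs, to spot OOM-killer pressure.
import time
import psutil

while True:
    mem = psutil.virtual_memory()
    print("mem used: %.1f / %.1f GiB (%.0f%%)"
          % (mem.used / 2**30, mem.total / 2**30, mem.percent))
    time.sleep(10)
```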