Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
RLlib A3C tests are failing with eviction errors in master since #6763. An object required for an actor construction task seems to have been evicted in error. This probably got overlooked since travis skips RLlib tests on core changes.
cc @wumuzi520 @stephanie-wang can you guys look into this? #6763 seems slightly tricky to revert, so it seems better to try to fix forward if possible, but if this is hard I can make a revert PR.
Reproduction (REQUIRED)
You can run pytest -v -s test_supported_spaces.py to reproduce. To speed up the test, you can comment out all other tests except for test_a3c, which is the failing one:
=== Testing A3C (torch=False) A=Discrete(5) S=Box(210, 160, 3) ===
2020-04-17 18:11:43,721 INFO trainable.py:217 -- Getting current IP.
2020-04-17 18:11:43,721 WARNING util.py:37 -- Install gputil for GPU system monitoring.
2020-04-17 18:11:54,200 WARNING worker.py:1065 -- The task with ID ffffffffffffffffffffffff0100 is a driver task and so the object created by ray.put could not be reconstructed.
(pid=27937) unable to import 'smart_open.gcs', disabling that module
ray::RolloutWorker.par_iter_init() (pid=27937, ip=192.168.5.121)
File "python/ray/_raylet.pyx", line 421, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 421, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
ray.exceptions.UnreconstructableError: Object ffffffffffffffffffffffff0100008809000000 is lost (either LRU evicted or deleted by user) and cannot be reconstructed. Try increasing the object store memory available with ray.init(object_store_memory=<bytes>) or setting object store limits with ray.remote(object_store_memory=<bytes>). See also: https://ray.readthedocs.io/en/latest/memory-management.html
Traceback (most recent call last):
File "/tmp/x.py", line 112, in check_support
a.train()
File "/home/eric/Desktop/ray/python/ray/rllib/agents/trainer.py", line 502, in train
raise e
File "/home/eric/Desktop/ray/python/ray/rllib/agents/trainer.py", line 491, in train
result = Trainable.train(self)
File "/home/eric/Desktop/ray/python/ray/tune/trainable.py", line 261, in train
result = self._train()
File "/home/eric/Desktop/ray/python/ray/rllib/agents/trainer_template.py", line 142, in _train
return self._train_exec_impl()
File "/home/eric/Desktop/ray/python/ray/rllib/agents/trainer_template.py", line 174, in _train_exec_impl
res = next(self.train_exec_impl)
File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 634, in __next__
return next(self.built_iterator)
File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 644, in apply_foreach
for item in it:
File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 685, in apply_filter
for item in it:
File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 644, in apply_foreach
for item in it:
File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 670, in add_wait_hooks
item = next(it)
File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 644, in apply_foreach
for item in it:
File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 448, in base_iterator
actor_set.init_actors()
File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 1002, in init_actors
ray.get([a.par_iter_init.remote(self.transforms) for a in self.actors])
File "/home/eric/Desktop/ray/python/ray/worker.py", line 1488, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::RolloutWorker.par_iter_init() (pid=27937, ip=192.168.5.121)
File "python/ray/_raylet.pyx", line 421, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 421, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
ray.exceptions.UnreconstructableError: Object ffffffffffffffffffffffff0100008809000000 is lost (either LRU evicted or deleted by user) and cannot be reconstructed. Try increasing the object store memory available with ray.init(object_store_memory=<bytes>) or setting object store limits with ray.remote(object_store_memory=<bytes>). See also: https://ray.readthedocs.io/en/latest/memory-management.html
ERROR
Looks like it's probably because the GCS Service responds immediately to the caller of the actor creation task instead of waiting until the actor creation task has finished.
What is the problem?
RLlib A3C tests are failing with eviction errors in master since #6763. An object required for an actor construction task seems to have been evicted in error. This probably got overlooked since travis skips RLlib tests on core changes.
cc @wumuzi520 @stephanie-wang can you guys look into this? #6763 seems slightly tricky to revert, so it seems better to try to fix forward if possible, but if this is hard I can make a revert PR.
Reproduction (REQUIRED)
You can run
pytest -v -s test_supported_spaces.py
to reproduce. To speed up the test, you can comment out all other tests except fortest_a3c
, which is the failing one: