ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.36k stars 5.65k forks source link

GCS actor management introduces reference counting bug #8076

Closed ericl closed 4 years ago

ericl commented 4 years ago

What is the problem?

RLlib A3C tests are failing with eviction errors in master since #6763. An object required for an actor construction task seems to have been evicted in error. This probably got overlooked since travis skips RLlib tests on core changes.

cc @wumuzi520 @stephanie-wang can you guys look into this? #6763 seems slightly tricky to revert, so it seems better to try to fix forward if possible, but if this is hard I can make a revert PR.

Reproduction (REQUIRED)

You can run pytest -v -s test_supported_spaces.py to reproduce. To speed up the test, you can comment out all other tests except for test_a3c, which is the failing one:

=== Testing A3C (torch=False) A=Discrete(5) S=Box(210, 160, 3) ===
2020-04-17 18:11:43,721 INFO trainable.py:217 -- Getting current IP.
2020-04-17 18:11:43,721 WARNING util.py:37 -- Install gputil for GPU system monitoring.
2020-04-17 18:11:54,200 WARNING worker.py:1065 -- The task with ID ffffffffffffffffffffffff0100 is a driver task and so the object created by ray.put could not be reconstructed.
(pid=27937) unable to import 'smart_open.gcs', disabling that module
ray::RolloutWorker.par_iter_init() (pid=27937, ip=192.168.5.121)
  File "python/ray/_raylet.pyx", line 421, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 421, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
ray.exceptions.UnreconstructableError: Object ffffffffffffffffffffffff0100008809000000 is lost (either LRU evicted or deleted by user) and cannot be reconstructed. Try increasing the object store memory available with ray.init(object_store_memory=<bytes>) or setting object store limits with ray.remote(object_store_memory=<bytes>). See also: https://ray.readthedocs.io/en/latest/memory-management.html
Traceback (most recent call last):
  File "/tmp/x.py", line 112, in check_support
    a.train()
  File "/home/eric/Desktop/ray/python/ray/rllib/agents/trainer.py", line 502, in train
    raise e
  File "/home/eric/Desktop/ray/python/ray/rllib/agents/trainer.py", line 491, in train
    result = Trainable.train(self)
  File "/home/eric/Desktop/ray/python/ray/tune/trainable.py", line 261, in train
    result = self._train()
  File "/home/eric/Desktop/ray/python/ray/rllib/agents/trainer_template.py", line 142, in _train
    return self._train_exec_impl()
  File "/home/eric/Desktop/ray/python/ray/rllib/agents/trainer_template.py", line 174, in _train_exec_impl
    res = next(self.train_exec_impl)
  File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 634, in __next__
    return next(self.built_iterator)
  File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 644, in apply_foreach
    for item in it:
  File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 685, in apply_filter
    for item in it:
  File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 644, in apply_foreach
    for item in it:
  File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 670, in add_wait_hooks
    item = next(it)
  File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 644, in apply_foreach
    for item in it:
  File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 448, in base_iterator
    actor_set.init_actors()
  File "/home/eric/Desktop/ray/python/ray/util/iter.py", line 1002, in init_actors
    ray.get([a.par_iter_init.remote(self.transforms) for a in self.actors])
  File "/home/eric/Desktop/ray/python/ray/worker.py", line 1488, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::RolloutWorker.par_iter_init() (pid=27937, ip=192.168.5.121)
  File "python/ray/_raylet.pyx", line 421, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 421, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
ray.exceptions.UnreconstructableError: Object ffffffffffffffffffffffff0100008809000000 is lost (either LRU evicted or deleted by user) and cannot be reconstructed. Try increasing the object store memory available with ray.init(object_store_memory=<bytes>) or setting object store limits with ray.remote(object_store_memory=<bytes>). See also: https://ray.readthedocs.io/en/latest/memory-management.html

ERROR
stephanie-wang commented 4 years ago

Looks like it's probably because the GCS Service responds immediately to the caller of the actor creation task instead of waiting until the actor creation task has finished.