ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[ray] Async actor not working in local mode. #8359

Open allenyin55 opened 4 years ago

allenyin55 commented 4 years ago

What is the problem?

Async actor is not being recognized in local mode. cc: @ijrsvt

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction

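import ray

# Local mode runs tasks and actors serially in the driver process.
ray.init(local_mode=True)


@ray.remote
class test_actor:
    # An async method makes this an async actor.
    async def start(self):
        return 1


actor = test_actor.remote()
# Expected to return 1; in local mode it raises the error below instead.
ray.get(actor.start.remote())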

The error I'm getting

E0507 17:19:21.404151 381615552 core_worker.cc:1082] Pushed Error with JobID: 0100 of type: task with message: ray::test_actor.start() (pid=60331, ip=192.168.0.7)
  File "python/ray/_raylet.pyx", line 464, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 465, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1136, in ray._raylet.CoreWorker.store_task_outputs
  File "/Users/allenyin/Test/ray/python/ray/serialization.py", line 401, in serialize
    return self._serialize_to_msgpack(metadata, value)
  File "/Users/allenyin/Test/ray/python/ray/serialization.py", line 373, in _serialize_to_msgpack
    self._serialize_to_pickle5(metadata, python_objects)
  File "/Users/allenyin/Test/ray/python/ray/serialization.py", line 353, in _serialize_to_pickle5
    raise e
  File "/Users/allenyin/Test/ray/python/ray/serialization.py", line 350, in _serialize_to_pickle5
    value, protocol=5, buffer_callback=writer.buffer_callback)
  File "/Users/allenyin/Test/ray/python/ray/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/Users/allenyin/Test/ray/python/ray/cloudpickle/cloudpickle_fast.py", line 617, in dump
    return Pickler.dump(self, obj)
TypeError: can't pickle coroutine objects at time: 1.5889e+09

rkooo567 commented 4 years ago

The problem is that when the task is executed locally, the worker neither creates a Fiber nor marks the current actor as async. The latter is easy to solve, but the former makes the code path pretty messy, because it requires core_worker to have a FiberState class only for local mode...
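
For illustration, the TypeError in the traceback is what you would expect if the async method is called like a plain function and its return value (a coroutine object rather than 1) is handed to the serializer. A minimal sketch without Ray, using the standard pickle module as a stand-in for Ray's serializer (that substitution is an assumption, and TestActor here is a plain class, not a Ray actor):

```python
import pickle


class TestActor:
    async def start(self):
        return 1


# Calling an async method without awaiting it returns a coroutine object, not 1.
coro = TestActor().start()

try:
    pickle.dumps(coro)
    pickle_error = None
except TypeError as exc:
    # Coroutine objects cannot be pickled.
    pickle_error = exc
finally:
    # Close the coroutine so Python does not warn that it was never awaited.
    coro.close()

print(type(pickle_error).__name__, pickle_error)
```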

rkooo567 commented 4 years ago

@ijrsvt I think I can make a fix by tonight, so you don't need to work on this.

rkooo567 commented 4 years ago

Okay. I have been digging into this, and it is pretty tricky to fix because there's only one core worker in local mode. That means we cannot overwrite the core_worker state with async actor state. Fixing this requires a decent amount of refactoring, which I don't think is worth the time right now. Since you cannot use 0.8.5 until the next release anyway, I will postpone the fix to the next sprint and set the priority to P1.

ijrsvt commented 4 years ago

@rkooo567 I can help with it next sprint as well. @allenyin55 What is your use case for async tasks in local mode?

rkooo567 commented 4 years ago

@ijrsvt He needs to run his integration test in local mode, and the test contains an async actor.

ijrsvt commented 4 years ago

@rkooo567 Is this for a new integration test or an existing one? I don't know if there are a ton of use cases where local_mode and async actors will be used together. I'm not sure it fits in the definition of local_mode as emulating serial python?

rkooo567 commented 4 years ago

I guess @allenyin55 can answer that question better. But I believe it was a new one, and he said he has to use local mode. (btw, it worked when he used 0.8.4, and idk how)

I don't know much about the purpose of local mode, but my impression is that it is most useful when you want to reduce the test load (meaning mostly for unit / integration tests). If so, I believe it should return the same output as non-local mode for every API.

(Also, there could be an easy fix without using Fiber that just came to mind. We can probably talk about this offline if you think we should fix this issue.)

pcmoritz commented 4 years ago

I just did a bisection and this regression was introduced in https://github.com/ray-project/ray/pull/7670. We need to either fix it or give a better error message that async actors are not supported in local mode.

We use async actors in local mode for dependency injection during testing. Local mode makes sure that the test code runs in a single process, which allows us to mock certain methods in that process (which get called by Ray tasks).

rkooo567 commented 4 years ago

The fix could actually be pretty simple if we assume these 2 restrictions for local mode.

  1. All coroutines are scheduled and running synchronously (we don't actually support asynchronous operation in local mode).
  2. We don't support low-level asyncio APIs inside async actors (such as asyncio.get_event_loop()).

In this case, we just need to check whether the function is a coroutine function and, if so, run an event loop with the coroutine in the main thread until it is done. @pcmoritz @ijrsvt do you guys think this is a valid premise for local mode?
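
A minimal sketch of that idea in plain Python, without Ray. The helper name run_method_serially is hypothetical and is not part of Ray's API; it only illustrates "detect a coroutine function and drive it to completion on the calling thread". It creates a fresh event loop per call, which is one possible choice and consistent with restriction 2 above (code inside the actor should not rely on a shared loop).

```python
import asyncio
import inspect


def run_method_serially(method, *args, **kwargs):
    """Hypothetical helper: run `method` to completion on the calling thread."""
    if inspect.iscoroutinefunction(method):
        # Async method: create an event loop and block until the coroutine finishes.
        loop = asyncio.new_event_loop()
        try:
            return loop.run_until_complete(method(*args, **kwargs))
        finally:
            loop.close()
    # Regular method: call it directly.
    return method(*args, **kwargs)


class TestActor:
    async def start(self):
        return 1

    def ping(self):
        return "pong"


actor = TestActor()
result_async = run_method_serially(actor.start)
result_sync = run_method_serially(actor.ping)
print(result_async, result_sync)
```

With this approach every call blocks until its coroutine is done, so two async methods never interleave. That matches restriction 1, but it also means an actor method that waits on another call to the same actor would never make progress.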

ijrsvt commented 4 years ago

@rkooo567 I think that is a great idea. It fits with the logic of local mode being serial python. It may be worth renaming it 'serial' mode to make its intended use case more obvious.

ericl commented 4 years ago

Downgrading to P2 since this is not a common use case.

nirvana-msu commented 3 years ago

I would dispute that this is not a common use case. Sure, you're not gonna run Ray in local mode in production, but you may have to do it during development for debugging purposes. And since async actors are not supported in local mode, you simply cannot use them in your code. That is, unless you're prepared to maintain two versions of your code: one with async actors for production, and one with sync actors for development... who would want to do that?

It would not have been an issue for me if not for debugging requirements. #14005 could be an alternative solution here (that would enable seamless debugging of workers/actors in PyCharm in non-local mode).

bkatwal commented 2 years ago

I am facing the same problem while testing my actor. Earlier I used ray.init(), but unfortunately that does not record test coverage, so I switched to local_mode=True for testing. Now the problem is that my actor's remote function call never returns, because the actor code runs forever :(

dennymarcels commented 1 year ago

This feature would be greatly appreciated. I am trying to set up a WandbCallback for my Tune pipeline but cannot make it work in debug (local) mode because of the actor issue.

Vozf commented 4 months ago

I wouldn't say that's a rare use case, because you basically can't debug with PyCharm if you have any async actor. Strange that it's not prioritized and is already 4 years old.