ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Serve] Sequential ServeController.get_deployment_info calls impair request concurrency #35165

Open martalist opened 1 year ago

martalist commented 1 year ago

What happened + What you expected to happen

Nested @serve.deployment deployments incur sequential overhead when called by their parent; for each child, ServeController.get_deployment_info is called twice. This harms concurrency and increases overall request latency.

In my own application I have observed parent handle_request latency of up to 8 times the (concurrent) child latency (in the dashboard metrics). See the example timeline below, showing the sequential get_deployment_info calls:

[timeline image]

Given that the Router.infer method below is written for concurrency, I'd expect the Serve framework to handle its background tasks concurrently, too. Having inference latency so much higher for the parent than for the children is a blocker for my application, particularly as adding more child workers equates to more preprocessing latency.

Versions / Dependencies

Ubuntu Jammy, ray 2.4.0

Reproduction script

import asyncio

import ray
from fastapi import FastAPI
from pydantic import BaseModel
from ray import serve

app = FastAPI()

class Data(BaseModel):
    string: str

@serve.deployment(num_replicas=1)
class Worker:
    def __init__(self):
        self.counter = 0  # state, per worker

    async def batch_infer(self) -> int:
        # Simulate inference work without blocking the event loop.
        await asyncio.sleep(0.01)
        self.counter += 1
        return self.counter

@serve.deployment(route_prefix="/", num_replicas=1)
@serve.ingress(app)
class Router:
    def __init__(self, workers: list[Worker]):
        self._workers = workers  # handles to the child Worker deployments

    @app.post("/infer")
    async def infer(self, data: Data) -> int:
        result_refs = await asyncio.gather(
            *(worker.batch_infer.remote() for worker in self._workers),
        )
        results = ray.get(result_refs)
        return sum(results)

# Deployment graph: one Router fanning out to four Worker replicas.
workers = [Worker.bind() for _ in range(4)]
router = Router.bind(workers)

Executed with RAY_PROFILING=1 serve run issue:router (with the script above saved as issue.py), requests made with your favourite HTTP lib, and the timeline captured with ray timeline.
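For reference, a minimal client sketch for driving the endpoint with concurrent requests (this assumes the requests library and Serve's default HTTP port 8000; adjust host/port as needed):

import concurrent.futures

import requests

def post_infer(i: int) -> int:
    # POST a trivial payload to the Router's /infer route.
    resp = requests.post("http://127.0.0.1:8000/infer", json={"string": f"req-{i}"})
    resp.raise_for_status()
    return resp.json()

# Fire a batch of concurrent requests so the timeline shows overlapping handle_request spans.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    print(sum(pool.map(post_infer, range(32))))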

Issue Severity

High: It blocks me from completing my task.

sihanwang41 commented 1 year ago

Hi @martalist, do you mind giving the master branch a try? We recently refactored the handle construction a little, which potentially mitigates this. Let me know if the issue still exists and I can take a deeper look. Thank you very much for posting this question.

martalist commented 1 year ago

Hi @sihanwang41, we will test the master branch and report back asap. Thank you for making this a priority on your end!

martalist commented 1 year ago

Using the Linux Python 3.10 (x86_64) build, an exception is raised when running ray timeline:

[2023-05-26 09:30:18]  INFO ray.scripts.scripts::Connecting to Ray instance at 172.24.0.11:6379.
[2023-05-26 09:30:18]  INFO ray._private.worker::Connecting to existing Ray cluster at address: 172.24.0.11:6379...
[2023-05-26 09:30:18]  INFO ray._private.worker::Connected to Ray cluster. View the dashboard at 172.24.0.11:8053
Traceback (most recent call last):
  File "/repo/.venv/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/repo/.venv/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2462, in main
    return cli()
  File "/repo/.venv/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/repo/.venv/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/repo/.venv/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/repo/.venv/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/repo/.venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/repo/.venv/lib/python3.10/site-packages/ray/scripts/scripts.py", line 1827, in timeline
    ray.timeline(filename=filename)
  File "/repo/.venv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/repo/.venv/lib/python3.10/site-packages/ray/_private/state.py", line 850, in timeline
    return state.chrome_tracing_dump(filename=filename)
  File "/repo/.venv/lib/python3.10/site-packages/ray/_private/state.py", line 446, in chrome_tracing_dump
    profile_events = self.profile_events()
  File "/repo/.venv/lib/python3.10/site-packages/ray/_private/state.py", line 218, in profile_events
    event = common_pb2.TaskEvents.FromString(task_events[i])
AttributeError: module 'ray.core.generated.common_pb2' has no attribute 'TaskEvents'

Same result when using Ray wheels master/ce16a2e82feb475f09e069905da71933a3e90654.

Unfortunately, this blocks me from testing further.

martalist commented 1 year ago

@sihanwang41 I have tested with 2.6.3. I notice that half of the sequential get_deployment_info calls have been replaced with parallel get_num_ongoing_requests calls. Overall, for the above example, latency remains roughly the same, due to (what I presume is) added overhead from handle_request_streaming.

[timeline image]

There does not seem to be any way to parallelise the remaining sequential get_deployment_info calls client side. Can this be optimised within Ray Serve?
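For context, the closest thing to a client-side mitigation I can sketch is a concurrent warm-up of the worker handles on the first request (building on the reproduction script above). This is only a sketch: it assumes the remaining lookups are triggered lazily on first use of each handle, which may not hold if they occur on every request, and the dummy calls also increment the workers' counters:

@serve.deployment(route_prefix="/", num_replicas=1)
@serve.ingress(app)
class Router:
    def __init__(self, workers: list[Worker]):
        self._workers = workers
        self._warmed_up = False

    async def _warm_up(self) -> None:
        # One concurrent dummy call per worker, so that any lazy handle
        # initialisation happens here rather than inside a real request.
        refs = await asyncio.gather(
            *(worker.batch_infer.remote() for worker in self._workers),
        )
        ray.get(refs)
        self._warmed_up = True

    @app.post("/infer")
    async def infer(self, data: Data) -> int:
        if not self._warmed_up:
            await self._warm_up()
        result_refs = await asyncio.gather(
            *(worker.batch_infer.remote() for worker in self._workers),
        )
        return sum(ray.get(result_refs))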