ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Serve] performance bottlenecked by the ProxyActor #42565

Open ShenJiahuan opened 8 months ago

ShenJiahuan commented 8 months ago

What happened + What you expected to happen

Issue Summary

We've identified a potential performance bottleneck within Ray Serve due to the limitation of having a single ProxyActor per node. This architecture may be hindering scalability and maximum request handling capacity.

Performance Test Details

To evaluate the performance implications, we conducted a test using the setup described in the reproduction script below (a trivial deployment with 48 replicas, load-tested with ApacheBench at a concurrency of 100).

The test results indicated that the system is currently capped at approximately 70 requests per second.

Source Code Review and Concerns

A review of the Ray source code shows that, by design, a single ProxyActor per node is responsible for handling all incoming requests, replica selection, and other associated tasks. This centralized approach is likely what prevents Ray Serve from scaling effectively with the available hardware resources.
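(As a side note for anyone reproducing this: one rough way to check whether the proxy, rather than the replicas, is the limiting factor is to bypass HTTP and call the deployment through a handle. The sketch below assumes the reproduction script from this issue and a recent Serve handle API; the exact handle call/result methods vary between Ray versions, so treat it as illustrative rather than exact.)

# bench_handle.py -- rough sketch, not part of the original report.
# Drives the same deployment through an in-process handle (no HTTP proxy)
# so the result can be compared against the ApacheBench numbers below.
import time

from ray import serve

from test import model  # the app builder from the reproduction script below

handle = serve.run(model({}))  # deploy and get a handle that bypasses the proxy

n = 1000
start = time.time()
responses = [handle.remote(None) for _ in range(n)]  # __call__ ignores its request argument
for r in responses:
    r.result()  # on older Ray versions this would be ray.get(r) instead
elapsed = time.time() - start
print(f"{n / elapsed:.1f} requests/sec via deployment handle")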

Proposed Discussion Points

I believe addressing this bottleneck could significantly improve Ray Serve's performance and scalability. I look forward to the community's input and potential solutions.

Versions / Dependencies

Ray 2.9.1

Reproduction script

Test Code Snippet

To reproduce the performance issue, please use the following code which sets up a Ray Serve deployment with 48 replicas:

import ray
from ray import serve
from starlette.requests import Request


# A trivial deployment with 48 replicas; each request returns a constant string,
# so any throughput ceiling comes from Serve itself rather than the model.
@serve.deployment(
    num_replicas=48,
    route_prefix="/",  # note: on newer Ray versions route_prefix may need to be set at the application level instead
)
class TestDeployment:
    async def __call__(self, request: Request):
        return "hi"


# Application builder used by `serve run test:model`.
def model(args):
    return TestDeployment.bind()

Starting Ray Serve

Ray Serve is initiated with the following command:

serve run test:model

Load Testing

For stressing the service and measuring the throughput, we are using ApacheBench with the following command:

ab -n 1000 -c 100 http://127.0.0.1:8000/
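(If ab itself is ever a concern, roughly the same load can be generated from Python with an async HTTP client; httpx here is just an assumption, any async client works.)

# load_test.py -- rough Python equivalent of the ab command above (assumes httpx is installed).
import asyncio
import time

import httpx

URL = "http://127.0.0.1:8000/"
TOTAL = 1000
CONCURRENCY = 100

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)  # cap in-flight requests, like ab -c 100
    async with httpx.AsyncClient() as client:
        async def one():
            async with sem:
                await client.get(URL)
        start = time.time()
        await asyncio.gather(*(one() for _ in range(TOTAL)))
        print(f"{TOTAL / (time.time() - start):.1f} requests/sec")

asyncio.run(main())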

Issue Severity

High: It blocks me from completing my task.

edoakes commented 8 months ago

@ShenJiahuan thanks for filing the issue! We're aware that the proxy is the bottleneck on each node. The reason for this design decision is that the proxy includes quite a bit of intelligence to handle dynamic service discovery/routing, features like model multiplexing, and edge cases like properly draining requests upon node removal (e.g., spot instance interruption).

We've done some work to ensure performance is within a reasonable bound, but we're aware that this may be a dealbreaker for some use cases. Our likely path to improving this bottleneck would be to optimize the proxy's performance (reduce overhead from Ray actor calls, consider implementing it in a faster language like C++) rather than rearchitect the system to remove it.

However, the 70 QPS number you've cited here is dramatically lower than what we see in our microbenchmarks. I re-ran this benchmark on my laptop (2021 MacBook Pro with M1 Max) on the master branch and saw ~800 QPS:

(ray) ➜  ray git:(master) ✗ ab -n 10000 -c 100 http://127.0.0.1:8000/

This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:        uvicorn
Server Hostname:        127.0.0.1
Server Port:            8000

Document Path:          /
Document Length:        12 bytes

Concurrency Level:      100
Time taken for tests:   12.200 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      2060000 bytes
HTML transferred:       120000 bytes
Requests per second:    819.64 [#/sec] (mean)
Time per request:       122.005 [ms] (mean)
Time per request:       1.220 [ms] (mean, across all concurrent requests)
Transfer rate:          164.89 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.6      0       4
Processing:    17  121  24.2    115     214
Waiting:       15  117  24.3    112     210
Total:         18  121  24.2    116     215
WARNING: The median and mean for the initial connection time are not within a normal deviation
        These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
  50%    116
  66%    123
  75%    130
  80%    137
  90%    157
  95%    174
  98%    184
  99%    192
 100%    215 (longest request)

What instance type did you run the benchmark on? Are you able to re-run it on one that is publicly available such as an AWS instance type?

ShenJiahuan commented 8 months ago

@edoakes, thank you for the quick and detailed response!

Upon reevaluating our setup, I discovered a misconfiguration on our end that constrained the performance. 🥲 After addressing this, I am now observing a throughput of approximately 280 QPS on our corrected setup.

Furthermore, I conducted a test on an AWS t3.2xlarge instance (8-core Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz), which yielded around 175 QPS. While this is a marked improvement over the initial 70 QPS, it seems that we could still encounter scalability limitations under high-load scenarios.

xjhust commented 7 months ago

@edoakes, thank you for the quick and detailed response!

Upon reevaluating our setup, I discovered a misconfiguration on our end that constrained the performance. 🥲 After addressing this, I am now observing a throughput of approximately 280 QPS on our corrected setup.

Furthermore, I conducted a test on an AWS t3.2xlarge instance (8-core Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz), which yielded around 175 QPS. While this is a marked improvement over the initial 70 QPS, it seems that we could still encounter scalability limitations under high-load scenarios.

I'm hitting this issue in my task too. I re-ran the script you provided and got a throughput of approximately 220 QPS. Can you tell me what misconfiguration you adjusted to improve the performance? Thanks.

FlyTOmeLight commented 7 months ago

@edoakes Hi! I would like to know what factors might affect the performance of the ProxyActor, such as the server configuration or the complexity of the application. If there are multiple nodes, is the improvement linear? We are hoping to use Ray Serve in a production environment, and given our current observations this performance issue is very important to us. Do you know when there might be significant progress on this?

ShenJiahuan commented 7 months ago

I'm hitting this issue in my task too. I re-ran the script you provided and got a throughput of approximately 220 QPS. Can you tell me what misconfiguration you adjusted to improve the performance? Thanks.

@xjhust, a throughput of 220 QPS appears to be reasonable and does not necessarily indicate a misconfiguration, especially when compared with my subsequent experiments. The performance is closely tied to the single-core performance of the CPU. The 800 QPS achieved by @edoakes is likely due to the superior single-core performance of the M1 Max. In fact, our server's CPU was previously set to 'powersave' mode, which resulted in a reduced performance of approximately 70 QPS.
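For anyone hitting the same thing, a quick way to sanity-check the governor before benchmarking is to read it from sysfs; this is only a sketch and assumes the standard Linux sysfs paths:

# check_governor.py -- prints the CPU frequency governor per core (Linux only).
# A "powersave" setting here can easily cut benchmark throughput by several times.
import glob

for path in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")):
    with open(path) as f:
        print(path.split("/")[5], f.read().strip())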

edoakes commented 7 months ago

I'm hitting this issue in my task too. I re-ran the script you provided and got a throughput of approximately 220 QPS. Can you tell me what misconfiguration you adjusted to improve the performance? Thanks.

@xjhust, a throughput of 220 QPS appears to be reasonable and does not necessarily indicate a misconfiguration, especially when compared with my subsequent experiments. The performance is closely tied to the single-core performance of the CPU. The 800 QPS achieved by @edoakes is likely due to the superior single-core performance of the M1 Max. In fact, our server's CPU was previously set to 'powersave' mode, which resulted in a reduced performance of approximately 70 QPS.

Yes, this makes sense given that the proxy is currently a Python process, which is largely single-threaded.
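(A rough way to confirm this locally is to watch the proxy process while ab is running and check that it pins a single core. The sketch below assumes psutil is installed; matching on "ProxyActor" in the process command line is an assumption and may need adjusting across Ray versions.)

# proxy_cpu.py -- rough sketch: report CPU usage of Serve proxy processes while ab runs.
# Assumes psutil is installed; the "ProxyActor" name match is a guess
# (e.g., older Ray versions used "HTTPProxyActor").
import psutil

for p in psutil.process_iter(["pid", "name", "cmdline"]):
    text = " ".join(p.info["cmdline"] or []) + (p.info["name"] or "")
    if "ProxyActor" in text:
        print(p.info["pid"], f"{p.cpu_percent(interval=1.0):.0f}% CPU")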

edoakes commented 7 months ago

@edoakes Hi! I would like to know what factors might affect the performance of the ProxyActor, such as the server configuration or the complexity of the application. If there are multiple nodes, is the improvement linear? We are hoping to use Ray Serve in a production environment, and given our current observations this performance issue is very important to us. Do you know when there might be significant progress on this?

The overall throughput scaling is ~linear as you add more nodes to the cluster, since each node will have its own proxy (and they scale independently).

We are making incremental progress on this all the time (example), but I wouldn't expect an orders-of-magnitude improvement in the near future (i.e., the next few releases). Most of our users' workloads involve fairly heavyweight ML inference requests, so supporting very high QPS per node is not often the primary goal. We are instead focusing our efforts on other areas such as improved autoscaling, stability, and observability.

decadance-dance commented 4 months ago

Any updates? I went with Ray Serve but found that while my model's inference takes 0.5 s, the __call__ function takes 1 s, so it is 2x. I saw @edoakes mention that the team is focusing on other things, but I think this is also a very important aspect, because there are not a lot of alternatives (BentoML is very slow, Triton is quite buggy) for running models in production. So it would be great if devs like me could get performance improvements.

edoakes commented 4 months ago

Any updates? I went with Ray Serve but found that while my model's inference takes 0.5 s, the __call__ function takes 1 s, so it is 2x. I saw @edoakes mention that the team is focusing on other things, but I think this is also a very important aspect, because there are not a lot of alternatives (BentoML is very slow, Triton is quite buggy) for running models in production. So it would be great if devs like me could get performance improvements.

@decadance-dance given the limited information you have provided, I doubt that what you're seeing is the same issue described in this ticket. If you get in touch on the Ray Slack or file a separate issue and provide more details about your setup, we may be able to point out some improvements.

Superskyyy commented 2 months ago

Working on an investigation, will update here from our benchmark.

Superskyyy commented 2 months ago

I tested on a 16-core Linux machine with an i9-11900K; it might be a bit overpowered, so I moved the test to a typical cloud bare-metal machine at 2.6 GHz. (See results further down; both tests are based on Ray 2.32.0.)

[benchmark screenshot]

11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
    CPU family:          6
    Model:               167
    Thread(s) per core:  2
    Core(s) per socket:  8

FastAPI with 1/48 workers: [benchmark screenshots]

Cloud results (bare-metal machine with 128 cores, CPU @ 2.6 GHz):

I get ~180 QPS with latency of 500-1000 ms, which is pretty low.

Serve: [benchmark screenshot]

FastAPI: [benchmark screenshot]