ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[Serve] Bring back optional RoundRobinReplicaScheduler #43549

Open vovochka404 opened 7 months ago

vovochka404 commented 7 months ago

Description

In our use case, a Ray cluster is used for a high-load personalized recommendation system.

But now we are stuck on version 2.7.1, because the RoundRobin implementation of the ReplicaScheduler was removed in later versions. With our request flow, the PowerOfTwoChoicesReplicaScheduler turns the HTTPProxyActor into a bottleneck, since it needs to make three remote requests to replicas for every incoming request.
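
For illustration, the difference in per-request round trips can be sketched roughly like this. This is a simplified sketch, not Ray's actual scheduler code; the Replica class and its queue_len()/handle() methods are hypothetical stand-ins, each call representing one proxy-to-replica round trip:

    import itertools
    import random

    class Replica:
        """Hypothetical stand-in for a Serve replica; each method call below
        represents one proxy -> replica round trip."""

        def __init__(self, name):
            self.name = name
            self.queue = 0

        def queue_len(self):
            return self.queue

        def handle(self, request):
            return f"{self.name} handled {request}"

    class RoundRobinScheduler:
        """One remote call per request: pick the next replica locally, then send."""

        def __init__(self, replicas):
            self._cycle = itertools.cycle(replicas)

        def assign(self, request):
            replica = next(self._cycle)      # purely local decision, no RTT
            return replica.handle(request)   # 1 remote call

    class PowerOfTwoChoicesScheduler:
        """Three remote calls per request: probe two replicas, then send."""

        def __init__(self, replicas):
            self._replicas = replicas

        def assign(self, request):
            a, b = random.sample(self._replicas, 2)
            # Two probe round trips to compare queue lengths...
            chosen = a if a.queue_len() <= b.queue_len() else b
            # ...plus the round trip for the actual request.
            return chosen.handle(request)    # 3 remote calls total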

Would it be worth bringing back the option to choose which scheduler to use, rather than tying the choice to specific use cases?

Use case

No response

edoakes commented 7 months ago

Hey @vovochka404, thanks for filing the issue; I can understand the frustration here. It's pretty unlikely that we will bring back the old replica scheduling technique, because it had some fundamental incompatibilities with the API: max_concurrent_queries was enforced per caller instead of per replica, which was very confusing and led to unintuitive, hard-to-configure autoscaling behavior.
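
To make the per-caller vs. per-replica distinction concrete, here is some illustrative arithmetic (made-up numbers, not Ray code):

    # Illustrative arithmetic only; the numbers are made up.
    max_concurrent_queries = 10
    num_callers = 8   # e.g. several proxies/handles, each with its own counter

    # Old (per-caller) enforcement: every caller independently allows 10
    # in-flight requests to the same replica, so a replica could receive up to:
    effective_limit_old = max_concurrent_queries * num_callers   # 80

    # New (per-replica) enforcement: the limit applies to the replica itself,
    # no matter how many callers there are:
    effective_limit_new = max_concurrent_queries                 # 10

    print(effective_limit_old, effective_limit_new)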

I have been working to improve the efficiency of the new scheduling technique and reduce the overheads you mentioned by adding caching that brings the number of RTTs on the fast path down to match the old technique: https://github.com/ray-project/ray/pull/42943. This will ship in the upcoming Ray 2.10 release (the branch cut is tomorrow, so optimistically out by the end of next week), but of course you can test it now using the nightly wheels.
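
A rough sketch of the general idea (heavily simplified, not the actual implementation in the PR; the cache TTL and the replica interface here are illustrative):

    import random
    import time

    CACHE_TTL_S = 0.1  # illustrative: how long a cached queue length is trusted

    class CachingPowerOfTwoScheduler:
        """Power-of-two-choices, but queue lengths are served from a local
        cache on the fast path so no extra probe RTTs are needed."""

        def __init__(self, replicas):
            self._replicas = replicas
            self._cache = {}  # replica -> (queue_len, timestamp)

        def _queue_len(self, replica):
            now = time.monotonic()
            entry = self._cache.get(replica)
            if entry is not None and now - entry[1] < CACHE_TTL_S:
                return entry[0]                # fast path: no probe RTT
            qlen = replica.queue_len()         # slow path: one probe RTT
            self._cache[replica] = (qlen, now)
            return qlen

        def assign(self, request):
            a, b = random.sample(self._replicas, 2)
            chosen = a if self._queue_len(a) <= self._queue_len(b) else b
            return chosen.handle(request)      # only 1 remote call when cached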

Please give it a go and let me know if it makes a difference for you. I believe we are also going to allocate time in the 2.11 release to reduce the proxy overheads in general, which should provide further benefit.

vovochka404 commented 6 months ago

I finally managed to test the update to 2.10.

[Screenshot 2024-04-24 at 16:17:11]

The production and testing configurations are quite similar, but testing is running version 2.10, while production still uses 2.7.1. Traffic is about 10k rpm at peak.

vovochka404 commented 6 months ago

And as can be seen here, one of the problems is a bottleneck at the ProxyActor:

[Screenshot 2024-04-24 at 16:25:26]

edoakes commented 6 months ago

I finally managed to test the update to 2.10. [Screenshot 2024-04-24 at 16:17:11]

The production and testing configurations are quite similar, but testing is running version 2.10, while production still uses 2.7.1. Traffic is about 10k rpm at peak.

It seems to me that the performance looks similar until the huge latency spikes that happen in the testing version around 14:00 and 15:00. Do you have any sense of what happened there? Was there a change in the traffic pattern and/or did you observe any errors in the logs?

vovochka404 commented 6 months ago

This is caused by small spikes in the number of requests. At this traffic level, the service running 2.10 cannot handle the load.

[Screenshot 2024-04-25 at 10:38:10]

edoakes commented 6 months ago

This is caused by small spikes in the number of requests. At this traffic level, the service running 2.10 cannot handle the load. [Screenshot 2024-04-25 at 10:38:10]

And do you see any warnings such as these in the logs?

    logger.warning(
        f"Replica at capacity of max_ongoing_requests={limit}, "
        f"rejecting request {request_metadata.request_id}.",
        extra={"log_to_stderr": False},
    )

This would indicate that the replicas are at capacity, which might increase load and tail latency on the proxy. I wouldn't expect it to cause latency spikes as large as the ones you're seeing, but it might indicate where the issue lies.
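
If those warnings do show up, one knob to experiment with is raising the per-replica limit on the deployment. A minimal sketch, with placeholder values rather than recommendations (the parameter is named max_ongoing_requests on recent Ray versions and max_concurrent_queries on older ones):

    from ray import serve

    # Placeholder limits for illustration; tune them for your own workload.
    @serve.deployment(max_ongoing_requests=32, num_replicas=4)
    class Recommender:
        async def __call__(self, request):
            # ... run the recommendation model here ...
            return {"ok": True}

    app = Recommender.bind()
    # serve.run(app)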