Open vovochka404 opened 7 months ago
Hey @vovochka404, thanks for filing the issue; I understand the frustration here. It's pretty unlikely that we will bring back the old replica scheduling technique because it had some fundamental incompatibilities with the API: for example, max_concurrent_queries
was enforced per caller instead of per replica, which was very confusing and led to unintuitive, hard-to-configure autoscaling behavior.
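To make the per-caller vs. per-replica distinction concrete, here is a toy admission-control sketch (not Ray's actual code; all names are illustrative). With a per-caller limit, the effective concurrency on a replica grows with the number of callers (e.g., proxies), while a per-replica limit caps the replica itself:

```python
from collections import defaultdict

LIMIT = 2  # a max_concurrent_queries-style cap (toy value)

def admits_per_caller(inflight_by_caller, caller):
    # Old-style accounting (roughly): each caller tracks its own count,
    # so total load on one replica can reach LIMIT * num_callers.
    return inflight_by_caller[caller] < LIMIT

def admits_per_replica(total_inflight):
    # New-style accounting: a single cap on the replica, regardless of caller.
    return total_inflight < LIMIT

# Two callers each send 2 requests to the same replica.
inflight_by_caller = defaultdict(int)
total_inflight = 0
admitted_per_caller = admitted_per_replica = 0
for caller in ("proxy-1", "proxy-2"):
    for _ in range(2):
        if admits_per_caller(inflight_by_caller, caller):
            inflight_by_caller[caller] += 1
            admitted_per_caller += 1
        if admits_per_replica(total_inflight):
            total_inflight += 1
            admitted_per_replica += 1

print(admitted_per_caller, admitted_per_replica)  # → 4 2
```

Under per-caller accounting the replica ends up with 4 in-flight requests despite a cap of 2, which is why autoscaling based on that cap was hard to reason about.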
I have been working to improve the efficiency of the new scheduling technique and reduce the overheads you mentioned by adding caching, which cuts the number of RTTs in the fast path down to parity with the old technique: https://github.com/ray-project/ray/pull/42943. This will ship in the upcoming Ray 2.10 release (the branch cut is tomorrow, so optimistically it will be out by the end of next week), but of course you can test it now using the nightly wheels.
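For intuition, the caching idea can be sketched like this (a simplified illustration of the general technique, not the actual implementation in the PR; all names are hypothetical):

```python
import time

class CachedProbe:
    """Toy cache of replica queue lengths with a short TTL, so the fast
    path can pick a replica without issuing fresh probe round trips."""

    def __init__(self, probe_fn, ttl_s=0.5):
        self.probe_fn = probe_fn  # makes one RTT to a replica
        self.ttl_s = ttl_s
        self._cache = {}  # replica -> (queue_len, timestamp)

    def queue_len(self, replica):
        entry = self._cache.get(replica)
        now = time.monotonic()
        if entry is not None and now - entry[1] < self.ttl_s:
            return entry[0]  # fast path: served from cache, no RTT
        value = self.probe_fn(replica)  # slow path: one RTT
        self._cache[replica] = (value, now)
        return value

# Demo: count how many real probes are issued.
probe_calls = {"n": 0}
def fake_probe(replica):
    probe_calls["n"] += 1
    return 1

cache = CachedProbe(fake_probe, ttl_s=60.0)
cache.queue_len("replica-a")
cache.queue_len("replica-a")  # within TTL: no second probe
print(probe_calls["n"])  # → 1
```

The trade-off is staleness: a longer TTL means fewer RTTs but queue-length estimates that lag behind reality, which is why the cache window has to stay short.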
Please give it a try and let me know if it makes a difference for you. I believe we are also going to allocate time in the 2.11 release to reduce overheads in the proxy in general, which should provide further benefit.
I finally managed to test the update to 2.10.
The production and testing configurations are quite similar, but testing runs 2.10 while production is still on 2.7.1. Traffic is about 10k rpm at peak.
As seen here, one of the problems is a bottleneck at the ProxyActor.
It seems to me that the performance looks similar until the huge latency spikes that happen in the testing version around 14:00 and 15:00. Do you have any sense of what happened there? Was there a change in the traffic pattern and/or did you observe any errors in the logs?
This is caused by small spikes in the number of requests. At this load level, the service running 2.10 cannot keep up with the traffic.
And do you see any warnings such as these in the logs?
```python
logger.warning(
    f"Replica at capacity of max_ongoing_requests={limit}, "
    f"rejecting request {request_metadata.request_id}.",
    extra={"log_to_stderr": False},
)
```
This would indicate that the replicas are at capacity, which might increase load and tail latency on the proxy. I wouldn't expect it to cause latency spikes as large as the ones you're seeing, but it might point to where the issue lies.
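If it helps while digging through the logs, here is a small hypothetical helper (the function name and sample lines are mine, but the pattern matches the warning text quoted above) for counting these capacity rejections:

```python
import re

# Matches the "Replica at capacity" warning emitted when a replica
# rejects a request because it hit max_ongoing_requests.
CAPACITY_PATTERN = re.compile(r"Replica at capacity of max_ongoing_requests=(\d+)")

def count_capacity_rejections(log_lines):
    """Return how many log lines report a capacity-based rejection."""
    return sum(1 for line in log_lines if CAPACITY_PATTERN.search(line))

sample = [
    "INFO ... handling request abc",
    "WARNING ... Replica at capacity of max_ongoing_requests=5, rejecting request xyz.",
]
print(count_capacity_rejections(sample))  # → 1
```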
Description
In our use case, the Ray cluster serves a high-load personal recommendation system.
We are currently stuck on version 2.7.1 because later versions removed the RoundRobin implementation of the ReplicaScheduler. With our request flow, PowerOfTwoChoicesReplicaScheduler turns the HTTPProxyActor into a bottleneck, since for every request it needs to make three (!) remote calls to replicas.
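The cost difference between the two strategies can be sketched in a few lines (a toy model, not Ray's implementation; replica names and queue lengths are invented). Round robin picks a replica with no extra probes, while power-of-two-choices samples two replicas and compares their queue lengths, and those probes are the extra round trips:

```python
import itertools
import random

replicas = ["r1", "r2", "r3"]
queue_len = {"r1": 0, "r2": 3, "r3": 1}  # pretend probe results

# Round robin: deterministic rotation, zero probe round trips.
_rr = itertools.cycle(replicas)
def pick_round_robin():
    return next(_rr)

# Power of two choices: sample two replicas, probe both queue lengths,
# and route to the shorter queue. Each probe is an extra round trip.
def pick_power_of_two(rng):
    a, b = rng.sample(replicas, 2)
    return a if queue_len[a] <= queue_len[b] else b

order = [pick_round_robin() for _ in range(3)]
print(order)  # → ['r1', 'r2', 'r3']

rng = random.Random(0)
picks = [pick_power_of_two(rng) for _ in range(10)]
print(picks)  # "r2" (the longest queue) is never chosen
```

The upside of power-of-two-choices is load awareness: the most loaded replica is avoided. The downside, as described above, is that every request pays for the probes, which concentrates overhead in the proxy.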
Maybe it’s worth bringing back the option to choose which scheduler to use, rather than being tied to one that fits only specific use cases?
Use case
No response