Open · psydok opened this issue 6 months ago
Can you take a look at this, @edoakes?

Also @psydok, it might be faster to get an answer to this on #serve in the Ray Slack; can you post there and we can continue the discussion here? That would be quicker to resolve for you. At first skim this reads like a Serve config issue whose tuning could improve your QPS, but I defer to Ed.
We tried different replica load configurations; nothing helps. The service still works faster if we start separate instances and combine them behind nginx with round-robin balancing. Can you please tell us how soon we can expect to be able to configure round-robin routing in Ray Serve?
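For reference, this is roughly the kind of replica load configuration we tried (a simplified sketch against the Ray 2.10 Python API; the deployment body and values are placeholders). As far as I understand, the routing policy itself is not user-configurable in 2.10, and `max_concurrent_queries` is the main per-replica load knob:

```python
from ray import serve

@serve.deployment(
    num_replicas=12,
    # Cap on in-flight requests per replica; the router avoids replicas
    # that are at this cap, so lower values spread load more evenly.
    max_concurrent_queries=2,
    ray_actor_options={"num_gpus": 1},
)
class Inference:
    async def __call__(self, payload: bytes) -> bytes:
        ...  # placeholder for the actual model call

app = Inference.bind()
serve.run(app)
```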
What happened + What you expected to happen
I have two gRPC deployments: the first processes audio (very fast, roughly 10-100 ms; I don't know exactly), then sends a request to the inference deployment (300-700 ms). I have deployed 12 replicas. From the Grafana graphs I can see that QPS is 5-6. The topology is sketched below.
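A simplified sketch of that two-stage topology (names and bodies are placeholders; this assumes Ray 2.10's `DeploymentHandle` API):

```python
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment(num_replicas=12)
class InferenceModel:
    async def __call__(self, features: bytes) -> bytes:
        ...  # the 300-700 ms model call

@serve.deployment
class AudioPreprocessor:
    def __init__(self, inference: DeploymentHandle):
        self._inference = inference

    async def __call__(self, audio: bytes) -> bytes:
        features = audio  # the 10-100 ms audio processing would go here
        # Requests queue between the two stages at this handle call.
        return await self._inference.remote(features)

app = AudioPreprocessor.bind(InferenceModel.bind())
```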
But I still get relatively many timeouts compared to deploying 5 separate replicas linked by nginx. And at the 95th percentile, requests often take up to 5 seconds to complete for some reason.
I don't understand at all how I should configure the load. How do I tell the replicas to accept requests round-robin, other than by scaling out? An example of the settings I've been experimenting with is below.
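These are the kinds of load settings I've been experimenting with (a sketch with example values; in Ray 2.10 the autoscaler targets `target_num_ongoing_requests_per_replica` in-flight requests per replica, while `max_concurrent_queries` is the hard per-replica cap):

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 3,
        "max_replicas": 12,
        # Average number of in-flight requests the autoscaler tries to
        # keep on each replica.
        "target_num_ongoing_requests_per_replica": 2,
    },
    # Hard cap; a replica at this limit stops receiving new requests.
    max_concurrent_queries=4,
)
class Model:
    async def __call__(self, request: bytes) -> bytes:
        ...
```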
Currently I have 4 servers connected, each with 2-3 GPUs of different sizes (8-11 GB).
This problem wipes out all the gains from scaling, because the replicas end up being slow 10% of the time. Bursts of timeouts occur at moments when the load on the service is not growing. I've tried going through different hyperparameters; nothing works properly.
The same problem occurs with a second application we are trying to deploy via Ray. A YOLO model converted to ONNX does not produce more than 18 RPS starting from 3 replicas, and the 95th-percentile latency is always 2x higher under load.
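One lever worth noting for a throughput-bound ONNX model is Serve's dynamic request batching (`serve.batch`), which groups concurrent requests into a single model call. This is only a hypothetical sketch, not the reporter's actual code: the model path, output shape, and batching parameters are invented:

```python
from typing import List

import numpy as np
from ray import serve

@serve.deployment(num_replicas=3, ray_actor_options={"num_gpus": 1})
class YoloOnnx:
    def __init__(self):
        import onnxruntime as ort  # assumed dependency
        self._session = ort.InferenceSession("yolo.onnx")  # hypothetical path

    # Batches concurrent requests into one model call to amortize overhead.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.01)
    async def _infer(self, images: List[np.ndarray]) -> List[np.ndarray]:
        batch = np.stack(images)
        outputs = self._session.run(None, {"images": batch})[0]  # assumes one output
        return list(outputs)

    async def __call__(self, image: np.ndarray) -> np.ndarray:
        return await self._infer(image)

app = YoloOnnx.bind()
```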
Versions / Dependencies
ray==2.10.0
python==3.11.8
Reproduction script
Issue Severity
High: It blocks me from completing my task.