Closed: ashwinnair14 closed this issue 2 months ago
Review of four possibilities I found for access limiting with Ray Serve:
Config only:
1. Throttling: using declared deployment/machine resources. Doesn't work; it only affects the number of possible deployments, not how they handle load.
2. Throttling: setting target_num_ongoing_concurrent_requests. Doesn't work; it limits concurrent executions, but excess requests are queued instead of returning a 429.
Code solutions:
3. Rate limiting: add a decorator to the deployment inference function that implements rate limiting. Doesn't work: the rate-limited calls still wait for the earlier, non-rate-limited ones to complete before erroring.
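For reference, here is a minimal sketch of the decorator idea from possibility 3, standalone and independent of Ray Serve. The names (`rate_limited`, `RateLimitExceeded`) are hypothetical; in a real deployment the exception would be mapped to an HTTP 429. As noted above, wrapping the inference function this way doesn't solve the problem, because calls already queued by Serve still run before the limiter can reject them.

```python
import time
import threading
from functools import wraps

class RateLimitExceeded(Exception):
    """Raised when the call budget is exhausted; a server would map this to HTTP 429."""

def rate_limited(max_calls: int, per_seconds: float):
    """Reject calls beyond max_calls within a sliding window of per_seconds."""
    lock = threading.Lock()
    calls = []  # timestamps of recently accepted calls

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            with lock:
                # Drop timestamps that have fallen out of the window.
                while calls and now - calls[0] > per_seconds:
                    calls.pop(0)
                if len(calls) >= max_calls:
                    raise RateLimitExceeded()
                calls.append(now)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@rate_limited(max_calls=2, per_seconds=1.0)
def infer(x):
    # Stand-in for the deployment's inference function.
    return x * 2
```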
How do we decide when we have to refuse requests?
The ideal scenario would be to decide based on runtime characteristics (something like: given X models, Y GPUs, and an expected execution time Z, limit to Y/X requests per Z time), or even adjust the rate limits while running, but for now we'll just use manually configured values.
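The heuristic above can be sketched as a small sizing helper. This is purely illustrative of the arithmetic in the comment (the function name and the floor-to-one fallback are my assumptions, not anything in Ray Serve):

```python
def requests_per_window(num_models: int, num_gpus: int, exec_time_s: float):
    """Sizing heuristic from the discussion: with X models sharing Y GPUs
    and an expected execution time of Z seconds, allow roughly Y/X requests
    per window of Z seconds (floored to at least 1 request)."""
    max_calls = max(1, num_gpus // num_models)
    return max_calls, exec_time_s

# e.g. 2 models sharing 4 GPUs at 0.5 s per request
# -> allow 2 requests per 0.5 s window
limit, window = requests_per_window(num_models=2, num_gpus=4, exec_time_s=0.5)
```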
@evanderiel This is done, right? Can we close the issue?
More could always be done, but yes