mobiusml / aana_sdk

Aana SDK is a powerful framework for building AI-enabled multimodal applications.
https://www.mobiuslabs.com/
Apache License 2.0

Add rate limiting/backoff based on SDK usage to the sync endpoints. #84

Closed ashwinnair14 closed 2 months ago

ashwinnair14 commented 4 months ago

Enhancement Description

Advantages

How is this solved normally by other projects? Add links

evanderiel commented 3 months ago

Review of four possibilities I found for access limiting with Ray Serve:

Config only:

1. Throttling: using declared deployment/machine resources. Doesn't work: it only affects the number of possible deployments, not how they handle load.
2. Throttling: setting target_num_ongoing_concurrent_requests. Doesn't work: it limits concurrent executions, but excess requests are queued instead of getting a 429.

Code solutions:

3. Rate limiting: add a decorator to the deployment inference function that implements rate limiting. Doesn't work: the rate-limited calls still wait for the earlier, non-rate-limited ones to complete before erroring.
4. Rate limiting: custom RequestHandler that implements e.g. a leaky bucket algorithm. Will work, but is probably the most work.

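
For reference, the leaky bucket idea in the last option can be sketched in plain Python. This is only an illustration, not the Aana SDK implementation and not a Ray Serve API: the `LeakyBucket` class, `try_acquire`, and the `capacity`/`leak_rate` parameters are hypothetical names, and wiring it into a request handler (returning a 429 when `try_acquire()` is false) is left out.

```python
import time
from typing import Callable


class LeakyBucket:
    """Leaky bucket used as a meter (hypothetical helper, not a Ray Serve class).

    Each accepted request adds one unit to the bucket; units drain at a
    fixed rate. A request is refused when adding it would overflow.
    """

    def __init__(
        self,
        capacity: int,
        leak_rate: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity    # max units held before refusing
        self.leak_rate = leak_rate  # units drained per second
        self._clock = clock         # injectable so the logic can be tested
        self._level = 0.0
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Drain whatever leaked out since the previous call.
        elapsed = now - self._last
        self._level = max(0.0, self._level - elapsed * self.leak_rate)
        self._last = now
        if self._level + 1 > self.capacity:
            return False  # caller would respond with HTTP 429 here
        self._level += 1
        return True
```

Because the check returns immediately instead of queueing, a refused call does not wait behind earlier requests, which is the behaviour options 2 and 3 could not provide.
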
movchan74 commented 3 months ago

How do we decide when we have to refuse requests?

evanderiel commented 3 months ago

The ideal scenario would be to decide based on runtime characteristics (for example, given X models, Y GPUs, and Z expected execution time, limit to Y/X requests per Z time), or even to adjust rate limits while running. For now we'll just use manually configured values.
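
As a rough sketch of the runtime-based rule above (Y/X requests per Z time), the limit could be computed like this. The function name and parameters are hypothetical, and it assumes every request takes about the same time and that models share the GPUs evenly:

```python
def requests_per_second(
    num_models: int, num_gpus: int, expected_seconds: float
) -> float:
    """Allowed request rate: (Y / X) requests per Z seconds, as requests/second.

    num_models       -- X, models sharing the hardware
    num_gpus         -- Y, GPUs available
    expected_seconds -- Z, expected execution time of one request
    """
    if num_models <= 0 or num_gpus < 0 or expected_seconds <= 0:
        raise ValueError("need num_models > 0, num_gpus >= 0, expected_seconds > 0")
    return (num_gpus / num_models) / expected_seconds
```

For example, 2 models on 4 GPUs with an expected execution time of 0.5 s gives 2 requests per 0.5 s, i.e. 4 requests per second.
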

movchan74 commented 2 months ago

@evanderiel This is done, right? Can we close the issue?

evanderiel commented 2 months ago

More could always be done, but yes