Open jppgks opened 7 months ago
Feature request

Have the (async) client automatically back off when the deployment is overloaded.

Motivation

When the async client exceeds the deployment's queue capacity / rate limits, it currently fails with

OverloadedError: Model is overloaded
Current thinking is to add more examples of batch inference so clients can avoid this issue. We can still look into backing off automatically in the future.
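Until the client handles this natively, callers can wrap requests in a retry loop with exponential backoff and jitter. A minimal sketch, assuming a locally defined `OverloadedError` standing in for the client's exception class (the real one would be imported from the client library), and a hypothetical async call passed in as `coro_fn`:

```python
import asyncio
import random

# Hypothetical stand-in for the client's OverloadedError; in practice
# this would be imported from the client library.
class OverloadedError(Exception):
    pass

async def with_backoff(coro_fn, *args, max_retries=5, base_delay=0.5, **kwargs):
    """Retry an async call with exponential backoff and jitter
    whenever the deployment reports it is overloaded."""
    for attempt in range(max_retries):
        try:
            return await coro_fn(*args, **kwargs)
        except OverloadedError:
            # Give up after the final attempt.
            if attempt == max_retries - 1:
                raise
            # Exponential backoff (0.5s, 1s, 2s, ...) plus random jitter
            # so many clients retrying at once don't re-overload the queue.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
```

The jitter is the important part for this issue: if every client backed off on the same fixed schedule, they would all retry simultaneously and hit the queue limit again.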