predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0

Async client to backoff when model overloaded #412

Open jppgks opened 7 months ago

jppgks commented 7 months ago

Feature request

Have the (async) client automatically back off from sending requests when the deployment is overloaded.

Motivation

When the async client exceeds the deployment's queue capacity or rate limits, requests currently fail with:

OverloadedError: Model is overloaded
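
Until this lands in the client, here is a minimal workaround sketch of retrying with exponential backoff. It assumes the Python client's `AsyncClient.generate` coroutine and that the error above is raised as `lorax.errors.OverloadedError`; the exact import paths and signatures may differ.

```python
import asyncio
import random

from lorax import AsyncClient
from lorax.errors import OverloadedError  # assumed import path for the error shown above


async def generate_with_backoff(client, prompt, max_retries=5, base_delay=1.0, **kwargs):
    """Retry a generate call with exponential backoff plus jitter when the deployment is overloaded."""
    for attempt in range(max_retries):
        try:
            return await client.generate(prompt, **kwargs)
        except OverloadedError:
            if attempt == max_retries - 1:
                raise
            # Sleep roughly 1s, 2s, 4s, ... plus jitter before retrying.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)


async def main():
    client = AsyncClient("http://127.0.0.1:8080")
    response = await generate_with_backoff(client, "What is deep learning?", max_new_tokens=64)
    print(response.generated_text)


asyncio.run(main())
```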
tgaddair commented 5 months ago

Current thinking is to add more examples of doing batch inference to avoid this issue on the client side. We can still look into backing off automatically in the future, though.
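
A minimal sketch of that client-side batching approach, again assuming the `AsyncClient.generate` API from above: cap the number of in-flight requests with a semaphore so the deployment queue is not exceeded in the first place.

```python
import asyncio

from lorax import AsyncClient  # assumed async client API, as in the sketch above


async def batch_generate(prompts, endpoint="http://127.0.0.1:8080", max_concurrency=8):
    """Run a batch of prompts while keeping at most `max_concurrency` requests in flight."""
    client = AsyncClient(endpoint)
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_generate(prompt):
        async with semaphore:
            return await client.generate(prompt, max_new_tokens=64)

    return await asyncio.gather(*(bounded_generate(p) for p in prompts))


prompts = [f"Summarize document {i}" for i in range(100)]
responses = asyncio.run(batch_generate(prompts))
print(responses[0].generated_text)
```

Tuning `max_concurrency` to stay below the deployment's queue capacity avoids the `OverloadedError` without any retry logic.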