Wauplin closed this 2 weeks ago
What about doing 1 query without the wait, and only adding the wait after the first retry?
Nice idea! Implemented it in https://github.com/huggingface/huggingface_hub/pull/2318/commits/8ea3f1d16eb225e6f46308e568f907be2b699ee2
Thanks for the review!
Should fix https://github.com/huggingface/huggingface_hub/issues/2175.
In the current implementation, `InferenceClient` sends a request every 1s as long as the model is unavailable (HTTP 503). This can lead users to be rate limited even though they don't consume the API (reported here). This PR adds `"X-wait-for-model": "1"` as a header, which tells the server to wait for the model to be loaded before returning a response. This way the client doesn't make calls every X seconds for nothing. This `X-wait-for-model` header is added only when requesting the serverless Inference API.

EDIT: based on @Narsil's comment, the header is added to the request only on the second call. This way, users don't reach the rate limit, but we are still able to log a message telling the user the model is not loaded yet.
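The behavior described above (first call without the header, header added from the first retry onward) can be sketched roughly as follows. This is a minimal illustration, not the actual `huggingface_hub` implementation; the `query` helper and the injected `send` callable are hypothetical:

```python
# Header that asks the serverless Inference API to block until the model
# is loaded instead of returning HTTP 503 immediately.
WAIT_HEADER = {"X-wait-for-model": "1"}


def query(url, payload, send, max_retries=1):
    """Sketch of the retry behavior.

    The first request is sent without the wait header. If the model is
    still loading (HTTP 503), we log a message once and retry with the
    header set, so the server holds the connection instead of the client
    polling every second (and burning through its rate limit).
    """
    headers = {}
    status, body = None, None
    for attempt in range(max_retries + 1):
        status, body = send(url, payload, headers)
        if status != 503:
            return status, body
        # Model not loaded yet: inform the user, then let the server wait.
        print(f"Model at {url} is not loaded yet, waiting for it...")
        headers = dict(WAIT_HEADER)
    return status, body
```

A fake `send` that returns 503 on the first call and 200 afterwards is enough to see that the header only appears from the second request onward.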
cc @Narsil (from private slack thread)