Wauplin closed this 2 weeks ago
What about doing 1 query without the wait, and only adding the wait after the first retry?
Nice idea! Implemented it in https://github.com/huggingface/huggingface_hub/pull/2318/commits/8ea3f1d16eb225e6f46308e568f907be2b699ee2
Thanks for the review!
Should fix https://github.com/huggingface/huggingface_hub/issues/2175.
In the current implementation, `InferenceClient` sends a request every 1s as long as the model is unavailable (HTTP 503). This can lead users to be rate limited even though they don't consume the API (reported here). This PR adds `"X-wait-for-model": "1"` as a header, which tells the server to wait for the model to be loaded before returning a response. This way the client doesn't make calls every X seconds for nothing. This `X-wait-for-model` header is added only when requesting the serverless Inference API.

EDIT: based on @Narsil's comment, the header is added to the request only on the second call. This way, users don't reach the rate limit, but we are still able to log a message telling the user the model is not loaded yet.
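The behavior described above (first call without the header, header added from the first retry onward) can be sketched roughly as follows. This is a minimal illustration, not the actual `huggingface_hub` implementation; the `query` helper and the injected `send` callable are hypothetical:

```python
# Header that asks the serverless Inference API to block until the model
# is loaded instead of returning HTTP 503 immediately.
WAIT_HEADER = {"X-wait-for-model": "1"}


def query(url, payload, send, max_retries=1):
    """Sketch of the retry behavior.

    The first request is sent without the wait header. If the model is
    still loading (HTTP 503), we log a message once and retry with the
    header set, so the server holds the connection instead of the client
    polling every second (and burning through its rate limit).
    """
    headers = {}
    status, body = None, None
    for attempt in range(max_retries + 1):
        status, body = send(url, payload, headers)
        if status != 503:
            return status, body
        # Model not loaded yet: inform the user, then let the server wait.
        print(f"Model at {url} is not loaded yet, waiting for it...")
        headers = dict(WAIT_HEADER)
    return status, body
```

A fake `send` that returns 503 on the first call and 200 afterwards is enough to see that the header only appears from the second request onward.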
cc @Narsil (from private slack thread)