huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Separate out `model` and `base_url` in `InferenceClient` #2293

Closed: anubhavrana closed this issue 3 weeks ago

anubhavrana commented 1 month ago

Is your feature request related to a problem? Please describe.

When using the InferenceClient we can use the model argument to specify either a model id on the Hugging Face Hub or the URL of a custom deployed Inference Endpoint. This generally works well, but we have an API gateway sitting in front of our text-generation-inference servers, and the gateway has additional plugins and routing based on the model_id.
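For illustration, both of these work today (the model id and endpoint URL below are just placeholders):

```python
from huggingface_hub import InferenceClient

# model can be a model id on the Hub...
client = InferenceClient(model="meta-llama/Meta-Llama-3-8B-Instruct")

# ...or the URL of a deployed Inference Endpoint / TGI server
client = InferenceClient(model="https://my-tgi-endpoint.example.com")
```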

Our plugins are currently only exposed for the OpenAI-compatible v1/chat/completions route, and the issue is not present when using the native text-generation-inference routes; however, the change may be good to have for both.

The way our plugin works is that we provide a base API URL and, based on the model_id passed to the chat_completion method, route requests to the relevant model endpoint.

When trying to use the InferenceClient with our gateway we run into 404 responses because the gateway cannot resolve the route (since model was set to "tgi").

Describe the solution you'd like

Separating model and base_url would solve the issue. Furthermore, if we use base_url when instantiating InferenceClient and model in the individual methods (like chat_completion), we also follow a pattern similar to the OpenAI client. This should not affect or break the current usage pattern, and we can keep default values similar to the current implementation.
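Concretely, something like this (a rough sketch of the proposed usage, mirroring the OpenAI client; the gateway URL and model id are placeholders):

```python
from huggingface_hub import InferenceClient

# Proposed: base_url points at the gateway, model is chosen per request
client = InferenceClient(base_url="https://gateway.example.com/v1")

response = client.chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    model="my-org/my-model",  # used by the gateway plugin to route the request
)
```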

Describe alternatives you've considered

We are able to circumvent this issue on the native text-generation-inference routes by specifying headers in InferenceClient; that works because our plugin is not enabled on those routes. For routes that have the plugin, however, adding the appropriate headers still fails, since the plugin injects the model_id from the request body (in this case the dummy string "tgi") and overrides the provided header.
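For the native routes the workaround looks roughly like this (X-Model-Id is just the header name our gateway happens to read; values are placeholders):

```python
from huggingface_hub import InferenceClient

# Works on the native text-generation-inference routes,
# where the gateway plugin is not enabled
client = InferenceClient(
    model="https://gateway.example.com",
    headers={"X-Model-Id": "my-org/my-model"},
)
output = client.text_generation("Hello!", max_new_tokens=20)
```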

Additional context

I'd be happy to create a PR for this. I've started working on a possible fix and would love to collaborate!

Wauplin commented 1 month ago

Hi @anubhavrana, so if I understand correctly, you would like to provide a custom model_id specifically for the chat_completion method, to override the dummy value sent to the server? To do that, I don't think we should use the model parameter on both InferenceClient and chat_completion to provide the URL and the model id respectively, as it would be misleading. I would only add an optional model_id: Optional[str] = None parameter to chat_completion and use it.

However, before moving forward on this I'd like to understand what your issue is with using headers to achieve your goal with the current implementation. Your router/gateway should be able to read the request headers to route the request. For example, you can do InferenceClient(model="TGI URL", headers={"X-Model-Id": <your_model_id>}).chat_completion(...). Is there a problem with this solution? I find it unusual to rely on the body payload to route requests.

anubhavrana commented 1 month ago

> Hi @anubhavrana, so if I understand correctly, you would like to provide a custom model_id specifically for the chat_completion method, to override the dummy value sent to the server? To do that, I don't think we should use the model parameter on both InferenceClient and chat_completion to provide the URL and the model id respectively, as it would be misleading. I would only add an optional model_id: Optional[str] = None parameter to chat_completion and use it.

Yes, that's exactly it. That makes sense, and I agree that it shouldn't be added to InferenceClient itself.

> However, before moving forward on this I'd like to understand what your issue is with using headers to achieve your goal with the current implementation. Your router/gateway should be able to read the request headers to route the request. For example, you can do InferenceClient(model="TGI URL", headers={"X-Model-Id": <your_model_id>}).chat_completion(...). Is there a problem with this solution? I find it unusual to rely on the body payload to route requests.

Yes, our gateway reads the request headers and uses them to route requests. What we have done with the gateway is to separate out the tgi and openai routes. For the tgi routes, doing what you proposed is exactly how we make requests to the gateway, and it works well. When the gateway receives requests on the openai routes, a custom plugin retrieves the model parameter from the request body and adds it to the request as a header, in the same way as the tgi routes.

Our users call the openai routes with the OpenAI client: when the chat.completions.create method is called, the model argument is provided and then used by the gateway plugin to add the headers. I would like to be able to do the same thing with the InferenceClient on the openai routes, to provide that flexibility as well.
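For reference, this is roughly how our users call the gateway today (sketch with placeholder values):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.example.com/v1",
    api_key="unused",  # the gateway handles auth, so this is a placeholder
)

response = client.chat.completions.create(
    model="my-org/my-model",  # read from the request body by the gateway plugin
    messages=[{"role": "user", "content": "Hello!"}],
)
```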

I hope that helped clarify the problem. Open to other solutions too.

Wauplin commented 1 month ago

Ok, makes sense. I opened https://github.com/huggingface/huggingface_hub/pull/2302 to support this :)

anubhavrana commented 1 month ago

Thank you so much! @Wauplin

Wauplin commented 3 weeks ago

Closed by https://github.com/huggingface/huggingface_hub/pull/2302. Will be available in next release :)