huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Ability to retrieve the protocol response headers in `InferenceClient` #2281

Closed. fxmarty closed this issue 3 weeks ago

fxmarty commented 1 month ago

As per the title, it would be helpful to be able to retrieve the response headers, as is possible with `curl --include`.

Example of a useful response header:

(base) felix@azure-amd-mi300-dev-01:~$ curl 0.0.0.0:80/generate -X POST -d '{"inputs":"Today I am in Paris and","parameters":{"max_new_tokens": 3, "details": true}}' -H 'Content-Type: application/json' --include
HTTP/1.1 200 OK
content-type: application/json
x-compute-type: gpu+optimized
x-compute-time: 0.111191439
x-compute-characters: 23
x-total-time: 111
x-validation-time: 0
x-queue-time: 0
x-inference-time: 110
x-time-per-token: 36
x-prompt-tokens: 7
x-generated-tokens: 3
content-length: 318
access-control-allow-origin: *
vary: origin
vary: access-control-request-method
vary: access-control-request-headers
date: Tue, 14 May 2024 12:57:59 GMT
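
For comparison, a minimal sketch with plain `requests` (not `InferenceClient`), assuming the same local TGI endpoint as the curl call above; it only illustrates that the raw HTTP response already carries these headers:

```python
# Rough Python equivalent of the curl call above, using plain requests
# (not InferenceClient) against the same local TGI endpoint: the protocol
# headers are already present on the raw HTTP response.
import requests

response = requests.post(
    "http://0.0.0.0:80/generate",
    json={
        "inputs": "Today I am in Paris and",
        "parameters": {"max_new_tokens": 3, "details": True},
    },
)
response.raise_for_status()

print(response.headers.get("x-compute-time"))
print(response.headers.get("x-total-time"))
print(response.headers.get("x-generated-tokens"))
```
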
Wauplin commented 1 month ago

Hi @fxmarty, thanks for the feature request. Do you have any suggestion on how this information should or could be returned in the current `InferenceClient` framework? Open to ideas on that.

Wauplin commented 3 weeks ago

I'm closing this issue since no new details have been provided. @fxmarty Happy to reopen it if you want; just let me know what your use case for such a feature would be so that we can figure out the best way of supporting it.

fxmarty commented 3 weeks ago

@Wauplin the feature this would enable is tracking TGI's response time from the client side.

With stats like:

x-compute-type: gpu+optimized
x-compute-time: 0.111191439
x-compute-characters: 23
x-total-time: 111
x-validation-time: 0
x-queue-time: 0
x-inference-time: 110
x-time-per-token: 36
x-prompt-tokens: 7
x-generated-tokens: 3

For example, in https://huggingface.co/spaces/fxmarty/tgi-mi300-demo-chat/blob/main/app.py, I wanted to use `client.text_generation` and surface these stats to the user as well, but I couldn't without using the REST API myself. Note that TGI does not return these stats in the `generate_stream` endpoint.
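
For illustration, roughly what "using the REST API myself" amounts to, sketched with plain `requests` (the endpoint URL and helper name are illustrative, not taken from the actual app.py):

```python
# Sketch of the current workaround: hit the TGI REST endpoint directly instead of
# client.text_generation, so that the x-* headers are available to display.
# The URL and helper name are illustrative, not taken from the actual app.py.
import requests

def generate_with_stats(prompt: str, url: str = "http://0.0.0.0:80/generate"):
    response = requests.post(
        url,
        json={"inputs": prompt, "parameters": {"max_new_tokens": 3, "details": True}},
    )
    response.raise_for_status()
    # Keep only the protocol headers listed above (x-compute-time, x-queue-time, ...).
    stats = {k: v for k, v in response.headers.items() if k.lower().startswith("x-")}
    return response.json(), stats

output, stats = generate_with_stats("Today I am in Paris and")
print(output)  # TGI's JSON payload (generated text, details, ...)
print(stats)   # e.g. {'x-compute-time': '0.111191439', 'x-total-time': '111', ...}
```
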

Wauplin commented 3 weeks ago

@fxmarty thanks for the explanation! Any suggestion on how you would like this information to be returned in the current `InferenceClient` framework?