Closed. fxmarty closed this issue 3 weeks ago.

Original issue (@fxmarty): As per title, it would be helpful to be able to retrieve the response headers, as is possible with curl --include. Example of a useful response header: see the x-* headers listed in the comment below.
Hi @fxmarty, thanks for the feature request. Any suggestion on how this information should or could be returned within the current InferenceClient framework? I'm open to ideas on that.
I'm closing this issue since no new details have been provided. @fxmarty, happy to reopen it if you want; just let me know what your use case for such a feature would be so that we can figure out the best way to support it.
@Wauplin the feature this would enable is tracking the response time of TGI from the client, with stats like:
x-compute-type: gpu+optimized
x-compute-time: 0.111191439
x-compute-characters: 23
x-total-time: 111
x-validation-time: 0
x-queue-time: 0
x-inference-time: 110
x-time-per-token: 36
x-prompt-tokens: 7
x-generated-tokens: 3
For example, in https://huggingface.co/spaces/fxmarty/tgi-mi300-demo-chat/blob/main/app.py, I wanted to use client.text_generation and surface these stats to the user as well, but I couldn't without calling the REST API myself. Note that TGI does not return these stats on the generate_stream endpoint.
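For reference, here is a minimal sketch of the workaround described above: calling the TGI REST endpoint directly with requests so the response headers are visible. The endpoint URL is a placeholder, and the header names are the ones quoted in the list above.

```python
import requests

# Placeholder URL for a TGI deployment; TGI serves generation on /generate.
API_URL = "https://my-tgi-endpoint.example/generate"

def generate_with_stats(prompt: str, max_new_tokens: int = 32):
    """Call TGI's REST API directly so the x-* response headers are accessible."""
    response = requests.post(
        API_URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=60,
    )
    response.raise_for_status()
    # Collect the timing/usage headers quoted above (x-compute-time, etc.).
    stats = {k: v for k, v in response.headers.items() if k.lower().startswith("x-")}
    return response.json().get("generated_text"), stats

text, stats = generate_with_stats("What is Deep Learning?")
print(text)
print(stats.get("x-compute-time"), stats.get("x-generated-tokens"))
```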
@fxmarty thanks for the explanation! Any suggestion on how you would like this information to be returned in the current InferenceClient framework?
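One possible shape, sketched below, would be to attach the raw response headers to the returned object behind an opt-in flag. This is purely hypothetical and not an existing huggingface_hub API; the names return_headers and response_headers are invented here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical output type: today's text-generation result extended with the
# raw HTTP headers of the underlying request.
@dataclass
class TextGenerationOutputWithHeaders:
    generated_text: str
    response_headers: Dict[str, str] = field(default_factory=dict)

# Imagined call site under this assumption:
#   out = client.text_generation("Hello", return_headers=True)
#   print(out.response_headers.get("x-compute-time"))
```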