Hey, based on your logs I think this is expected behavior.

The output of your curl for `/v1/chat/completions` reports 14 completion tokens. Your logs for the 1st request show `"time_per_token":"18.340269ms"`, so ~14 * 18.3 ≈ 256.2 ms (which is close to what you see client-side and close to the total `inference_time` reported).
The second request, for `/generate`, seems to be defaulting to `max_new_tokens: Some(100)`. Your logs for the 2nd request show `"time_per_token":"17.175441ms"`, so ~100 * 17.2 ≈ 1,720 ms (which, again, is close to what you see client-side and close to the total `inference_time` reported).
You should be able to get comparable timings if you explicitly set `max_new_tokens` (for `/generate`) and `max_tokens` (for `/v1/chat/completions`).
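For example, a rough sketch of the two requests with the token limits pinned to the same value (assuming a TGI instance at `localhost:8080` and an arbitrary prompt; adjust host, port, and payload to your deployment):

```shell
# /generate: cap generation with max_new_tokens inside "parameters"
curl -s http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 256}}'

# /v1/chat/completions: the equivalent cap is max_tokens
curl -s http://localhost:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"model": "tgi", "messages": [{"role": "user", "content": "What is deep learning?"}], "max_tokens": 256}'
```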
@claudioMontanari indeed, the time per token is the same. But setting the maximum number of tokens to 256 (for both endpoint calls) still yields the same latencies: 0.3-0.4 s and 1.8-1.9 s.
System Info
Running the text-generation-inference Docker image, version 2.4.0, with eetq quantization
Model: microsoft/Phi-3.5-mini-instruct
Hardware: Google Kubernetes Engine, L4 GPU
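For context, a rough docker-run equivalent of this setup (a sketch only; the actual deployment is on GKE, and the port mapping and shm-size flags here are assumptions taken from the TGI README):

```shell
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:2.4.0 \
    --model-id microsoft/Phi-3.5-mini-instruct \
    --quantize eetq
```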
Reproduction
phi_body.json
phi_generate_body.json
Similar times are reported in the logs
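For reference, a minimal sketch of how the attached bodies would be posted to the two endpoints (assuming the server is reachable at `localhost:8080`; the exact contents of the two JSON files are not reproduced here):

```shell
# Chat-completions request using the attached body
curl -s http://localhost:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d @phi_body.json

# /generate request using the attached body
curl -s http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d @phi_generate_body.json
```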
Expected behavior
https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/consuming_tgi
Based on this docs page, the two endpoints should behave identically, but there is a large difference in both the results and the inference time.