Open avianion opened 5 months ago
How did you find that it's including the latency? Looking at the code, the behaviour, and statements like these (https://github.com/janhq/jan/issues/2636#issuecomment-2041934155), it looks to me like it's already counting only the time of actual generation. It's not including the time to first token.
From the logs, where separate timings are shown for prompt evaluation, token generation (eval time), and total time, only the second figure is used to display tokens/s:
20240514 08:31:51.048000 UTC 17652 DEBUG [print_timings] print_timings: prompt eval time = 119.744ms / 33 tokens (3.62860606061 ms per token, 275.587920898 tokens per second) - context/llama_server_context.h:448
20240514 08:31:51.048000 UTC 17652 DEBUG [print_timings] print_timings: eval time = 339.311 ms / 17 runs (19.9594705882 ms per token, 50.1015292755 tokens per second) - context/llama_server_context.h:455
20240514 08:31:51.048000 UTC 17652 DEBUG [print_timings] print_timings: total time = 459.055 ms - context/llama_server_context.h:462
Last time I checked the UI, what I saw was that 50 t/s figure.
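For reference, that ~50 t/s figure can be reproduced from the eval-time line alone. A minimal sketch with the numbers taken from the log above (this is not Jan's actual code):

```ts
// Reproduce the displayed ~50 t/s from the llama.cpp "eval time" line only,
// i.e. excluding prompt eval time and time to first token.
const evalTimeMs = 339.311;  // "eval time" from the log above
const generatedTokens = 17;  // "17 runs"

const tokensPerSecond = generatedTokens / (evalTimeMs / 1000);
console.log(tokensPerSecond.toFixed(1)); // "50.1", matching the log and the UI
```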
The eval time is calculated including the initial time of response from the server.
It is not. The eval time is the llama.cpp-reported eval time for generating the tokens. The time to first response is sometimes several seconds, yet the tok/s value remains ~47-50. Adding a full second of time to first response to the eval time would result in a drastically lower figure of around 13 t/s, which is not a figure I see in the GUI.
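To make that arithmetic explicit, using the numbers from the log above and an assumed one second of time to first response:

```ts
// If a full second of time-to-first-response were folded into the eval time,
// the displayed figure would drop to roughly 13 t/s. The eval time and token
// count come from the log above; the 1 s latency is hypothetical.
const evalTimeSec = 0.339311;        // generation time reported by llama.cpp
const timeToFirstResponseSec = 1.0;  // hypothetical added latency
const generatedTokens = 17;

console.log(
  (generatedTokens / (evalTimeSec + timeToFirstResponseSec)).toFixed(1)
); // "12.7" — nowhere near the ~50 t/s shown in the GUI
```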
Related #985
Describe the bug
Tokens per second is currently calculated including the latency from the beginning of the API request and/or hitting the start button.
However, tokens per second should instead be calculated as (see the sketch below):
(Total tokens) / (Time to last token - Time to first token)
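A minimal client-side sketch of that calculation while streaming tokens; the onToken/tokensPerSecond names are illustrative, not Jan's actual API:

```ts
// Measure tokens per second between the first and last streamed token,
// so request latency and time to first token are excluded.
let firstTokenAt: number | null = null;
let lastTokenAt = 0;
let totalTokens = 0;

function onToken(): void {
  const now = performance.now();
  if (firstTokenAt === null) firstTokenAt = now; // start timing at the first token
  lastTokenAt = now;
  totalTokens += 1;
}

function tokensPerSecond(): number {
  if (firstTokenAt === null || lastTokenAt === firstTokenAt) return 0;
  // (Total tokens) / (Time to last token - Time to first token)
  return totalTokens / ((lastTokenAt - firstTokenAt) / 1000);
}
```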
Steps to reproduce
Use jan.ai and observe that the displayed tokens-per-second figure is calculated incorrectly.
Expected behavior
(Total tokens) / (Time to last token - Time to first token)
Screenshots
N/A
Environment details
Logs
If the cause of the error is not clear, kindly provide your usage logs: https://jan.ai/docs/troubleshooting#how-to-get-error-logs
Additional context
Add any other context or information that could be helpful in diagnosing the problem.