feifeibear opened this issue 8 months ago
I have updated the tool with a chat stage for your requirement. It is available on http://llm-viewer.com/ in versions >0.3.5.
Could you tell me how to use it in my code? What are the differences from PR #1?
PR #1 has been merged. The result may be slightly different when generating long sequences, as I have used an approximation in the web mode to reduce the analysis cost and the web response time. However, when the sequence length is small, the result remains the same. Rest assured, I have tested the difference between them and it is less than 1%.
Thanks. Could you provide an API in the codebase so it can be used from Python?
Using your latest webview, we can see that the latency for bs=64, in=512, out=512 is 8.2 s.
However, if I use the analyze_generate_task() API, the latency is over 11 s:
nvidia_A100_80G: 1st token latency 1.4909548877801777, total latency 11.478970542701218, throughput 2854.6113850632005 Token/sec
My code is here: #3
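For reference, here is a minimal sketch of how that measurement might be reproduced from Python. Only analyze_generate_task() and the nvidia_A100_80G hardware name come from this thread; the module path, class name, constructor arguments, and keyword names are assumptions, not the project's confirmed API.

```python
# Minimal sketch: names below are assumptions, except analyze_generate_task()
# and "nvidia_A100_80G", which appear in this thread.
from model_analyzer import ModelAnalyzer  # assumed import path

analyzer = ModelAnalyzer(
    model_id="meta-llama/Llama-2-7b-hf",  # assumed example model
    hardware="nvidia_A100_80G",
)

result = analyzer.analyze_generate_task(
    prompt_len=512,     # "in" in the webview
    gen_token_len=512,  # "out" in the webview
    batchsize=64,       # "bs" in the webview
)
print(result)  # expected to report 1st token latency, total latency, throughput
```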
I have fixed the bug; the inconsistency between the webview and the command line comes from the use_flash_attn flag.
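To check the effect of that flag from Python, a hedged sketch could compare both settings. Whether analyze_generate_task() accepts a use_flash_attn keyword directly is an assumption, as are all names not mentioned in this thread.

```python
# Sketch: compare results with and without the flag to match the webview setting.
# The keyword name is taken from the use_flash_attn flag mentioned above (assumed).
from model_analyzer import ModelAnalyzer  # assumed import path

analyzer = ModelAnalyzer(
    model_id="meta-llama/Llama-2-7b-hf",  # assumed example model
    hardware="nvidia_A100_80G",
)

for flash in (False, True):
    r = analyzer.analyze_generate_task(
        prompt_len=512, gen_token_len=512, batchsize=64,
        use_flash_attn=flash,
    )
    print(f"use_flash_attn={flash}: {r}")
```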
I would like to get the throughput measured as (generated tokens) / (overall latency = prefill + decode elapsed time). Could you please provide an example of this?
The function analyze() does not have a parameter such as prompt_len.
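Since analyze() apparently lacks a prompt_len parameter, a sketch based on analyze_generate_task() might look like the following. The throughput formula matches the numbers quoted earlier (64 × 512 tokens / 11.48 s ≈ 2854 token/s); the result key name and all other identifiers not in this thread are assumptions.

```python
# Sketch: throughput = generated tokens / (prefill + decode elapsed time).
# Only analyze_generate_task() comes from this thread; result keys are assumed.
from model_analyzer import ModelAnalyzer  # assumed import path

analyzer = ModelAnalyzer(
    model_id="meta-llama/Llama-2-7b-hf",  # assumed example model
    hardware="nvidia_A100_80G",
)

prompt_len, gen_tokens, batchsize = 512, 512, 64
r = analyzer.analyze_generate_task(
    prompt_len=prompt_len, gen_token_len=gen_tokens, batchsize=batchsize)

total_latency = r["total_latency"]  # assumed key: prefill + decode time, in seconds
throughput = batchsize * gen_tokens / total_latency  # tokens generated per second overall
print(f"throughput: {throughput:.1f} token/s over {total_latency:.2f} s")
```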