feifeibear opened this issue 8 months ago
I have updated the tool with a chat stage for your requirement. It is available on http://llm-viewer.com/ in versions >0.3.5.
Could you tell me how to use it in my code? What are the differences from PR #1?
PR #1 has been merged. The result may be slightly different when generating long sequences, as I have used an approximation in the web mode to reduce the analysis cost and the web response time. However, when the sequence length is small, the result remains the same. Rest assured, I have tested the difference between them and it is less than 1%.
Thanks. Could you provide an API in the codebase so it can be used from Python?
Using your latest webview, we can see that the latency for bs=64, in=512, out=512 is 8.2 s.
However, if I use the analyze_generate_task() API, the latency is over 11 s:
nvidia_A100_80G: 1st token latency 1.4909548877801777, total latency 11.478970542701218, throughput 2854.6113850632005 Token/sec
My code is here: #3
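For reference, here is a minimal sketch of how that measurement might be reproduced from Python. Only analyze_generate_task() and the nvidia_A100_80G hardware name come from this thread; the module path, class name, constructor arguments, and keyword names are assumptions, not the project's confirmed API.

```python
# Minimal sketch: names below are assumptions, except analyze_generate_task()
# and "nvidia_A100_80G", which appear in this thread.
from model_analyzer import ModelAnalyzer  # assumed import path

analyzer = ModelAnalyzer(
    model_id="meta-llama/Llama-2-7b-hf",  # assumed example model
    hardware="nvidia_A100_80G",
)

result = analyzer.analyze_generate_task(
    prompt_len=512,     # "in" in the webview
    gen_token_len=512,  # "out" in the webview
    batchsize=64,       # "bs" in the webview
)
print(result)  # expected to report 1st token latency, total latency, throughput
```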
I have fixed the bug; the inconsistency between the webview and the command line comes from the use_flash_attn flag.
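To check the effect of that flag from Python, a hedged sketch could compare both settings. Whether analyze_generate_task() accepts a use_flash_attn keyword directly is an assumption, as are all names not mentioned in this thread.

```python
# Sketch: compare results with and without the flag to match the webview setting.
# The keyword name is taken from the use_flash_attn flag mentioned above (assumed).
from model_analyzer import ModelAnalyzer  # assumed import path

analyzer = ModelAnalyzer(
    model_id="meta-llama/Llama-2-7b-hf",  # assumed example model
    hardware="nvidia_A100_80G",
)

for flash in (False, True):
    r = analyzer.analyze_generate_task(
        prompt_len=512, gen_token_len=512, batchsize=64,
        use_flash_attn=flash,
    )
    print(f"use_flash_attn={flash}: {r}")
```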
I would like to get the throughput measured as (generated tokens) / (overall latency = prefill + decode elapsed time). Could you please provide an example of this?
The function analyze() does not have a parameter such as prompt_len.
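Since analyze() apparently lacks a prompt_len parameter, a sketch based on analyze_generate_task() might look like the following. The throughput formula matches the numbers quoted earlier (64 × 512 tokens / 11.48 s ≈ 2854 token/s); the result key name and all other identifiers not in this thread are assumptions.

```python
# Sketch: throughput = generated tokens / (prefill + decode elapsed time).
# Only analyze_generate_task() comes from this thread; result keys are assumed.
from model_analyzer import ModelAnalyzer  # assumed import path

analyzer = ModelAnalyzer(
    model_id="meta-llama/Llama-2-7b-hf",  # assumed example model
    hardware="nvidia_A100_80G",
)

prompt_len, gen_tokens, batchsize = 512, 512, 64
r = analyzer.analyze_generate_task(
    prompt_len=prompt_len, gen_token_len=gen_tokens, batchsize=batchsize)

total_latency = r["total_latency"]  # assumed key: prefill + decode time, in seconds
throughput = batchsize * gen_tokens / total_latency  # tokens generated per second overall
print(f"throughput: {throughput:.1f} token/s over {total_latency:.2f} s")
```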