hahnyuan / LLM-Viewer

Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline model in a user-friendly interface.
MIT License
311 stars 37 forks

The error between LLM-viewer predicted results and TensorRT-LLM real performance is large. #4

Open feifeibear opened 8 months ago

feifeibear commented 8 months ago

I compared the LLM-Viewer results with the TensorRT-LLM A100 performance numbers published by NVIDIA.

We can see that the estimated generation throughput is consistently higher than the measured results.

| Model | Batch Size | TP | Input Length | Output Length | TensorRT-LLM throughput (out tok/s/GPU) | LLM-Viewer est. throughput (out tok/s/GPU) |
|---|---|---|---|---|---|---|
| LLaMA 7B | 256 | 1 | 128 | 128 | 5,353 | 8,934.54 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,518 | 2,796.58 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 547 | 788.73 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 613 | 1,169.17 |

For the prefill time, you can see that the estimated first-token latency is lower than the measured results.

| Model | Batch Size | TP | Input Length | TensorRT-LLM 1st-token latency (ms) | LLM-Viewer est. 1st-token latency (s) | LLM-Viewer est. 1st-token latency (ms) |
|---|---|---|---|---|---|---|
| LLaMA 7B | 1 | 1 | 128 | 16.1 | 0.006977 | 6.976999894 |
| LLaMA 7B | 1 | 1 | 2048 | 120.5 | 0.10088071 | 100.88071 |
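
As a rough sanity check on the 2048-token case (my own back-of-envelope, assuming ~6.7B parameters and the A100's 312 TFLOPS FP16 peak): prefill FLOPs ≈ 2 × 6.7e9 × 2048 ≈ 2.7e13, which gives a compute-bound lower bound of roughly 88 ms. The LLM-Viewer estimate of ~101 ms sits just above that bound, while the measured 120.5 ms is higher than both, as expected on a real system.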

This level of error makes it impossible to use LLM-Viewer to predict how two hardware devices would compare on the same task.

I feel that estimating precise computation time with an operator-level roofline model is very unreliable, and I would like to hear your opinion, @hahnyuan.
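
For reference, this is what I mean by an operator-level roofline estimate (a minimal sketch of the idea, not LLM-Viewer's actual code; the A100 peak numbers are public specs and the GEMM shape is only illustrative):

```python
# Minimal per-operator roofline estimate: each op is bounded either by
# peak compute or by peak memory bandwidth, whichever is slower.
PEAK_FLOPS = 312e12   # A100 FP16 Tensor Core peak, FLOP/s
PEAK_BW = 2.0e12      # A100 80GB HBM bandwidth, ~2 TB/s, in bytes/s

def roofline_time(flops: float, bytes_moved: float) -> float:
    """Lower-bound execution time for one operator."""
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)

# Illustrative op: a 4096x4096 FP16 projection over 2048 prefill tokens.
m, k, n = 2048, 4096, 4096
flops = 2 * m * k * n                      # one multiply-accumulate = 2 FLOPs
bytes_moved = 2 * (m * k + k * n + m * n)  # FP16 = 2 bytes: read A, B; write C
print(f"arithmetic intensity: {flops / bytes_moved:.1f} FLOP/byte")
print(f"roofline lower bound: {roofline_time(flops, bytes_moved) * 1e3:.3f} ms")
```

Each per-operator estimate is a hard lower bound, so summing them over a model can only underestimate the end-to-end latency.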

hahnyuan commented 8 months ago

You raise a very good point. The time estimated by the roofline model is the fastest the hardware could possibly go. Our aim is to help people understand the main factors that determine how fast LLMs run on a given device, so comparing how different choices affect the estimated time is still useful.

But we have to remember that the predicted numbers will always be faster than what really happens: they are best-case times, not real ones. Maybe we should add a note reminding users that the report only shows the upper bound reachable in a perfect world; real hardware will always be somewhat slower.

The most important thing is that the tool helps us understand what makes LLM inference fast or slow. Even if the exact time is off, seeing how the different parts interact still gives a good picture. Your question is a helpful reminder that the results are a limit, not what will actually happen every time.
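
To put a number on that gap using the throughput table in this issue (a quick calculation, not something LLM-Viewer reports):

```python
# (measured TensorRT-LLM tok/s/GPU, LLM-Viewer estimated tok/s/GPU),
# copied from the table at the top of this issue.
cases = [
    (5353, 8934.54),   # bs=256, 128 in / 128 out
    (1518, 2796.58),   # bs=32,  128 in / 2048 out
    (547,  788.73),    # bs=32,  2048 in / 128 out
    (613,  1169.17),   # bs=16,  2048 in / 2048 out
]
for real, est in cases:
    print(f"realized fraction of the roofline bound: {real / est:.2f}")
# roughly 0.60, 0.54, 0.69, 0.52
```

The realized fraction varies with the workload (about 50-70% here), which is also why a single fixed correction factor would not fully close the gap.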

feifeibear commented 8 months ago

I understand that software often fails to fully utilize HBM bandwidth or max out Tensor Cores.

I'm attempting to use this project to compare the performance of an LLM inference task on two types of hardware, but I often get results from LLM-Viewer that contradict the actual measurements. Please forgive me for not providing precise numbers, as some of the hardware information is confidential. For now, I am using a very naive roofline model in my own project: https://github.com/feifeibear/LLMRoofline

Therefore, I'm particularly curious about the purpose of this project:

  1. What is the most significant benefit of analyzing every operator's arithmetic intensity (AI)? I understand that it can help quantify the effect of certain optimizations; is there any other, more practical use for this project?

  2. Could we establish a more precise performance model, perhaps using sampling and fitting methods, to predict the costs of different tasks? (A rough sketch of what I mean is below.)
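
On point 2, here is the kind of thing I have in mind (a hypothetical sketch, not existing LLM-Viewer functionality): sample a few measured kernels on the target device and fit *effective* compute and bandwidth by least squares, then reuse the roofline-style predictors with the fitted numbers instead of peak specs.

```python
import numpy as np

# Hypothetical calibration: fit effective hardware parameters from measurements.
# Model: time ≈ flops * inv_flops + bytes * inv_bw (an additive variant of the roofline).

# toy samples: (FLOPs, bytes moved, measured seconds) -- placeholder values
samples = np.array([
    [6.9e10, 6.7e07, 3.9e-04],
    [1.7e10, 3.4e07, 1.1e-04],
    [8.6e09, 1.7e08, 1.6e-04],
    [3.4e10, 6.7e07, 2.2e-04],
])
X, t = samples[:, :2], samples[:, 2]
(inv_flops, inv_bw), *_ = np.linalg.lstsq(X, t, rcond=None)
print(f"effective compute  : {1 / inv_flops / 1e12:.0f} TFLOP/s")
print(f"effective bandwidth: {1 / inv_bw / 1e12:.2f} TB/s")

def predict_seconds(flops: float, bytes_moved: float) -> float:
    """Predict operator latency with the fitted effective-hardware model."""
    return flops * inv_flops + bytes_moved * inv_bw
```

In practice a non-negative least-squares fit (or separate fits per operator type) would be more robust, but even this simple calibration replaces the "perfect world" peak numbers with what the device actually achieves.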