Open · feifeibear opened 8 months ago
You make a very good point, and it's a great question. The time estimated by the roofline model represents the fastest the hardware could possibly go. Our goal is to help people understand the main factors that determine how fast LLMs can run on a given device, so comparing how different factors affect the estimated time is still useful.

But we have to keep in mind that the predicted numbers will always be faster than what actually happens: these are ideal times, not measured ones. So perhaps we should add a note reminding readers that the report only shows the theoretical upper bound; real hardware and software will always be somewhat slower.

The main value of the tool is helping us understand what makes LLM inference fast or slow. Even if the absolute times are off, seeing how the different parts interact still gives a useful picture. Your question is a good reminder that these numbers are a limit, not a prediction of what will happen every time.
I understand that software often fails to fully utilize HBM bandwidth or max out Tensor Cores.
I'm attempting to use this project to compare the performance of an LLM inference task on two types of hardware. However, the results I obtain from LLM-Viewer often contradict the actual measurements. Please forgive me for not providing precise numbers, as some of the hardware information is confidential. For now, I am using a very naive roofline model in my own project: https://github.com/feifeibear/LLMRoofline
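For reference, this is roughly what I mean by a naive roofline estimate. It is only a sketch; the peak numbers are illustrative placeholders in the A100 ballpark, not the confidential hardware specs:

```python
# Naive roofline latency estimate for a single operator (illustrative sketch).
# peak_flops and peak_bw are placeholder values, not any specific device's real specs.

def roofline_time(flops, bytes_moved, peak_flops=312e12, peak_bw=2.0e12):
    """Ideal latency in seconds: the operator is bounded either by compute
    (flops / peak_flops) or by memory traffic (bytes / peak_bw), whichever
    is slower. Real kernels are always somewhat slower than this bound."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bw
    return max(compute_time, memory_time)

# Example: one decode-step GEMM with a 4096x4096 fp16 weight at batch size 1.
m, k, n = 1, 4096, 4096
flops = 2 * m * k * n                      # multiply-accumulate count
bytes_moved = 2 * (m * k + k * n + m * n)  # fp16 reads + writes, assuming no cache reuse
print(roofline_time(flops, bytes_moved))   # memory-bound: dominated by loading the weight
```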
Therefore, I'm particularly curious about the purpose of this project:
What is the most significant benefit of analyzing every operator's arithmetic intensity (AI)? I understand that it can help us quantify the effects of certain optimizations. Are there any other, more practical uses for this project?
Could we establish a more precise performance model, perhaps using some sampling and fitting methods, to predict the costs of different tasks?
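To make the second question concrete, one option I have in mind is calibrating the model against a handful of measured kernels, e.g. fitting per-operator latency as an affine function of FLOPs and bytes moved. The sample values below are made up purely to show the shape of the approach:

```python
import numpy as np

# Hypothetical calibration samples: (flops, bytes_moved, measured_seconds),
# gathered by timing a few representative kernels on the target device.
samples = np.array([
    [3.4e7, 3.4e7, 2.1e-5],
    [2.7e8, 6.8e7, 4.0e-5],
    [1.1e9, 1.4e8, 9.5e-5],
    [4.3e9, 2.7e8, 2.3e-4],
])
flops, bytes_moved, measured = samples.T

# Fit latency ~= a * flops + b * bytes + c by least squares.
# a and b act as effective (not peak) inverse throughput and bandwidth,
# and c absorbs per-kernel launch overhead.
A = np.stack([flops, bytes_moved, np.ones_like(flops)], axis=1)
(a, b, c), *_ = np.linalg.lstsq(A, measured, rcond=None)

def predicted_time(flops, bytes_moved):
    return a * flops + b * bytes_moved + c

print(predicted_time(2.0e9, 2.0e8))  # estimate for an unseen operator
```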
I compared the LLM-Viewer results with the TensorRT-LLM A100 performance numbers published by NVIDIA.
We can see that the estimated generation throughput is higher than the measured results.
For prefill, the estimated time is lower than the measured results.
This error makes it impossible to use LLM-Viewer to reliably compare the performance of two hardware devices on the same task.
I feel that estimating precise computation time with an operator-level roofline model is quite unreliable, and I would like to hear your opinion, @hahnyuan.