hahnyuan / LLM-Viewer

Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline model in a user-friendly interface.
MIT License
311 stars 37 forks

The error between LLM-viewer predicted results and TensorRT-LLM real performance is large. #4

Open feifeibear opened 8 months ago

feifeibear commented 8 months ago

I compared the LLM-Viewer results with the TensorRT-LLM A100 performance numbers published by NVIDIA.

We can see that the estimated generation throughput is consistently higher than the measured results.

| Model | Batch Size | TP | Input Length | Output Length | TensorRT-LLM throughput (out tok/s/GPU) | LLM-Viewer est. throughput (out tok/s/GPU) |
|---|---|---|---|---|---|---|
| LLaMA 7B | 256 | 1 | 128 | 128 | 5,353 | 8,934.54 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,518 | 2,796.58 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 547 | 788.73 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 613 | 1,169.17 |

For the prefill time, you can see that the estimated first-token latency is lower than the measured results.

| Model | Batch Size | TP | Input Length | TensorRT-LLM 1st-token latency (ms) | LLM-Viewer est. 1st-token latency (s) | LLM-Viewer est. 1st-token latency (ms) |
|---|---|---|---|---|---|---|
| LLaMA 7B | 1 | 1 | 128 | 16.1 | 0.006977 | 6.976999894 |
| LLaMA 7B | 1 | 1 | 2048 | 120.5 | 0.10088071 | 100.88071 |
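
As a rough sanity check on the 2048-token case (my own back-of-envelope, assuming ~6.7B parameters and the A100's 312 TFLOPS FP16 peak): prefill FLOPs ≈ 2 × 6.7e9 × 2048 ≈ 2.7e13, which gives a compute-bound lower bound of roughly 88 ms. The LLM-Viewer estimate of ~101 ms sits just above that bound, while the measured 120.5 ms is higher than both, as expected on a real system.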

This level of error makes it impossible to use LLM-Viewer to predict how two hardware devices would compare on the same task.

I feel that estimating precise computation time with an operator-level roofline model is very unreliable, and I would like to hear your opinion, @hahnyuan.
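
For reference, this is what I mean by an operator-level roofline estimate (a minimal sketch of the idea, not LLM-Viewer's actual code; the A100 peak numbers are public specs and the GEMM shape is only illustrative):

```python
# Minimal per-operator roofline estimate: each op is bounded either by
# peak compute or by peak memory bandwidth, whichever is slower.
PEAK_FLOPS = 312e12   # A100 FP16 Tensor Core peak, FLOP/s
PEAK_BW = 2.0e12      # A100 80GB HBM bandwidth, ~2 TB/s, in bytes/s

def roofline_time(flops: float, bytes_moved: float) -> float:
    """Lower-bound execution time for one operator."""
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)

# Illustrative op: a 4096x4096 FP16 projection over 2048 prefill tokens.
m, k, n = 2048, 4096, 4096
flops = 2 * m * k * n                      # one multiply-accumulate = 2 FLOPs
bytes_moved = 2 * (m * k + k * n + m * n)  # FP16 = 2 bytes: read A, B; write C
print(f"arithmetic intensity: {flops / bytes_moved:.1f} FLOP/byte")
print(f"roofline lower bound: {roofline_time(flops, bytes_moved) * 1e3:.3f} ms")
```

Each per-operator estimate is a hard lower bound, so summing them over a model can only underestimate the end-to-end latency.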

hahnyuan commented 8 months ago

You raise a very good point. The time estimated by the roofline model is the fastest the hardware could possibly go. Our aim is to help people understand the main factors that determine how fast LLMs run on a given device, so comparing how different choices affect the estimated time is still useful.

But we have to remember that the predicted numbers will always be faster than what really happens: they are best-case times, not real ones. Maybe we should add a note reminding users that the report only shows the upper bound reachable in a perfect world; real hardware will always be somewhat slower.

The most important thing is that the tool helps us understand what makes LLM inference fast or slow. Even if the exact time is off, seeing how the different parts interact still gives a good picture. Your question is a helpful reminder that the results are a limit, not what will actually happen every time.
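
To put a number on that gap using the throughput table in this issue (a quick calculation, not something LLM-Viewer reports):

```python
# (measured TensorRT-LLM tok/s/GPU, LLM-Viewer estimated tok/s/GPU),
# copied from the table at the top of this issue.
cases = [
    (5353, 8934.54),   # bs=256, 128 in / 128 out
    (1518, 2796.58),   # bs=32,  128 in / 2048 out
    (547,  788.73),    # bs=32,  2048 in / 128 out
    (613,  1169.17),   # bs=16,  2048 in / 2048 out
]
for real, est in cases:
    print(f"realized fraction of the roofline bound: {real / est:.2f}")
# roughly 0.60, 0.54, 0.69, 0.52
```

The realized fraction varies with the workload (about 50-70% here), which is also why a single fixed correction factor would not fully close the gap.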

feifeibear commented 8 months ago

I understand that software often fails to fully utilize HBM bandwidth or max out Tensor Cores.

I'm attempting to use this project to compare the performance of an LLM inference task on two types of hardware, but I often get results from LLM-Viewer that contradict the actual measurements. Please forgive me for not providing precise numbers, as some of the hardware information is confidential. For now, I am using a very naive roofline model in my own project: https://github.com/feifeibear/LLMRoofline

Therefore, I'm particularly curious about the purpose of this project:

  1. What is the most significant benefit of analyzing every operator's arithmetic intensity (AI)? I understand that it can help quantify the effect of certain optimizations; is there any other, more practical use for this project?

  2. Could we establish a more precise performance model, perhaps using sampling and fitting methods, to predict the costs of different tasks? (A rough sketch of what I mean is below.)
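
On point 2, here is the kind of thing I have in mind (a hypothetical sketch, not existing LLM-Viewer functionality): sample a few measured kernels on the target device and fit *effective* compute and bandwidth by least squares, then reuse the roofline-style predictors with the fitted numbers instead of peak specs.

```python
import numpy as np

# Hypothetical calibration: fit effective hardware parameters from measurements.
# Model: time ≈ flops * inv_flops + bytes * inv_bw (an additive variant of the roofline).

# toy samples: (FLOPs, bytes moved, measured seconds) -- placeholder values
samples = np.array([
    [6.9e10, 6.7e07, 3.9e-04],
    [1.7e10, 3.4e07, 1.1e-04],
    [8.6e09, 1.7e08, 1.6e-04],
    [3.4e10, 6.7e07, 2.2e-04],
])
X, t = samples[:, :2], samples[:, 2]
(inv_flops, inv_bw), *_ = np.linalg.lstsq(X, t, rcond=None)
print(f"effective compute  : {1 / inv_flops / 1e12:.0f} TFLOP/s")
print(f"effective bandwidth: {1 / inv_bw / 1e12:.2f} TB/s")

def predict_seconds(flops: float, bytes_moved: float) -> float:
    """Predict operator latency with the fitted effective-hardware model."""
    return flops * inv_flops + bytes_moved * inv_bw
```

In practice a non-negative least-squares fit (or separate fits per operator type) would be more robust, but even this simple calibration replaces the "perfect world" peak numbers with what the device actually achieves.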