Please make sure your performance data are in the same format. For example, in the Python transformers benchmark, the 1st number is the total time to the 1st token and the 2nd is ms per token. What is the format of llama-bench? It looks like tokens per second.
I just noticed that llama-bench's format is prefill (t/s) | decode (t/s).
For 365 tokens, the 1st metric works out to a total time of 365 / 479 = 0.762 s, and the 2nd to 1000 / 27.6 = 36.23 ms/token. That shows the Python transformers benchmark is a little faster.
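For reference, a minimal sketch of that unit conversion, assuming the 479 t/s prefill and 27.6 t/s decode figures quoted above and a 365-token prompt:

```python
# Convert llama-bench throughput (tokens/s) into the metrics reported by the
# Python transformers benchmark: total time to the first token (s) and
# decode latency per generated token (ms/token).

def to_transformers_metrics(prompt_tokens: int, prefill_tps: float, decode_tps: float):
    first_token_time_s = prompt_tokens / prefill_tps  # time to process the whole prompt
    ms_per_token = 1000.0 / decode_tps                # per-token decode latency
    return first_token_time_s, ms_per_token

first, per_tok = to_transformers_metrics(365, 479.0, 27.6)
print(f"1st token: {first:.3f} s, decode: {per_tok:.2f} ms/token")
# -> 1st token: 0.762 s, decode: 36.23 ms/token
```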
Yes, we also noticed this. We tried different configurations (batch size, ubatch size, number of threads), but all of these configurations gave lower performance than the transformers results.
Background
We evaluated the performance with llama-bench from ipex-llm[cpp] and the benchmark script, to compare against the benchmark results from this image.
We found that the benchmark script, which uses the transformers pipeline and the PyTorch backend, achieves better performance than llama-bench (llama-bench evaluates the prefill and decode speeds separately and does no sampling during decoding at all, so it should have been faster than a normal LLM generate pipeline). We ran the benchmarks on Ubuntu 22.04 with an Intel Core Ultra 7 155H.
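For context, a rough illustration of what the transformers-side numbers correspond to. This is only a sketch, not the actual ipex-llm benchmark script; the model id, prompt, and token counts are placeholders:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id and prompt; the real benchmark script and model differ.
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
inputs = tokenizer("your ~365-token prompt here", return_tensors="pt")

# "1st" metric: total time to produce the first token (prompt prefill + one decode step).
t0 = time.time()
model.generate(**inputs, max_new_tokens=1, do_sample=False)
first_token_s = time.time() - t0

# "2nd" metric: amortized decode latency per generated token.
n_new = 32
t0 = time.time()
model.generate(**inputs, max_new_tokens=n_new, do_sample=False)
ms_per_token = (time.time() - t0 - first_token_s) / (n_new - 1) * 1000

print(f"first token: {first_token_s:.3f} s, decode: {ms_per_token:.2f} ms/token")
```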
The steps and our results
The llama-bench (original version) results:
As mentioned above, we first made some modifications to llama-bench so that it runs decode after prefill and reports the prefill and decode speeds separately.
Code is here: https://github.com/acane77/llama.cpp/tree/dev_ipex_mod
We built llama-bench with the following script, where `~/projects/llama-cpp/llama-bench-emb` is created by `llama-init` and the libs are linked to the ipex-llm venv. Then we ran this llama-bench-emb (our modified version); the results are as follows.
The following table, by contrast, is generated by the ipex-llm benchmark script.
Questions
Is there any reason for this significant performance gap between the Python transformers benchmark and llama-bench?
The difference is that the PyTorch benchmark uses the `xpu` device while llama-bench uses `gpu`.
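For illustration, this is roughly how the PyTorch-side benchmark ends up on the `xpu` device with ipex-llm. This is a sketch under the assumption that the script uses the ipex-llm transformers wrapper; the model id, prompt, and generation parameters are placeholders:

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder, not necessarily the model benchmarked
tokenizer = AutoTokenizer.from_pretrained(model_id)

# ipex-llm loads the model with low-bit weights and runs it through the
# PyTorch XPU backend (intel-extension-for-pytorch), i.e. torch device "xpu".
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
model = model.to("xpu")

inputs = tokenizer("hello", return_tensors="pt").to("xpu")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)

# llama-bench, by contrast, goes through llama.cpp's SYCL GPU backend,
# so the two runs exercise different code paths on the same iGPU.
```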