This is what I get for ggml-vicuna-13B-1.1-q4_0.bin using exactly your commands:
llama_print_timings: sample time = 136.68 ms / 407 runs ( 0.34 ms per token)
llama_print_timings: prompt eval time = 177.06 ms / 9 tokens ( 19.67 ms per token)
llama_print_timings: eval time = 10170.22 ms / 406 runs ( 25.05 ms per token)

llama_print_timings: sample time = 137.41 ms / 407 runs ( 0.34 ms per token)
llama_print_timings: prompt eval time = 175.21 ms / 9 tokens ( 19.47 ms per token)
llama_print_timings: eval time = 9979.90 ms / 406 runs ( 24.58 ms per token)
Performance is exactly the same on my system.
The original bug report was posted at https://github.com/abetlen/llama-cpp-python/issues/398, but it seems like they don't care.
I get similar accusations myself. Most likely, he simply doesn't have time to reply to every single issue (just as I don't for the issues here). llama-cpp-python is a well-documented project with 693 commits. We should be grateful for the author's effort.
@oobabooga It is weird.
After reading your comment, I rented a machine from runpod.io (RTX 3090, EPYC 7502, Ubuntu). Strictly following the procedure I listed above, I installed the test environment and conducted the test multiple times.
My results for guanaco-33B.ggmlv3.q4_K_M.bin:
Llama.cpp:
llama_print_timings: load time = 3568.69 ms
llama_print_timings: sample time = 846.83 ms / 1000 runs ( 0.85 ms per token)
llama_print_timings: prompt eval time = 526.94 ms / 9 tokens ( 58.55 ms per token)
llama_print_timings: eval time = 58901.63 ms / 999 runs ( 58.96 ms per token)
llama_print_timings: total time = 60591.11 ms
Llama-cpp-python:
llama_print_timings: load time = 530.60 ms
llama_print_timings: sample time = 877.34 ms / 1000 runs ( 0.88 ms per token)
llama_print_timings: prompt eval time = 530.55 ms / 9 tokens ( 58.95 ms per token)
llama_print_timings: eval time = 63287.93 ms / 999 runs ( 63.35 ms per token)
llama_print_timings: total time = 77040.04 ms
Llama-cpp-python takes 16.5 seconds longer to run.
Then I downloaded your test model: ggml-vic13b-q4_0.bin from here:
Llama.cpp:
llama_print_timings: load time = 1808.81 ms
llama_print_timings: sample time = 347.80 ms / 407 runs ( 0.85 ms per token)
llama_print_timings: prompt eval time = 197.25 ms / 9 tokens ( 21.92 ms per token)
llama_print_timings: eval time = 10216.32 ms / 406 runs ( 25.16 ms per token)
llama_print_timings: total time = 10881.35 ms
Llama-cpp-python:
llama_print_timings: load time = 199.19 ms
llama_print_timings: sample time = 355.34 ms / 407 runs ( 0.87 ms per token)
llama_print_timings: prompt eval time = 199.14 ms / 9 tokens ( 22.13 ms per token)
llama_print_timings: eval time = 10455.50 ms / 406 runs ( 25.75 ms per token)
llama_print_timings: total time = 13567.84 ms
With a much smaller model, Llama-cpp-python takes 2.7 seconds longer to run.
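For reference, here is a quick back-of-the-envelope script that turns the two comparisons into absolute and relative gaps. It is only a sketch: the numbers are copied straight from the total time lines quoted above, nothing is measured independently.

# Back-of-the-envelope comparison using only the "total time" values quoted above.
runs = {
    "guanaco-33B q4_K_M": (60591.11, 77040.04),  # (llama.cpp ms, llama-cpp-python ms)
    "ggml-vic13b q4_0": (10881.35, 13567.84),
}
for name, (cpp_ms, py_ms) in runs.items():
    gap_ms = py_ms - cpp_ms
    print(f"{name}: +{gap_ms / 1000:.1f} s ({gap_ms / cpp_ms:.0%} slower than llama.cpp)")
    # guanaco-33B: +16.4 s (27% slower); ggml-vic13b: +2.7 s (25% slower)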
I was annoyed that nobody had replied to me in 2 days. Sorry if you found my words offensive.
@oobabooga
The discrepancy in timing between llama.cpp and llama-cpp-python is not related to the three lines you mentioned.
The issue lies in the total time, which you did not include in your reply.
In llama-cpp-python, the total time is significantly longer than the sum of the individual timings listed above it.
total time != sample time + prompt eval time + eval time
That is the problem we need to address.
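To make that concrete, here is a quick check; it is just a sketch, with the figures copied from the guanaco-33B timings quoted earlier in this thread.

# How much of the reported "total time" is NOT covered by
# sample time + prompt eval time + eval time?
def unaccounted_ms(sample, prompt_eval, evaluation, total):
    return total - (sample + prompt_eval + evaluation)

print(unaccounted_ms(877.34, 530.55, 63287.93, 77040.04))  # llama-cpp-python: ~12344 ms
print(unaccounted_ms(846.83, 526.94, 58901.63, 60591.11))  # llama.cpp: ~316 ms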
Another way to check the difference is by using a smartphone as a timer. Just start the timer when you press the "return" key on the keyboard, and then compare the total execution time of the two.
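A stopwatch works, but the same wall-clock check can be scripted with time.perf_counter and compared against the total time printed by llama_print_timings. The sketch below is only illustrative; the model path and parameters are placeholders, not the exact test setup.

import time
from llama_cpp import Llama

# Placeholder path and settings; substitute your own model and parameters.
llm = Llama(model_path="./models/guanaco-33B.ggmlv3.q4_K_M.bin", n_gpu_layers=128)

start = time.perf_counter()
for output in llm("Write an essay about american history",
                  max_tokens=1000, stream=True):
    pass  # just consume the stream; printing doesn't matter for the comparison
wall_clock_s = time.perf_counter() - start

# Compare this number with the "total time" reported by llama_print_timings.
print(f"wall-clock generation time: {wall_clock_s:.2f} s")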
Is this not just a function of the handoff between C++ land and Python land? It's pretty common to have a bottleneck when calling functions from one language to another like that. The inference speed per token is likely to be reported as the same (since it is measured by llama.cpp itself), but I am guessing the total time ends up longer because of the internal handoff between Python and the llama.cpp binary.
I could be wrong there (I haven't dug super deep into llama-cpp-python, but I looked around a bit), but if that's the cause, there's really not much to do other than maybe have llama-cpp-python keep cross-language calls to a minimum, which will be pretty hard since it's acting as a binding over llama.cpp and not as a text-ingest point for an independent llama backend.
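As a toy illustration of that kind of per-call overhead in general (not a measurement of llama-cpp-python itself; it assumes a POSIX system where ctypes.CDLL(None) exposes libc), crossing the Python/C boundary once per item adds a fixed cost that never shows up in the C library's own timers:

import ctypes
import time

libc = ctypes.CDLL(None)  # POSIX: access the already-linked C library
N = 1_000_000

start = time.perf_counter()
for _ in range(N):
    libc.rand()  # one cheap FFI round-trip per "token"
elapsed = time.perf_counter() - start

# The C work here is negligible; almost all of this is Python/ctypes call overhead.
print(f"{N} FFI calls took {elapsed:.2f} s ({elapsed / N * 1e6:.2f} µs per call)")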
A recently merged PR helps with this, at least on smaller models: https://github.com/abetlen/llama-cpp-python/pull/420
I still saw a significant slowdown relative to the native llama.cpp server on 30B and 65B models, though I'll have to do more thorough testing at some point.
@digiwombat How did you test it? Based on my test results, I found that the larger the model, the less noticeable the slowdown is. The difference in performance is within 2-3% for 30B models.
@shouyiwang I should state that my testing was very quick and dirty (only 3-4 test generations with each backend per model). My slowdown was closer to 10%, which is still much better than before the change.
I'd say just treat my results as inconclusive, since I haven't done any work to narrow down the cause. It could be any number of things about my setup (ooba API/native server into simple-proxy -> SillyTavern, straight Windows). I've been busy, so I haven't gotten to do any more testing.
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
This was originally posted as a bug report for llama-cpp-python, but I haven't received any replies yet, so I decided to repost it here. Maybe we should evaluate this module further. I heard that using their lower-level API can improve its performance.
Summary:
When testing the latest version of llama-cpp-python (0.1.64) alongside the corresponding commit of llama.cpp, I observed that llama.cpp performs significantly faster than llama-cpp-python in terms of total time taken to execute. Additionally, GPU utilization is consistently higher for llama.cpp compared to llama-cpp-python.
Environment:
Background
First, I manually updated the text-generation-webui requirements to the latest version of llama-cpp-python (0.1.64). After installing the update, I ran tests and saw that the speed improved, but it was still much slower than llama.cpp.
To focus on llama-cpp-python's role, I wrote code to test llama-cpp-python separately.
Steps to Reproduce:
llama-cpp-python:
Run conda list llama-cpp-python and make sure the version is 0.1.64.
Run python test.py, where test.py contains:
import sys
from llama_cpp import Llama

params = {
    'model_path': "/home/wsy/Projects/text-generation-webui/models/guanaco-33B.ggmlv3.q4_K_M.bin",
    'n_ctx': 1024,
    'seed': 4,
    'n_threads': 1,
    'n_batch': 256,
    'n_gpu_layers': 128,
}

llm = Llama(**params)

# Stream the completion token by token.
stream = llm(
    "Write an essay about american history",
    max_tokens=1000,
    stream=True,
)

for output in stream:
    print(output['choices'][0]['text'], end='')
    sys.stdout.flush()
Expected Outcome:
Similar performance and GPU utilization between llama-cpp-python and llama.cpp.
Actual Outcome:
Output of llama-cpp-python:
Output of llama.cpp:
For llama-cpp-python, total time is significantly larger than the sum of sample time + prompt eval time + eval time. In contrast, these times are consistent for llama.cpp.
Updated Findings
I conducted more tests and discovered additional facts that could be useful in solving the problem:
The total time != sample time + prompt eval time + eval time issue seems to have existed for quite some time: when llama.cpp was slow, it wasn't very noticeable, but now that llama.cpp is fast, it is much more evident.
I'd really appreciate it if you could investigate this performance discrepancy from the text-generation-webui side. Maybe you could try using the lower-level API instead?