oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Llama-cpp-python has some serious performance issues #2788

Closed shouyiwang closed 1 year ago

shouyiwang commented 1 year ago

I originally posted this as a bug report for llama-cpp-python, but I haven't received any replies yet, so I decided to repost it here. Maybe we should evaluate this module further; I've heard that using its lower-level API can improve performance.

Summary:

When testing the latest version of llama-cpp-python (0.1.64) alongside the corresponding commit of llama.cpp, I observed that llama.cpp is significantly faster than llama-cpp-python in terms of total execution time. Additionally, GPU utilization is consistently higher for llama.cpp than for llama-cpp-python.

Environment:

Background

First, I manually updated the text-generation-webui requirements to the latest version of llama-cpp-python (0.1.64). After installing the update, I ran tests and saw that the speed had improved, but it was still much slower than llama.cpp.

To focus on llama-cpp-python's role, I wrote code to test llama-cpp-python separately.

Steps to Reproduce:

llama-cpp-python

1. Reinstall llama-cpp-python 0.1.64 with cuBLAS support:

```
pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.64 --no-cache-dir
```

2. Run `conda list llama-cpp-python` and make sure the version is 0.1.64.

3. Write a test.py file with the following code, change the model path to any GGML model on your local machine, then run `python test.py` (a wall-clock variant is sketched right after the code):
    
```python
import sys

from llama_cpp import Llama

params = {
    'model_path': "/home/wsy/Projects/text-generation-webui/models/guanaco-33B.ggmlv3.q4_K_M.bin",
    'n_ctx': 1024,
    'seed': 4,
    'n_threads': 1,
    'n_batch': 256,
    'n_gpu_layers': 128,
}

llm = Llama(**params)

stream = llm(
    "Write an essay about american history",
    max_tokens=1000,
    stream=True,
)

for output in stream:
    print(output['choices'][0]['text'], end='')
    sys.stdout.flush()
```
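
Optionally, to also capture the end-to-end wall-clock time (roughly what the total time line reflects, and what you would measure with a stopwatch), the same test can be wrapped with a timer. A minimal sketch using the same parameters as test.py; the time.perf_counter calls are the only addition:

```python
import sys
import time

from llama_cpp import Llama

# Same model and parameters as test.py above.
llm = Llama(
    model_path="/home/wsy/Projects/text-generation-webui/models/guanaco-33B.ggmlv3.q4_K_M.bin",
    n_ctx=1024, seed=4, n_threads=1, n_batch=256, n_gpu_layers=128,
)

start = time.perf_counter()
for output in llm("Write an essay about american history", max_tokens=1000, stream=True):
    print(output['choices'][0]['text'], end='')
    sys.stdout.flush()
elapsed = time.perf_counter() - start

print(f"\nwall-clock generation time: {elapsed:.2f} s", file=sys.stderr)
```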

llama.cpp

1. Go to the llama.cpp folder and rebuild it at the matching commit with cuBLAS support:

```
git pull
git checkout 8596af427722775f0df4a7c90b9af067ba90d4ef
make clean
make LLAMA_CUBLAS=1
```

2. Run llama.cpp with the exact same parameters using the following command (a wall-clock timing sketch follows):

```
./main -m ../models/guanaco-33B.ggmlv3.q4_K_M.bin -p "Write an essay about american history" -ngl 128 -s 4 -n 1000 -t 1 --ctx-size 1024 --batch-size 256
```
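
To put llama.cpp's end-to-end time on the same footing as the Python test, the identical command can be wrapped with the same wall-clock timer. A sketch, assuming ./main was built as in step 1 and the script is run from the llama.cpp folder:

```python
import subprocess
import time

# Same binary and arguments as the command in step 2.
cmd = [
    "./main",
    "-m", "../models/guanaco-33B.ggmlv3.q4_K_M.bin",
    "-p", "Write an essay about american history",
    "-ngl", "128", "-s", "4", "-n", "1000", "-t", "1",
    "--ctx-size", "1024", "--batch-size", "256",
]

start = time.perf_counter()
subprocess.run(cmd, check=True)  # llama.cpp prints its own timings to the terminal
print(f"wall-clock time: {time.perf_counter() - start:.2f} s")
```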

Expected Outcome:

Similar performance and GPU utilization between llama-cpp-python and llama.cpp.

Actual Outcome:

Output of llama-cpp-python:

llama_print_timings:        load time =   450.16 ms
llama_print_timings:      sample time =   412.64 ms /  1000 runs   (    0.41 ms per token)
llama_print_timings: prompt eval time =   450.12 ms /     9 tokens (   50.01 ms per token)
llama_print_timings:        eval time = 30622.88 ms /   999 runs   (   30.65 ms per token)
llama_print_timings:       total time = 39541.67 ms

Output of llama.cpp:

llama_print_timings:        load time =  2480.53 ms
llama_print_timings:      sample time =   426.18 ms /  1000 runs   (    0.43 ms per token)
llama_print_timings: prompt eval time =   447.96 ms /     9 tokens (   49.77 ms per token)
llama_print_timings:        eval time = 29871.26 ms /   999 runs   (   29.90 ms per token)
llama_print_timings:       total time = 30938.72 ms

Updated Findings

I conducted more tests and discovered additional facts that could be useful in solving the problem:

It seems the problem has existed for quite some time. When llama.cpp itself was slow, the extra overhead wasn't very noticeable, but now that llama.cpp is fast, it is much more evident.

I'd really appreciate it if you could investigate this performance discrepancy from the text-generation-webui side. Maybe you could try using the low-level API instead?

oobabooga commented 1 year ago

This is what I get for ggml-vicuna-13B-1.1-q4_0.bin using exactly your commands:

llama.cpp

llama_print_timings:      sample time =   136,68 ms /   407 runs   (    0,34 ms per token)
llama_print_timings: prompt eval time =   177,06 ms /     9 tokens (   19,67 ms per token)
llama_print_timings:        eval time = 10170,22 ms /   406 runs   (   25,05 ms per token)

llama-cpp-python

llama_print_timings:      sample time =   137.41 ms /   407 runs   (    0.34 ms per token)
llama_print_timings: prompt eval time =   175.21 ms /     9 tokens (   19.47 ms per token)
llama_print_timings:        eval time =  9979.90 ms /   406 runs   (   24.58 ms per token)

Performance is exactly the same on my system.

> The original bug report was posted at https://github.com/abetlen/llama-cpp-python/issues/398, but it seems like they don't care.

I get similar accusations myself. Most likely, he simply doesn't have time to reply to every single issue (just as I don't for the issues here). llama-cpp-python is a well-documented project with 693 commits. We should be grateful for the author's effort.

shouyiwang commented 1 year ago

@oobabooga That's weird.

After reading your comment, I rented a machine from runpod.io (RTX 3090, EPYC 7502, Ubuntu). Strictly following the procedure I listed above, I installed the test environment and conducted the test multiple times.

My results for guanaco-33B.ggmlv3.q4_K_M.bin:

Llama.cpp:

llama_print_timings:        load time =  3568.69 ms
llama_print_timings:      sample time =   846.83 ms /  1000 runs   (    0.85 ms per token)
llama_print_timings: prompt eval time =   526.94 ms /     9 tokens (   58.55 ms per token)
llama_print_timings:        eval time = 58901.63 ms /   999 runs   (   58.96 ms per token)
llama_print_timings:       total time = 60591.11 ms

Llama-cpp-python:

llama_print_timings:        load time =   530.60 ms
llama_print_timings:      sample time =   877.34 ms /  1000 runs   (    0.88 ms per token)
llama_print_timings: prompt eval time =   530.55 ms /     9 tokens (   58.95 ms per token)
llama_print_timings:        eval time = 63287.93 ms /   999 runs   (   63.35 ms per token)
llama_print_timings:       total time = 77040.04 ms

Llama-cpp-python takes 16.5 seconds longer to run.

Then I downloaded your test model: ggml-vic13b-q4_0.bin from here:

Llama.cpp:

llama_print_timings:        load time =  1808.81 ms
llama_print_timings:      sample time =   347.80 ms /   407 runs   (    0.85 ms per token)
llama_print_timings: prompt eval time =   197.25 ms /     9 tokens (   21.92 ms per token)
llama_print_timings:        eval time = 10216.32 ms /   406 runs   (   25.16 ms per token)
llama_print_timings:       total time = 10881.35 ms

Llama-cpp-python:

llama_print_timings:        load time =   199.19 ms
llama_print_timings:      sample time =   355.34 ms /   407 runs   (    0.87 ms per token)
llama_print_timings: prompt eval time =   199.14 ms /     9 tokens (   22.13 ms per token)
llama_print_timings:        eval time = 10455.50 ms /   406 runs   (   25.75 ms per token)
llama_print_timings:       total time = 13567.84 ms

With a much smaller model, Llama-cpp-python takes 2.7 seconds longer to run.

I was annoyed that nobody had replied to me in two days. Sorry if you found my words offensive.

shouyiwang commented 1 year ago

@oobabooga The discrepancy in timing between llama.cpp and llama-cpp-python is not related to the three lines you mentioned. The issue lies in the total time, which you did not include in your reply. In llama-cpp-python, the total time is significantly longer than the sum of the individual timings listed above it: total time != sample time + prompt eval time + eval time. That is the problem we need to address.
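
For example, plugging the numbers from my first post into a quick sanity check shows how much time is unaccounted for:

```python
# llama_print_timings values (ms) from the outputs posted above.
timings = {
    'llama-cpp-python': {'sample': 412.64, 'prompt_eval': 450.12, 'eval': 30622.88, 'total': 39541.67},
    'llama.cpp':        {'sample': 426.18, 'prompt_eval': 447.96, 'eval': 29871.26, 'total': 30938.72},
}

for name, t in timings.items():
    accounted = t['sample'] + t['prompt_eval'] + t['eval']
    print(f"{name}: total {t['total']:.0f} ms, "
          f"sample+prompt+eval {accounted:.0f} ms, "
          f"unaccounted {t['total'] - accounted:.0f} ms")

# llama-cpp-python leaves roughly 8 seconds unaccounted for,
# while llama.cpp leaves only about 0.2 seconds.
```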

Another way to check the difference is by using a smartphone as a timer. Just start the timer when you press the "return" key on the keyboard, and then compare the total execution time of the two.

digiwombat commented 1 year ago

Is this not just a function of the handoff between C++ land and Python land? It's pretty common to have a bottleneck when calling functions across languages like that. The per-token inference speed is likely to be reported as the same (since it is measured by llama.cpp itself), but I'm guessing the total time ends up longer because of the internal handoff between Python and the llama binary.

I could be wrong there (I haven't dug super deep into llama-cpp-python, but I looked around a bit). If that's the cause, there's really not much to do other than have llama-cpp-python keep cross-language calls to a minimum, which will be pretty hard since it acts as a binding over llama.cpp rather than as a text ingest point for an independent llama backend.
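
For anyone who wants a rough feel for what a single Python-to-C call costs, here is a small ctypes micro-benchmark (it calls libc's strlen, not llama-cpp-python, so it only illustrates the boundary-crossing overhead itself, not the per-token Python work around it):

```python
import ctypes
import ctypes.util
import time

# Micro-benchmark of the raw Python-to-C boundary crossing (POSIX libc assumed):
# call a trivial C function many times via ctypes and compare it with an
# equally trivial pure-Python function call.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

N = 1_000_000
data = b"hello"

start = time.perf_counter()
for _ in range(N):
    libc.strlen(data)
ffi_seconds = time.perf_counter() - start

def py_len(s):
    return len(s)

start = time.perf_counter()
for _ in range(N):
    py_len(data)
py_seconds = time.perf_counter() - start

print(f"ctypes calls:      {ffi_seconds:.3f} s for {N} calls")
print(f"pure-Python calls: {py_seconds:.3f} s for {N} calls")
```

Each individual crossing is cheap, so whether this adds up depends on how many crossings, and how much Python-side work per token, happen during streaming.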

digiwombat commented 1 year ago

A recently merged PR helps with this, at least on smaller models: https://github.com/abetlen/llama-cpp-python/pull/420

I still saw a significant slowdown relative to the native llama.cpp server on 30B and 65B models, though I'll have to do more thorough testing at some point.

shouyiwang commented 1 year ago

@digiwombat How did you test it? In my tests, the larger the model, the less noticeable the slowdown; the performance difference is within 2-3% for 30B models.

digiwombat commented 1 year ago

@shouyiwang I should state that my testing was very quick and dirty (only 3-4 test generations with each backend per model). My slowdown was closer to 10%, which is still much better than before the change.

I'd say just treat my results as inconclusive, since I haven't done any work to narrow down the cause. It could be any number of things about my setup (ooba API/native server into simple-proxy -> SillyTavern, straight Windows). I've been busy, so I haven't gotten around to any more testing.

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.