su77ungr / CASALIOY

♾️ toolkit for air-gapped LLMs on consumer-grade hardware
Apache License 2.0

Performance Suggestion / Benchmarks #66

Open alxspiker opened 1 year ago

alxspiker commented 1 year ago

Max Threads = Poor Performance on 8 thread processor and GGJT model after convert.py

TL;DR: If you have an 8-thread processor, try setting n_threads to 6 instead of 8. In my tests, 6 threads gave consistently faster results than using all 8. I have been testing a GGJT model to get the best performance out of a small laptop. I ran two tests for each n_threads value, with nothing else open.
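
One way to express the "leave a couple of threads free" idea in code is sketched below. The helper name `suggest_n_threads` and the default `reserve=2` are assumptions for illustration, taken only from the 6-of-8 result on this one laptop; other machines may need a different margin.

```python
import os


def suggest_n_threads(logical_cpus=None, reserve=2):
    """Hypothetical heuristic: use all logical CPUs minus a small reserve.

    reserve=2 mirrors the 6-of-8 observation in this issue; it is an
    assumption, not a measured optimum for other hardware.
    """
    if logical_cpus is None:
        # os.cpu_count() can return None, so fall back to 1.
        logical_cpus = os.cpu_count() or 1
    # Never go below one thread.
    return max(1, logical_cpus - reserve)


print(suggest_n_threads(8))
```

The returned value can be passed as `n_threads` when constructing the model; it is still worth measuring a few values around it.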

Results on an 8-thread CPU

n_threads=1

Test 1

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time = 14464.13 ms
llama_print_timings:      sample time =    20.63 ms /    40 runs   (    0.52 ms per run)
llama_print_timings: prompt eval time = 14463.85 ms /    19 tokens (  761.26 ms per token)
llama_print_timings:        eval time = 38962.48 ms /    39 runs   (  999.04 ms per run)
llama_print_timings:       total time = 57510.54 ms

Test 2

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time = 14054.52 ms
llama_print_timings:      sample time =    24.77 ms /    40 runs   (    0.62 ms per run)
llama_print_timings: prompt eval time = 14054.15 ms /    19 tokens (  739.69 ms per token)
llama_print_timings:        eval time = 50090.37 ms /    39 runs   ( 1284.37 ms per run)
llama_print_timings:       total time = 69022.43 ms

n_threads=2

Test 1

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time =  9662.71 ms
llama_print_timings:      sample time =    22.36 ms /    40 runs   (    0.56 ms per run)
llama_print_timings: prompt eval time =  9662.48 ms /    19 tokens (  508.55 ms per token)
llama_print_timings:        eval time = 25339.74 ms /    39 runs   (  649.74 ms per run)
llama_print_timings:       total time = 39422.48 ms

Test 2

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time = 13699.18 ms
llama_print_timings:      sample time =    27.64 ms /    40 runs   (    0.69 ms per run)
llama_print_timings: prompt eval time = 13698.78 ms /    19 tokens (  720.99 ms per token)
llama_print_timings:        eval time = 27051.24 ms /    39 runs   (  693.62 ms per run)
llama_print_timings:       total time = 46124.61 ms

n_threads=4

Test 1

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time =  9804.36 ms
llama_print_timings:      sample time =    29.62 ms /    40 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =  9803.58 ms /    19 tokens (  515.98 ms per token)
llama_print_timings:        eval time = 22367.64 ms /    39 runs   (  573.53 ms per run)
llama_print_timings:       total time = 38015.92 ms

Test 2

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time =  7894.51 ms
llama_print_timings:      sample time =    23.41 ms /    40 runs   (    0.59 ms per run)
llama_print_timings: prompt eval time =  7894.35 ms /    19 tokens (  415.49 ms per token)
llama_print_timings:        eval time = 17166.80 ms /    39 runs   (  440.17 ms per run)
llama_print_timings:       total time = 29655.03 ms

n_threads=6

Test 1

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time =  8732.21 ms
llama_print_timings:      sample time =    29.93 ms /    40 runs   (    0.75 ms per run)
llama_print_timings: prompt eval time =  8731.88 ms /    19 tokens (  459.57 ms per token)
llama_print_timings:        eval time = 26798.23 ms /    39 runs   (  687.13 ms per run)
llama_print_timings:       total time = 41384.27 ms

Test 2

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time =  4623.47 ms
llama_print_timings:      sample time =    21.79 ms /    40 runs   (    0.54 ms per run)
llama_print_timings: prompt eval time =  4623.19 ms /    19 tokens (  243.33 ms per token)
llama_print_timings:        eval time = 17870.62 ms /    39 runs   (  458.22 ms per run)
llama_print_timings:       total time = 26962.23 ms

n_threads=7 (Seems better than 8, but not as good as 6)

Test 1

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time = 13266.94 ms
llama_print_timings:      sample time =    22.37 ms /    40 runs   (    0.56 ms per run)
llama_print_timings: prompt eval time = 13266.64 ms /    19 tokens (  698.24 ms per token)
llama_print_timings:        eval time = 31370.05 ms /    39 runs   (  804.36 ms per run)
llama_print_timings:       total time = 49092.33 ms

Test 2

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time =  9676.00 ms
llama_print_timings:      sample time =    30.28 ms /    40 runs   (    0.76 ms per run)
llama_print_timings: prompt eval time =  9675.46 ms /    19 tokens (  509.23 ms per token)
llama_print_timings:        eval time = 51035.98 ms /    39 runs   ( 1308.61 ms per run)
llama_print_timings:       total time = 66633.10 ms

n_threads=8 (Max threads)

Test 1

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time = 31573.62 ms
llama_print_timings:      sample time =    23.12 ms /    40 runs   (    0.58 ms per run)
llama_print_timings: prompt eval time = 31573.35 ms /    19 tokens ( 1661.76 ms per token)
llama_print_timings:        eval time = 80649.37 ms /    39 runs   ( 2067.93 ms per run)
llama_print_timings:       total time = 119573.09 ms

Test 2

1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings:        load time = 31926.09 ms
llama_print_timings:      sample time =    22.00 ms /    40 runs   (    0.55 ms per run)
llama_print_timings: prompt eval time = 31925.73 ms /    19 tokens ( 1680.30 ms per token)
llama_print_timings:        eval time = 67654.42 ms /    39 runs   ( 1734.73 ms per run)
llama_print_timings:       total time = 103776.36 ms
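
To compare the runs above without reading every line, the logs can be averaged per thread count. This is a minimal sketch that assumes the output is saved as plain text with the same `n_threads=N` headings and `llama_print_timings` lines shown here; the function name `average_total_ms` is made up for illustration, and the sample only includes the totals for 6 and 8 threads.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches e.g. "llama_print_timings:       total time = 41384.27 ms"
TOTAL_RE = re.compile(r"llama_print_timings:\s+total time =\s*([\d.]+) ms")
# Matches section headings such as "n_threads=8 (Max threads)"
THREADS_RE = re.compile(r"^n_threads=(\d+)")


def average_total_ms(log_text):
    """Return {n_threads: mean total time in ms} for a pasted log."""
    totals = defaultdict(list)
    current = None
    for line in log_text.splitlines():
        heading = THREADS_RE.match(line.strip())
        if heading:
            current = int(heading.group(1))
            continue
        timing = TOTAL_RE.search(line)
        if timing and current is not None:
            totals[current].append(float(timing.group(1)))
    return {n: mean(values) for n, values in totals.items()}


# Total times copied from the n_threads=6 and n_threads=8 runs above.
sample = """\
n_threads=6
llama_print_timings:       total time = 41384.27 ms
llama_print_timings:       total time = 26962.23 ms
n_threads=8 (Max threads)
llama_print_timings:       total time = 119573.09 ms
llama_print_timings:       total time = 103776.36 ms
"""

averages = average_total_ms(sample)
for n, avg in sorted(averages.items()):
    print(f"n_threads={n}: {avg / 1000:.1f} s average total")
```

With only two runs per setting the averages are rough, so large gaps (such as 6 versus 8 threads) are more trustworthy than small ones.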
alxspiker commented 1 year ago

Script used for benchmarking (requires llama-cpp-python==0.1.49):

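import json  # only needed for the optional debug print at the bottom
import argparse

from llama_cpp import Llama

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", type=str, default="./newggjt.bin")
args = parser.parse_args()

# Change n_threads here to benchmark a different thread count.
llm = Llama(model_path=args.model, n_threads=6)

# Stream the completion so tokens are printed as they are generated.
stream = llm(
    "Question: What are the names of the planets in the solar system? Answer: ",
    max_tokens=48,
    stop=["Q:", "\n"],
    stream=True,
)

for output in stream:
    # flush=True shows each token immediately instead of waiting for a newline.
    print(output["choices"][0]["text"], end="", flush=True)
    # print(json.dumps(output, indent=2))
print()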
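
Rather than editing n_threads by hand for each run, the same measurement could be looped. The sketch below is not the script used for the results above: `sweep` and `llama_runner_factory` are hypothetical helpers, and the wall-clock timing covers generation only, since the model is loaded before the timer starts.

```python
import time


def sweep(make_runner, thread_counts, repeats=2):
    """Time `repeats` runs for each thread count.

    make_runner(n) must return a zero-argument callable that performs one
    generation with n threads. Returns {n: [seconds, ...]}.
    """
    results = {}
    for n in thread_counts:
        run = make_runner(n)  # model loading happens here, outside the timer
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run()
            timings.append(time.perf_counter() - start)
        results[n] = timings
    return results


def llama_runner_factory(model_path, prompt):
    """Build a make_runner for llama-cpp-python (imported lazily)."""
    from llama_cpp import Llama

    def make_runner(n_threads):
        llm = Llama(model_path=model_path, n_threads=n_threads)
        # Non-streaming call with the same limits as the script above.
        return lambda: llm(prompt, max_tokens=48, stop=["Q:", "\n"])

    return make_runner
```

A call such as `sweep(llama_runner_factory("./newggjt.bin", prompt), [1, 2, 4, 6, 7, 8])` would then reproduce the matrix of runs, with `prompt` set to the question used above.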