turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

exllamav2 very slow compared to llama-cpp-python...? Or did I do something wrong? #401

Open rsoika opened 2 months ago

rsoika commented 2 months ago

Hi,

I tried to use exllamav2 with Mistral 7B Instruct instead of my llama-cpp-python test implementation. exllamav2 works, but its performance is very slow compared to llama-cpp-python.

To me it looks like the GPU is being ignored entirely. I have CUDA installed and I run the code in an nvidia/cuda Docker container.

This is what my test code looks like (it is based on the examples directory):

import sys, os
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Tokenizer,
)

from exllamav2.generator import (
    ExLlamaV2BaseGenerator,
    ExLlamaV2Sampler
)

import time

# Initialize model and cache

#model_directory =  "/models/Mistral-7B-Instruct-v0.2-5.0-bpw-exl2/"
model_directory =  "/models/Mistral-7B-Instruct-2.5bpw/"
print("Loading model:1 " + model_directory)

config = ExLlamaV2Config(model_directory)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)  # lazy: allocate the cache while the model is loaded
cache.current_seq_len = 0
model.load_autosplit(cache)                 # load weights and split them across available GPU memory
tokenizer = ExLlamaV2Tokenizer(config)

# Initialize generator
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# Generate some text
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.01
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])
prompt = "Our story begins in the Scottish town of Auchtermuchty, where once"

max_new_tokens = 150
generator.warmup()
time_begin = time.time()

output = generator.generate_simple(prompt, settings, max_new_tokens, seed = 1234)

time_end = time.time()
time_total = time_end - time_begin

print(output)
print()
print(f"Response generated in {time_total:.2f} seconds, {max_new_tokens} tokens, {max_new_tokens / time_total:.2f} tokens/second")

Is it necessary to activate the GPU somehow?
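A quick way to rule out a container problem is to check whether the GPU is visible at all from inside it. A minimal sketch, assuming PyTorch with a CUDA build (which exllamav2 requires):

import torch

print("CUDA available:", torch.cuda.is_available())        # should print True
print("CUDA build:    ", torch.version.cuda)               # CUDA version PyTorch was built against
if torch.cuda.is_available():
    print("Device:        ", torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce GTX 1080"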

turboderp commented 2 months ago

What GPU are you using? ExLlama isn't great on older GPUs with poor FP16 performance.

rsoika commented 2 months ago

I am running on a Linux server with an Intel Core i7-7700 CPU and a GeForce GTX 1080. For example, the call to model.load_autosplit(cache) takes more than 3 minutes, and the model I am using is only 2.4 GB. Is this something you would expect in this situation?
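For reference, this is how the load step can be timed in isolation (a minimal sketch that simply wraps the model.load_autosplit(cache) call from the example above in a timer):

import time

time_load = time.time()
model.load_autosplit(cache)   # load weights and split them across available GPU memory
print(f"load_autosplit took {time.time() - time_load:.1f} seconds")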

turboderp commented 2 months ago

Well, I haven't optimized specifically for the 10-series GPUs. Even though the 1080 supports FP16, it runs at about 1/64th the speed of FP32. I have been meaning to add some FP32 fallback kernels to ExLlama, but it's a lot of work and I just haven't found the time yet.
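For anyone who wants to see that gap on their own card, a rough way is to time a plain FP16 vs. FP32 matmul in PyTorch (a sketch, assuming a CUDA build of PyTorch; this benchmarks cuBLAS, not ExLlama's own kernels):

import time
import torch

def matmul_tflops(dtype, n = 4096, iters = 20):
    a = torch.randn(n, n, device = "cuda", dtype = dtype)
    b = torch.randn(n, n, device = "cuda", dtype = dtype)
    a @ b                          # warmup
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return iters * 2 * n ** 3 / (time.time() - t0) / 1e12   # TFLOPS

print(f"FP32: {matmul_tflops(torch.float32):.2f} TFLOPS")
print(f"FP16: {matmul_tflops(torch.float16):.2f} TFLOPS")   # expect a small fraction of FP32 on a GTX 1080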

rsoika commented 2 months ago

OK, thanks for your feedback. It was more for my own understanding of the architecture. I was not aware that my GPU is that 'old' ;-) - no worries, all is fine.

Ph0rk0z commented 2 months ago

Besides FP32, I thought the Pascal series has fast INT8 support. Some places say it's 4x FP32.