turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Increase GPU utilization? #506

Closed: sunflower-leaf closed this issue 1 week ago

sunflower-leaf commented 2 weeks ago

Hi, I'm using exllamav2 to process a large amount of input data, where each data item is a single sentence plus a system message, some user instructions, and example responses. I created a class based on examples/inference.py to handle the prompt formatting and generation:

from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler


class Llama3Interface:
    def __init__(self, model_dir, gpu_split=[30, 30]) -> None:  # gpu_split: VRAM per GPU in GB
        config = ExLlamaV2Config(model_dir=model_dir)
        config.fasttensors = True
        config.prepare()

        self.model = ExLlamaV2(config)
        self.model.load(gpu_split, progress=True)
        self.cache = ExLlamaV2Cache(self.model, max_seq_len=32768, lazy=False)
        self.tokenizer = ExLlamaV2Tokenizer(config)

        self.generator = ExLlamaV2DynamicGenerator(
            model=self.model, cache=self.cache, tokenizer=self.tokenizer
        )
        self.generator.warmup()

    #
    # ... code ...
    #

    def batch_chat_completion(self, dialogs_batch: list, temperature=0.95):
        # build prompts from the dialogs using the chat template (adding special tokens)
        prompts = self.get_prompts(dialogs_batch)

        responses = self.generator.generate(
            prompt=prompts,
            max_new_tokens=1024,
            gen_settings=ExLlamaV2Sampler.Settings(temperature=temperature),
            encode_special_tokens=True,
            completion_only=True
        )
        return responses

Here batch_chat_completion() is called with a list of dialogs (system, user and assistant messages), and the responses are generated as one batch. Batch sizes are typically around 50, and the input prompts are mostly around 1000 tokens.
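The prompts follow the Llama 3 chat template (hence Llama3Interface). A minimal, hypothetical sketch of that formatting step, just for context; the helper name and the dialog structure below are illustrative, not the actual get_prompts() code:

# Hypothetical illustration only; not the actual get_prompts() used above.
# Formats one dialog (a list of {"role": ..., "content": ...} turns) with the
# Llama 3 chat template, leaving the prompt open at an assistant header.
def format_llama3_prompt(dialog):
    prompt = "<|begin_of_text|>"
    for turn in dialog:
        prompt += (
            f"<|start_header_id|>{turn['role']}<|end_header_id|>\n\n"
            f"{turn['content']}<|eot_id|>"
        )
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

With encode_special_tokens=True in generate(), the <|...|> tags are encoded as the model's special tokens rather than as literal text.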

When I run my code, GPU usage on the two A40s sits at around 30-60%, which isn't very high. Is there any way to increase GPU utilization and, as a result, also the throughput? (I tried a larger batch size, but GPU usage didn't really increase.)

Thanks!

turboderp commented 2 weeks ago

It's normal for two GPUs to hover around 50% utilization, because there's no tensor-parallel implementation (yet?). With the weights split layer-wise across the two cards, only one GPU is doing work on a given forward pass at any moment, so each one idles roughly half the time.

It's something I would love to add, but it would be a massive change to the codebase (and the API), so I'm not sure when it's realistic to expect it.
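If you want to see the layer split in action, you can poll per-GPU utilization while a batch is generating; the two cards should largely take turns. A quick sketch using the pynvml bindings (the nvidia-ml-py package and the one-second poll interval are just one convenient choice, nothing exllamav2-specific):

# Simple per-GPU utilization monitor; run in a separate process while generating.
# Requires the nvidia-ml-py package (imported as pynvml).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
        print("  ".join(f"GPU{i}: {u:3d}%" for i, u in enumerate(utils)))
        time.sleep(1.0)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()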

sunflower-leaf commented 2 weeks ago

Thanks for the explanation. Would you mind recommending some libraries that support this, i.e. increasing throughput with parallel generation within a batch?

turboderp commented 1 week ago

AphroditeEngine, vLLM and TGI, to name a few. There's no one-size-fits-all solution, though, and simply increasing GPU utilization may not get you more tokens per second. But try some different ones out. If you write your client against an OpenAI-compatible endpoint, the backends become largely interchangeable, and you can use TabbyAPI for an ExLlamaV2 backend.
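To illustrate the interchangeability, here is a minimal client sketch using the openai Python package against any OpenAI-compatible server; the base URL, API key and model name below are placeholders, not defaults of any particular backend:

# Minimal OpenAI-compatible client; only base_url (and possibly api_key/model)
# changes between backends. The values below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-key-or-dummy")

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Classify this sentence: ..."},
    ],
    temperature=0.95,
    max_tokens=1024,
)
print(response.choices[0].message.content)

Swapping one backend for another is then mostly a matter of changing base_url and the model name.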

sunflower-leaf commented 1 week ago

Yeah, I guess I'll try them out and decide. Thanks for your kind reply!