turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

What might be different between HF generator vs ExLlamav2 generator? #535

Open fahadh4ilyas opened 3 months ago

fahadh4ilyas commented 3 months ago

So, I've been testing text generation with exllamav2, using a config that follows the Hugging Face generator. Here is my script:

import torch, os
from contextlib import contextmanager
from pathlib import Path
from typing import Optional, List, Union, Dict
from transformers import AutoConfig, PretrainedConfig
from transformers.generation.utils import GenerationMixin, GenerationConfig
from transformers.modeling_outputs import CausalLMOutputWithPast

from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Lora

class ExLlamaV2ForCausalLM(GenerationMixin):

    def __init__(
        self,
        config: PretrainedConfig,
        generation_config: GenerationConfig,
        exllama_config: ExLlamaV2Config,
        model: ExLlamaV2,
        loras: Optional[Dict[str, ExLlamaV2Lora]] = None,
        active_adapter: str = '',
        **kwargs
    ):
        self.config = config
        self.generation_config = generation_config
        self.exllama_config = exllama_config
        self.model = model
        # Avoid sharing a mutable default dict between instances
        self.loras = dict(loras) if loras is not None else {}
        if '' not in self.loras:
            self.loras[''] = None
        self._active_adapter = active_adapter
        self._adapter_enabled = True
        if active_adapter == '':
            self.disable_adapter_layers()

    def can_generate(self):
        return True

    @property
    def _supports_cache_class(self) -> bool:
        return False

    @property
    def device(self) -> torch.device:
        return torch.device(0)

    @property
    def main_input_name(self) -> str:
        return 'input_ids'

    @property
    def active_adapters(self) -> List[str]:
        return [self._active_adapter] if self._adapter_enabled else []

    @property
    def active_adapter(self) -> str:
        return self._active_adapter if self._adapter_enabled else ''

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        return {'input_ids': input_ids, **kwargs}

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.Tensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_size: int = -1,
        **kwargs
    ):
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        if attention_mask is None:
            # generate() normally supplies a mask, but guard against a direct call without one
            attention_mask = torch.ones_like(input_ids)
        loras = self.loras.get(self.active_adapter, None)
        loras = [loras] if loras is not None else None

        if labels is None:
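            # Generation path: on the first call, prefill the cache with every token except the last,
            # then forward only the last token against the cache to get next-token logits.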
            if past_key_values is None:
                past_key_values = ExLlamaV2Cache(self.model, input_ids.shape[0], cache_size)
                self.model.forward(input_ids[...,:-1], past_key_values, preprocess_only=True, loras=loras, input_mask=attention_mask[...,:-1].to(torch.bool))

            logits = self.model.forward(input_ids[...,-1:], past_key_values, loras=loras, input_mask=attention_mask.to(torch.bool)).to(input_ids.device)
        else:
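            # Scoring path (labels given): forward the full sequence so the loss can be computed.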
            if past_key_values is None:
                past_key_values = ExLlamaV2Cache(self.model, input_ids.shape[0], cache_size)

            logits = self.model.forward(input_ids, past_key_values, loras=loras, input_mask=attention_mask.to(torch.bool))

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = torch.nn.CrossEntropyLoss()
            shift_logits = shift_logits.view(-1, logits.shape[-1])
            shift_labels = shift_labels.view(-1)
            # Enable model parallelism
            shift_labels = shift_labels.to(shift_logits.device)
            loss = loss_fct(shift_logits, shift_labels)

        if not return_dict:
            output = (logits, past_key_values if use_cache else None)
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(logits=logits, past_key_values=past_key_values if use_cache else None, loss=loss)

    def load_adapter(self, lora_path: Union[str, os.PathLike], adapter_name: str):

        if adapter_name in self.loras:
            raise ValueError('This adapter already exists')

        if isinstance(lora_path, str):
            lora_path = Path(lora_path)

        lora_model = ExLlamaV2Lora.from_directory(self.model, lora_path)

        self.loras[adapter_name] = lora_model

    def set_adapter(self, adapter_name: str):

        if adapter_name not in self.loras:
            raise ValueError('The adapter does not exist')

        self._active_adapter = adapter_name

    def enable_adapter_layers(self):

        self._adapter_enabled = True

    def disable_adapter_layers(self):

        self._adapter_enabled = False

    @contextmanager
    def disable_adapter(self):

        try:
            self.disable_adapter_layers()
            yield
        finally:
            self.enable_adapter_layers()

    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: Union[str, os.PathLike],
        gpu_split: Optional[str] = None,
        lora_path: Optional[Union[str, os.PathLike]] = None,
        adapter_name: str = 'default',
        trust_remote_code: bool = False,
        use_flash_attention_2: bool = False
    ):
        if isinstance(pretrained_model_name_or_path, str):
            pretrained_model_name_or_path = Path(pretrained_model_name_or_path)

        if isinstance(lora_path, str):
            lora_path = Path(lora_path)

        config = AutoConfig.from_pretrained(pretrained_model_name_or_path, trust_remote_code=trust_remote_code)

        try:
            generation_config = GenerationConfig.from_pretrained(pretrained_model_name_or_path, trust_remote_code=trust_remote_code)
        except Exception:
            generation_config = GenerationConfig()

        exllama_config = ExLlamaV2Config()
        exllama_config.model_dir = pretrained_model_name_or_path
        exllama_config.no_flash_attn = not use_flash_attention_2
        if getattr(config, 'rope_scaling', None) is not None:
            if config.rope_scaling['type'] == 'linear':
                exllama_config.scale_pos_emb = config.rope_scaling['factor']
            elif config.rope_scaling['type'] == 'dynamic':
                exllama_config.scale_alpha_value = config.rope_scaling['factor']
            exllama_config.rope_config = config.rope_scaling
        exllama_config.prepare()

        model = ExLlamaV2(exllama_config)
        if gpu_split is not None:
            gpu_split = [float(d) for d in gpu_split.split(' ')]
        model.load(gpu_split=gpu_split)

        lora_model = None
        if lora_path is not None:
            lora_model = ExLlamaV2Lora.from_directory(model, lora_path)

        if lora_model is None:
            adapter_name = ''

        return cls(config, generation_config, exllama_config, model, {adapter_name: lora_model}, adapter_name)

    @staticmethod
    def _reorder_cache(past_key_values: ExLlamaV2Cache, beam_idx):
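        # Reorder the cached key/value states along the batch dimension so each beam keeps its own history.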

        for i in range(len(past_key_values.key_states)):
            past_key_values.key_states[i] = past_key_values.key_states[i].index_select(0, beam_idx.to(past_key_values.key_states[i].device))
            past_key_values.value_states[i] = past_key_values.value_states[i].index_select(0, beam_idx.to(past_key_values.value_states[i].device))

        return past_key_values

Here is what's inside my directory:

my-llama3
├── config.json
├── generation_config.json
├── output.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json

Here is how I generate text:

from transformers import AutoTokenizer

model = ExLlamaV2ForCausalLM.from_pretrained(
    'my-llama3',
    use_flash_attention_2=True)
tokenizer = AutoTokenizer.from_pretrained('my-llama3')

prompt = '<|start_header_id|>system<|end_header_id|>\n\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nApa itu AI?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'

inputs = tokenizer(prompt, return_tensors='pt').to('cuda')

output = model.generate(**inputs)[0]

print(tokenizer.decode(output))

Here is the result from my code:

<|start_header_id|>system<|end_header_id|>

You are an AI assistant that follows instruction extremely well. Help as much as you can.<|eot_id|><|start_header_id|>user<|end_header_id|>

Apa itu AI?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The term "AI" stands for Artificial Intelligence, which refers to the ability of machines or computer systems to perform tasks that typically require human intelligence, such as learning, problem-solving, perception, reasoning, and decision-making.

Artificial Intelligence (AI) is a broad field that encompasses various subfields like Machine Learning, Natural Language Processing, Robotics, Computer Vision, Speech Recognition, etc. It involves creating intelligent agents, i.e., software programs capable of performing specific tasks without explicit instructions from humans.

There are different types of AI:

1. Narrow AI: Also known as Weak AI, it focuses on solving one particular task with high accuracy.
2. General AI: This type of AI aims at achieving human-level intelligence across multiple domains.
3. Superhuman AI: A hypothetical form of AI that surpasses human capabilities in terms of speed, efficiency, and performance.

In recent years, there has been significant progress in developing advanced AI technologies, including deep neural networks, reinforcement learning algorithms, and natural language processing models. These advancements have led to breakthroughs in fields ranging from healthcare to finance, transportation, education, entertainment, and more. However, concerns about potential negative impacts of AI, such as job displacement, privacy violations, and ethical dilemmas, continue to be discussed by experts and policymakers worldwide.

But using your generation example, here is my result:

<|start_header_id|>system<|end_header_id|>

You are an AI assistant that follows instruction extremely well. Help as much as you can.<|eot_id|><|start_header_id|>user<|end_header_id|>

Apa itu AI?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

AI, atau kecerdasan buatan, adalah bidang studi dan teknik yang melibatkan pengembangan mesin yang dapat melakukan tugas-tugas yang biasanya memerlukan kecerdasan manusia, seperti pengambilan keputusan, pemecahan masalah, perencanaan, dan pembelajaran. AI didasarkan pada algoritma dan model statistik yang dapat ditingkatkan secara otomatis dengan data dan umpan balik dari berbagai sumber. AI digunakan dalam berbagai aplikasi, termasuk pembelajaran mesin, pemrosesan bahasa alami, robotika, dan sistem kecerdasan.

What I want is your generator's answer. Is there something wrong with my implementation? Could you help me find it? I actually want to keep using the HF implementation because I want to unify the logits processors across all kinds of models. That's why I wrote this implementation.

turboderp commented 3 months ago

It's hard to guarantee a particular output because it's a dynamic process and the output can vary wildly with small differences in initial conditions or numerical precision. The response you're getting isn't wrong, it's just not in the language you seem to be expecting. Since you haven't instructed the model to respond in Indonesian, and the system prompt is in English, it's likely going to be somewhat random which path the sampler chooses to go down.

Now, I'm not sure which of the examples you're referring to, but the default sampling settings for most of them are:

  • repetition penalty: 1.025 (should be equivalent to the HF implementation, but I'm not 100% on that)
  • temperature: 0.8
  • top-K: 50
  • top-P: 0.8

Since you're not supplying any settings to the HF generator, you'd want to check what the defaults are. I believe it defaults to greedy sampling? If so it could be that the English response is simply more likely (given the English system prompt, for instance), meaning my examples would also give you an English response most of the time, but the randomness allows it to choose differently sometimes.

Anyway, try to match those settings in model.generate. For consistency, though, I would look at the system prompt. Either write it in Indonesian or add an instruction to respond in the language of the question being asked.
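
For reference, a minimal sketch of matching those settings through the HF generate() call (these are standard generate() arguments; max_new_tokens is just an illustrative value):

output = model.generate(
    **inputs,
    do_sample=True,              # sample instead of greedy decoding
    temperature=0.8,
    top_k=50,
    top_p=0.8,
    repetition_penalty=1.025,
    max_new_tokens=512,          # illustrative value, not from the thread
)[0]
print(tokenizer.decode(output))

# To see what generate() falls back to when nothing is supplied:
print(model.generation_config)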

fahadh4ilyas commented 3 months ago

It's hard to guarantee a particular output because it's a dynamic process and the output can vary wildly with small differences in initial conditions or numerical precision. The response you're getting isn't wrong, it's just not in the language you seem to be expecting. Since you haven't instructed the model to respond in Indonesian, and the system prompt is in English, it's likely going to be somewhat random which path the sampler chooses to go down.

Now, I'm not sure which of the examples you're referring to, but the default sampling settings for most of them are:

  • repetition penalty: 1.025 (should be equivalent to the HF implementation, but I'm not 100% on that)
  • temperature: 0.8
  • top-K: 50
  • top-P: 0.8

Since you're not supplying any settings to the HF generator, you'd want to check what the defaults are. I believe it defaults to greedy sampling? If so it could be that the English response is simply more likely (given the English system prompt, for instance), meaning my examples would also give you an English response most of the time, but the randomness allows it to choose differently sometimes.

Anyway, try to match those settings in model.generate. For consistency, though, I would look at the system prompt. Either write it in Indonesian or add an instruction to respond in the language of the question being asked.

My model is already fine-tuned to understand Indonesian and English prompts and to answer accordingly. I already tested it in non-quantized mode and the response is what I want. But somehow my HF implementation of exllamav2 is forcing the model to answer in English. Even after I ask it to answer in Indonesian, it answers the first half in English and then suddenly switches to Indonesian.

But using your generation implementation, that's not the case. It always answers in Indonesian (albeit not as good an answer as the non-quantized version, but still in Indonesian). That's what surprises me and makes me think that maybe there is something wrong with the way I implemented it.
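
For context, the "generation implementation" being compared against here is roughly the repo's basic inference example, sketched below with the default settings quoted above. Which example is meant is an assumption, and class or argument names may differ slightly between versions:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = 'my-llama3'
config.prepare()

model = ExLlamaV2(config)
model.load()
cache = ExLlamaV2Cache(model)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# The default sampling settings quoted above
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.025

print(generator.generate_simple(prompt, settings, 512))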

fahadh4ilyas commented 3 months ago

Now, I'm not sure which of the examples you're referring to, but the default sampling settings for most of them are:

  • repetition penalty: 1.025 (should be equivalent to the HF implementation, but I'm not 100% on that)
  • temperature: 0.8
  • top-K: 50
  • top-P: 0.8

I tested this and the results are even weirder and more nonsensical. Sometimes it repeats the question, sometimes it wanders around like a drunk AI.
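
One way to take the sampler out of the comparison entirely (an assumption, not something suggested above) is to force greedy decoding through the HF wrapper; if the greedy output is still degraded compared to exllamav2's own generator at low temperature, the difference most likely sits in the forward pass or masking rather than in the sampling settings:

# Greedy decoding: deterministic, so remaining differences come from the forward
# pass / cache handling, not from sampling.
output = model.generate(**inputs, do_sample=False, max_new_tokens=256)[0]
print(tokenizer.decode(output))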