turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Weird issue with context length #220

Open zzzacwork opened 11 months ago

zzzacwork commented 11 months ago

First of all, thanks a lot for this great project!

I got a weird issue when generating with Llama 2 at a 4096-token context using generator.generate_simple:

  File "/codebase/research/exllama/model.py", line 556, in forward
    cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (3313) + length (808) exceeds 

As I understand the code, it already limits the number of new tokens to stay under the context limit. Are there any settings I might need to change?

turboderp commented 11 months ago

What is the sequence length set to in the model config? Maybe something weird is happening if you haven't changed it from the default (2048), and it tries to generate a negative number of tokens.
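
For illustration, here is a rough sketch of that failure mode, assuming the generator budgets new tokens against the configured sequence length (this is not the actual exllama code, just the arithmetic):

max_seq_len = 2048     # ExLlamaConfig default
prompt_len = 3313      # roughly the past_len from the traceback above
max_new_tokens = 512   # hypothetical request

budget = max_seq_len - prompt_len             # -1265, already negative
clamped = min(max_new_tokens, budget)         # still -1265: a negative token count
print(f"budget={budget}, clamped={clamped}")  # the later .narrow() on the cache then fails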

zzzacwork commented 11 months ago

Thanks for the reply.

{
    "architectures": [
        "LlamaForCausalLM"
    ],
    "bos_token_id": 1,
    "eos_token_id": 2,
    "hidden_act": "silu",
    "hidden_size": 8192,
    "initializer_range": 0.02,
    "intermediate_size": 28672,
    "max_position_embeddings": 4096,
    "max_length": 4096,
    "model_type": "llama",
    "num_attention_heads": 64,
    "num_hidden_layers": 80,
    "num_key_value_heads": 8,
    "pad_token_id": 0,
    "pretraining_tp": 1,
    "rms_norm_eps": 1e-05,
    "rope_scaling": null,
    "tie_word_embeddings": false,
    "torch_dtype": "float16",
    "transformers_version": "4.32.0.dev0",
    "use_cache": true,
    "vocab_size": 32000
}

Here is the model config file. I got the model from Llama-2-70B-chat-gptq.

turboderp commented 11 months ago

Is there more of this error message?

  File "/codebase/research/exllama/model.py", line 556, in forward
    cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (3313) + length (808) exceeds 

It looks like it's been cut off.

Also, the line number is weird. Has something else been modified in model.py, because ExLlamaAttention.forward ends on line 502?

w013nad commented 11 months ago

I got a similar error. It seems to come from putting too many tokens into the model; my prompt was about 5k words.

  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\example_flask.py", line 48, in inferContextP
    outputs = generator.generate_simple(prompt, max_new_tokens=16000)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\generator.py", line 316, in generate_simple
    self.gen_begin(ids, mask = mask)
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\generator.py", line 186, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora, input_mask = mask)
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 967, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 1053, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 536, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 440, in forward
    new_keys = cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (2048) + length (1265) exceeds dimension size (2048).

I'm using the_bloke/vicuna-13B-v1.5-16K-GPTQ, which is supposed to be a 16k-context model, so it should be able to handle it. At any rate, these are the relevant portions of config.json:

    "max_sequence_length": 16384,
    "max_position_embeddings": 4096,

What worked for me was changing the parameters on lines 82-87 in model.py:

        self.max_seq_len = 16384  # Reduce to save memory. Can also be increased, ideally while also using compress_pos_emb and a compatible model/LoRA
        self.max_input_len = 4096  # Maximum length of input IDs in a single forward pass. Sequences longer than this will be processed in multiple steps
        self.max_attention_size = 2048**2  # Sequences will be processed in chunks to keep the size of the attention weights matrix <= this
        self.compress_pos_emb = 4.0  # Increase to compress positional embeddings applied to sequence

Previously, these were 2048, 2048, 4096, and 1.0, respectively. This worked and seems to give reasonable results, but I'm not sure if it's the correct way to go about it.
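
For reference, a minimal sketch of applying the same overrides at runtime on the ExLlamaConfig object instead of hard-coding them in model.py (attribute names are taken from the snippets in this thread; the path is a placeholder):

from model import ExLlamaConfig  # exllama's model.py

config = ExLlamaConfig("/path/to/config.json")  # placeholder path
config.max_seq_len = 16384             # total context the cache is sized for
config.max_input_len = 4096            # longest chunk for a single forward pass
config.max_attention_size = 2048 ** 2  # cap on the attention weights matrix
config.compress_pos_emb = 4.0          # 16384 / 4096 for a model trained at 4096 positions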

Rajmehta123 commented 9 months ago

@w013nad Where do you define those changes? In the source code or generator model settings?

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file

model = ExLlama(config)                                 # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)            # create tokenizer from tokenizer model file

cache = ExLlamaCache(model)                             # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)   # create generator

# Configure generator

generator.disallow_tokens([tokenizer.eos_token_id])

generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.05
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5
generator.settings.max_seq_len = 16000
# Produce a simple generation

output = generator.generate_simple(prompt_template, max_new_tokens = 500)

I am using the same model, but I'm getting the following error:

RuntimeError: start (2048) + length (1265) exceeds dimension size (2048).

turboderp commented 9 months ago

@w013nad You wouldn't need to hard-code new values into the config class. You can just override the values after creating the config.

Also, it looks like that config file is incorrect. "max_sequence_length" and "max_position_embeddings" should mean the same thing, or at least I don't know how to interpret those values if they're different.

The max_input_len argument means specifically the longest sequence to allow during a forward pass. Longer sequences will be chunked into portions of this length to reduce VRAM usage during inference and to make the VRAM requirement predictable, which is sort of required when splitting the model across multiple devices. But max_attention_size imposes an additional restriction on the chunk length. In short, setting max_input_len > sqrt(max_attention_size) just wastes a bit of VRAM.
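
Concretely, with the numbers from the workaround above (a rough illustration of the relationship, not exllama's exact chunking logic):

import math

max_attention_size = 2048 ** 2              # value used in the workaround above
chunk_cap = math.isqrt(max_attention_size)  # 2048
# With max_input_len = 4096 but max_attention_size = 2048**2, chunks are still
# effectively capped around 2048 tokens, so the larger max_input_len mostly just
# reserves extra VRAM. To genuinely run 4096-token chunks you would also need
# max_attention_size = 4096 ** 2.
print(chunk_cap)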

@Rajmehta123 The max_seq_len parameter is in the ExLlamaConfig object, not the generator settings.
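
A sketch of the corrected setup from the snippet above, moving the length-related values onto the config before the model is built (the compress_pos_emb value is assumed from the earlier workaround for this 16k model):

config = ExLlamaConfig(model_config_path)        # create config from config.json
config.model_path = model_path                   # supply path to model weights file
config.max_seq_len = 16384                       # set here, not on generator.settings
config.compress_pos_emb = 4.0                    # assumed: 16384 / 4096 native positions

model = ExLlama(config)                          # cache sizing now follows max_seq_len
tokenizer = ExLlamaTokenizer(tokenizer_path)     # create tokenizer from tokenizer model file
cache = ExLlamaCache(model)                      # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)

output = generator.generate_simple(prompt_template, max_new_tokens = 500)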