**zzzacwork** opened 11 months ago
What is the sequence length set to in the model config? Maybe something weird is happening if you haven't changed it from the default (2048), and it tries to generate a negative number of tokens.
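To illustrate the arithmetic being suggested here (a hypothetical sketch, not exllama's actual internals): if the prompt is longer than the configured context, the budget for new tokens goes negative.

```python
# Hypothetical sketch of the suspected failure mode; the function name and
# logic are illustrative, not exllama's actual code.
def remaining_tokens(max_seq_len: int, prompt_len: int) -> int:
    # Budget for new tokens = total context minus tokens already in the prompt.
    return max_seq_len - prompt_len

print(remaining_tokens(2048, 3313))  # negative with the default 2048 context
```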
Thanks for the reply.
```json
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 4096,
  "max_length": 4096,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.32.0.dev0",
  "use_cache": true,
  "vocab_size": 32000
}
```
Here is the model config file. I got the model from Llama-2-70B-chat-gptq.
Is there more of this error message?
```
File "/codebase/research/exllama/model.py", line 556, in forward
  cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (3313) + length (808) exceeds
```
It looks like it's been cut off.
Also, the line number is odd. Has something else been modified in `model.py`? `ExLlamaAttention.forward` ends on line 502.
I got a similar error. It seems to come from putting too many tokens into the model; I was feeding it about 5k words.
```
File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 2190, in wsgi_app
  response = self.full_dispatch_request()
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1486, in full_dispatch_request
  rv = self.handle_user_exception(e)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1484, in full_dispatch_request
  rv = self.dispatch_request()
  ^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1469, in dispatch_request
  return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\example_flask.py", line 48, in inferContextP
  outputs = generator.generate_simple(prompt, max_new_tokens=16000)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\generator.py", line 316, in generate_simple
  self.gen_begin(ids, mask = mask)
File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\generator.py", line 186, in gen_begin
  self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora, input_mask = mask)
File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 967, in forward
  r = self._forward(input_ids[:, chunk_begin : chunk_end],
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 1053, in _forward
  hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 536, in forward
  hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 440, in forward
  new_keys = cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (2048) + length (1265) exceeds dimension size (2048).
```
I'm using TheBloke/vicuna-13B-v1.5-16K-GPTQ, which is supposed to be a 16k-context model, so it should be able to handle this. At any rate, these are the relevant portions of the config.json:
```json
"max_sequence_length": 16384,
"max_position_embeddings": 4096,
```
What I found that worked was changing the parameters on lines 82-87 of model.py:
```python
self.max_seq_len = 16384           # Reduce to save memory. Can also be increased, ideally while also using compress_pos_emb and a compatible model/LoRA
self.max_input_len = 4096          # Maximum length of input IDs in a single forward pass. Sequences longer than this will be processed in multiple steps
self.max_attention_size = 2048**2  # Sequences will be processed in chunks to keep the size of the attention weights matrix <= this
self.compress_pos_emb = 4.0        # Increase to compress positional embeddings applied to sequence
```
Previously, these were 2048, 2048, 4096, and 1.0 respectively. This worked and seems to give reasonable results but I'm not sure if it's the correct way to go about it.
@w013nad Where do you define those changes? In the source code or generator model settings?
```python
config = ExLlamaConfig(model_config_path)              # create config from config.json
config.model_path = model_path                         # supply path to model weights file

model = ExLlama(config)                                # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)           # create tokenizer from tokenizer model file

cache = ExLlamaCache(model)                            # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)  # create generator

# Configure generator
generator.disallow_tokens([tokenizer.eos_token_id])
generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.05
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5
generator.settings.max_seq_len = 16000

# Produce a simple generation
output = generator.generate_simple(prompt_template, max_new_tokens = 500)
```
I am using the same model but getting the following error:

```
RuntimeError: start (2048) + length (1265) exceeds dimension size (2048).
```
@w013nad You wouldn't need to hard-code new values into the config class. You can just override the values after creating the config.
Also, it looks like that config file is incorrect. `max_sequence_length` and `max_position_embeddings` should mean the same thing, or at least I don't know how to interpret those values if they're different.
The `max_input_len` argument means specifically the longest sequence to allow during a forward pass. Longer sequences are chunked into portions of this length to reduce VRAM usage during inference, and to make the VRAM requirement predictable, which is more or less required when splitting the model across multiple devices. But `max_attention_size` imposes an additional restriction on the chunk length. In short, setting `max_input_len > sqrt(max_attention_size)` just wastes a bit of VRAM.
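The interaction between the two limits can be sketched roughly as follows. This is an illustrative model, not exllama's actual chunking code: it assumes a chunk of `q_len` query tokens attending over `past_len + q_len` keys produces an attention matrix of about `q_len * (past_len + q_len)` elements.

```python
# Illustrative sketch (not exllama's exact code) of how max_attention_size
# caps the chunk length below max_input_len.
def effective_chunk_len(max_input_len: int, max_attention_size: int, past_len: int) -> int:
    # Largest q_len such that q_len * (past_len + q_len) <= max_attention_size,
    # additionally capped by max_input_len.
    q_len = max_input_len
    while q_len > 1 and q_len * (past_len + q_len) > max_attention_size:
        q_len -= 1
    return q_len

# With the default max_attention_size = 2048**2 and no past context,
# any max_input_len above sqrt(2048**2) = 2048 is clipped back to 2048:
print(effective_chunk_len(4096, 2048**2, past_len=0))  # → 2048
```

This is why raising `max_input_len` past `sqrt(max_attention_size)` only allocates larger buffers without ever processing larger chunks.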
@Rajmehta123 The `max_seq_len` parameter is on the `ExLlamaConfig` object, not the generator settings.
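A minimal sketch of that override, assuming the usual `ExLlamaConfig` setup shown earlier in the thread (a `SimpleNamespace` stands in for the config object here so the snippet runs without the exllama repo on the path):

```python
from types import SimpleNamespace

# In real code this would be: config = ExLlamaConfig(model_config_path)
config = SimpleNamespace()

# Override after creating the config, before constructing ExLlama(config):
config.max_seq_len = 16384             # total context the cache is sized for
config.compress_pos_emb = 16384 / 4096  # target length / native positions = 4.0

print(config.max_seq_len, config.compress_pos_emb)  # → 16384 4.0
```

The point is that the values live on the config instance, so there is no need to edit the defaults in model.py.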
First of all, thanks a lot for this great project!
I got a weird issue when generating with Llama 2 at 4096 context using `generator.generate_simple`. As I understand the code, it already limits the number of new tokens to stay under the context limit. Are there any settings I might need to change?