turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Error after the generation. AssertionError: Total sequence length exceeds cache size in model.forward #105

Closed Rajmehta123 closed 5 months ago

Rajmehta123 commented 1 year ago

This is my code. The input IDs are shorter than the max seq len of 4096.

Error:

  File "/home/ec2-user/anaconda3/envs/obaga/lib/python3.10/site-packages/exllamav2/generator/streaming.py", line 111, in stream
    next_token, eos = self._gen_single_token(self.settings)
  File "/home/ec2-user/anaconda3/envs/obaga/lib/python3.10/site-packages/exllamav2/generator/streaming.py", line 203, in _gen_single_token
    logits = self.model.forward(self.sequence_ids[:, -1:], self.cache).float().cpu()
  File "/home/ec2-user/anaconda3/envs/obaga/lib/python3.10/site-packages/exllamav2/model.py", line 339, in forward
    assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
AssertionError: Total sequence length exceeds cache size in model.forward

Code:

  generated_tokens = 0
  ans = ''
  max_new_tokens = 512
  while True:
      chunk, eos, _ = generator.stream()
      generated_tokens += 1
      if eos or generated_tokens == max_new_tokens:
          break
      print (chunk, end = "")
      sys.stdout.flush()
      yield chunk
turboderp commented 1 year ago

Do you get any output from the model before that error message? My guess would be that either the max sequence length isn't being set in the right place (it should be set after config.prepare(), or else your value will be overwritten when the defaults are read from the model config), or the input IDs plus the extra 512 tokens exceed the cache size.
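For illustration, a minimal sketch of such a pre-flight check, reusing the generator, cache and tokenizer names from the snippets in this thread (the prompt string and the 512-token budget here are assumptions, not part of the original code):

  # Hypothetical guard: make sure the prompt plus the planned completion fits in the cache
  max_new_tokens = 512
  input_ids = tokenizer.encode(prompt)
  assert input_ids.shape[-1] + max_new_tokens <= cache.max_seq_len, \
      "prompt + completion would exceed cache.max_seq_len; raise config.max_seq_len or shorten the prompt"

  generator.begin_stream(input_ids, gen_settings)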

overlordiam commented 10 months ago

I am facing the same problem as @Rajmehta123.

Code:

class QuizGenerator():

  def generate_template(self, question_type, format_instructions):

    template = f"""
    <s>[INST]You are an expert at creating scenario based {question_type} and answers based on the given text and documentation.
    Your goal is to make users acquainted with the content of the text through the questions you generate.
    Also, for each question, give the correct answer and don't use incomplete sentences as context.
    Do not repeat the same question. Do not rephrase/jumble the same question multiple times.
    You should also make sure to include the question and solution in one single post and not split them apart.
    Make sure not to lose any important information.
    Set the difficulty of the questions to {self.difficulty} out of easy/medium/difficult. Make sure all questions are set to the specified difficulty.
    Create {self.number_of_questions_per_chunk} questions and return the correct answer. Do not produce more or less questions than asked.

    The input text given is:

    ------------
    {self.text}
    ------------

    [/INST]</s>"""

    return template

  def generate_template_and_format(self, ):

    match self.question_type.lower():
        case "mcq":
            response_schema = [
                ResponseSchema(name="question", description="A multiple choice question generated from input text snippet."),
                ResponseSchema(name="option_1", description="First option for the multiple choice question. Use this format: 'a) option'"),
                ResponseSchema(name="option_2", description="Second option for the multiple choice question. Use this format: 'b) option'"),
                ResponseSchema(name="option_3", description="Third option for the multiple choice question. Use this format: 'c) option'"),
                ResponseSchema(name="option_4", description="Fourth option for the multiple choice question. Use this format: 'd) option'"),
                ResponseSchema(name="answer", description="Correct answer for the question. Use this format: 'd) option' or 'b) option', etc.")
            ]
            format_instructions = self.parse_output(response_schema)
            # print(f"format_inst: {format_instructions}")
            template = self.generate_template("Multiple Choice Questions", format_instructions)

    return template

  def create_llm_chain(self, ):
    template, format_instructions = self.generate_template_and_format()
    prompt = PromptTemplate(
        input_variables=["text"],
        template=template,
        partial_variables={"format_instructions": format_instructions},
    )

    llm_chain = LLMChain(llm=self.llm, prompt=prompt)
    # print(f"template: {template}")
    # print(f"format_instructions: {format_instructions}")
    return llm_chain

  # Create questions based on the entire input text when len(input_text) > LLM context window
  def generate_qa(self, file_path=None, text=None):

    if file_path and text:
      print("Input either a file or text data. Not both.")
      return

    elif file_path:
      chunks = self.file_processing(file_path, 5000, 50)

    elif text:
      # print(chunks[0])
      # print(f"text: {text}")
      chunks = self.text_splitter(text, 5000, 50)

    else:
        print("Please provide either a file path or text data.")
        return

    self.number_of_questions_per_chunk = math.ceil(self.number_of_questions / len(chunks))
    print(f"Length of chunks: {len(chunks)}")
    # llm_chain = self.create_llm_chain()

    start = time.time()
    results = []
    for chunk in chunks:
      self.text = chunk
      prompt = self.generate_template_and_format()
      # print(prompt)
      instruction_ids = tokenizer.encode(f"[INST] {prompt} {chunk} [/INST]", add_bos=True)
      context_ids = instruction_ids if generator.sequence_ids is None \
        else torch.cat([generator.sequence_ids, instruction_ids], dim = -1)

      generator.begin_stream(context_ids, gen_settings)

      response = ""
      while True:
          chunk, eos, _ = generator.stream()
          if eos: break
          print(chunk, end = "")
          sys.stdout.flush()

Error:

AssertionError                            Traceback (most recent call last)
<ipython-input-19-bae211f1c8a9> in <cell line: 2>()
      1 qa = QuizGenerator(model, "open-ended", 2, "difficult")
----> 2 text = qa.generate_qa(file_path="./ORCA User Types.pdf", prompt=prompt)

4 frames
<ipython-input-18-a5cf9ef3cacc> in generate_qa(self, file_path, text, prompt)
    178       response = ""
    179       while True:
--> 180           chunk, eos, _ = generator.stream()
    181           if eos: break
    182           print(chunk, end = "")

/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py in stream(self)
    141         # Generate a single token and append to the sequence
    142 
--> 143         next_token, eos = self._gen_single_token(self.settings)
    144 
    145         # End immediately if it was a stop token

/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py in _gen_single_token(self, gen_settings, prefix_token)
    309         if self.draft_model is None:
    310 
--> 311             logits = self.model.forward(self.sequence_ids[:, -1:], self.cache, loras = self.active_loras).float().cpu()
    312             token, _, eos = ExLlamaV2Sampler.sample(logits, gen_settings, self.sequence_ids, random.random(), self.tokenizer, prefix_token)
    313 

/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py in decorate_context(*args, **kwargs)
    113     def decorate_context(*args, **kwargs):
    114         with ctx_factory():
--> 115             return func(*args, **kwargs)
    116 
    117     return decorate_context

/usr/local/lib/python3.10/dist-packages/exllamav2/model.py in forward(self, input_ids, cache, input_mask, preprocess_only, last_id_only, loras, return_last_state, position_offsets)
    555 
    556         past_len = cache.current_seq_len
--> 557         assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
    558 
    559         # Split sequence

AssertionError: Total sequence length exceeds cache size in model.forward

How do I set the max sequence length? Also, the model is not generating based on the prompt: I change the prompt to generate true-false questions and it still produces multiple-choice questions, or I tell it to produce 2 questions and it still produces 10. When I run it without exllama, it follows my instructions. Is my initialization wrong in any way?

turboderp commented 9 months ago

If you're using <s> in the input to tokenizer.encode, you need to call it with encode_special_tokens=True and without add_bos=True. You've also got the prompt template applied twice: first you add the [INST] etc. tags to the prompt string, and then you add them again when encoding it.
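For illustration, a sketch of an encode call along those lines, where the instruction variable is a placeholder for the plain prompt text (the template is applied only once, and <s> is parsed from the string rather than adding a second BOS):

  # Hypothetical corrected call: template applied once, special tokens parsed from the text
  prompt = f"<s>[INST] {instruction} [/INST]"
  instruction_ids = tokenizer.encode(prompt, encode_special_tokens = True)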

You can set max_seq_len in the config right after config.prepare but before model.load. The default value is whatever is specified in the model's config.json.
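Roughly, based on the loading code posted in this thread, the order would look like this (the 8192 value and the model_directory path are placeholders, not recommendations):

  from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

  config = ExLlamaV2Config()
  config.model_dir = model_directory           # path to the quantized model
  config.prepare()
  config.max_seq_len = 8192                    # override here: after prepare(), before loading

  model = ExLlamaV2(config)
  cache = ExLlamaV2Cache(model, lazy = True)   # cache is sized from config.max_seq_len
  model.load_autosplit(cache)
  tokenizer = ExLlamaV2Tokenizer(config)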

nprasanthi7 commented 9 months ago
class MixtralTextGeneration():
    def __init__(self):        
        model_directory =  "/data/Prasanthi/text-generation-webui/models/turboderp_Mixtral-8x7B-instruct-exl2_3.0bpw"
        config = ExLlamaV2Config()
        config.model_dir = model_directory
        config.prepare()
        # Initialize model and cache
        self.model = ExLlamaV2(config)      
        self.cache = ExLlamaV2Cache(self.model, lazy = True)
        self.model.load_autosplit(self.cache)
        self.tokenizer = ExLlamaV2Tokenizer(config)
        # Initialize generator
        self.generator = ExLlamaV2StreamingGenerator(self.model, self.cache, self.tokenizer)
        self.generator.set_stop_conditions([self.tokenizer.eos_token_id])
        self.gen_settings = ExLlamaV2Sampler.Settings()

    def generate_text(self,prompt):
        prompt=prompt.strip()
        # print(prompt)
        instruction_ids = self.tokenizer.encode(f"[INST] {prompt} [/INST]", add_bos = True)
        context_ids = instruction_ids if self.generator.sequence_ids is None \
        else torch.cat([self.generator.sequence_ids, instruction_ids], dim = -1)
        start_time=time.time()
        self.generator.begin_stream(context_ids, self.gen_settings)
        display_text=''

        while True:
            chunk, eos, _ = self.generator.stream()
            if eos: break
            # print(chunk, end = "")
            display_text+=chunk
            # sys.stdout.flush()
        end_time=time.time()
        print(display_text)
        answer={}
        answer['response']=str(display_text).strip()
        answer['time']=round(end_time-start_time,2)
        return answer
load_mixtral=MixtralTextGeneration()

This is my code and I am facing the same error:

  Traceback (most recent call last):
    File "/data/Prasanthi/myenv/lib/python3.11/site-packages/tornado/web.py", line 1784, in _execute
      result = method(*self.path_args, **self.path_kwargs)
    File "/data/Prasanthi/mixtral_deploy.py", line 76, in post
      response=load_mixtral.generate_text(input)
    File "/data/Prasanthi/mixtral_deploy.py", line 50, in generate_text
      self.generator.begin_stream(context_ids, self.gen_settings)
    File "/data/Prasanthi/myenv/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 88, in begin_stream
      self._gen_begin_reuse(input_ids, gen_settings)
    File "/data/Prasanthi/myenv/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 283, in _gen_begin_reuse
      if reuse < in_tokens.shape[-1]: self._gen_feed_tokens(in_tokens[:, reuse:], gen_settings)
    File "/data/Prasanthi/myenv/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 299, in _gen_feed_tokens
      self.model.forward(self.sequence_ids[:, start : -1], self.cache, preprocess_only = True, loras = self.active_loras)
    File "/data/Prasanthi/myenv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
      return func(*args, **kwargs)
    File "/data/Prasanthi/myenv/lib/python3.11/site-packages/exllamav2/model.py", line 557, in forward
      assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
  AssertionError: Total sequence length exceeds cache size in model.forward

Could you please help me resolve this error?

Artem-B commented 7 months ago

I think this issue may have something to do with uninitialized config cache. I can reliably reproduce the issue by trying to load (in oobabooga, using 4-bit cache) FluffyKaeloky/Midnight-Miqu-103B-v1.5-exl2-3.0bpw-rpcal as the very first model after textui startup.

AFAICT, it reliably happens as soon as the input size reaches 2K. For the model above, the limit should have been 32764.

However, if I first load and then unload a different model (Dracones/Midnight-Miqu-70B-v1.5_exl2_4.0bpw, also using 4-bit cache), subsequent loading of FluffyKaeloky/Midnight-Miqu-103B-v1.5-exl2-3.0bpw-rpcal works fine.

Loading the other model apparently puts the right values in cache.max_seq_len, and everything works fine up to the full 32K context.
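If you are driving exllamav2 directly rather than through the web UI, a quick diagnostic sketch (attribute names as they appear in the assertion above; this only inspects the state, it is not a fix) is to compare the two limits after loading:

  # A stale 2048 here would match the behaviour described above
  print("config.max_seq_len:", config.max_seq_len)
  print("cache.max_seq_len: ", cache.max_seq_len)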

oldgithubman commented 7 months ago

> I think this issue may have something to do with uninitialized config cache. I can reliably reproduce the issue by trying to load (in oobabooga, using 4-bit cache) FluffyKaeloky/Midnight-Miqu-103B-v1.5-exl2-3.0bpw-rpcal as the very first model after textui startup.
>
> AFAICT, it reliably happens as soon as the input size reaches 2K. For the model above, the limit should have been 32764.
>
> However, if I first load and then unload a different model (Dracones/Midnight-Miqu-70B-v1.5_exl2_4.0bpw, also using 4-bit cache), subsequent loading of FluffyKaeloky/Midnight-Miqu-103B-v1.5-exl2-3.0bpw-rpcal works fine.
>
> Loading the other model apparently puts the right values in cache.max_seq_len, and everything works fine up to the full 32K context.

Can confirm

bablat commented 7 months ago

I am seeing the same issue. Another side effect is that when loading a quantized model in ooba, it always identifies the model as having a 2k truncation_length, regardless of the actual model parameters.

turboderp commented 7 months ago

See https://github.com/oobabooga/text-generation-webui/issues/5750#issuecomment-2024442282