Closed: Rajmehta123 closed this issue 5 months ago
Do you get any output from the model before that error message? My guess would be that either the max sequence length is not set in the right place (it should be set after config.prepare(), or else your value will be overwritten when it reads defaults from the model config), or the input IDs plus the extra 512 tokens exceed the cache size.
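To tell those two causes apart, a quick check along these lines can help (just a sketch; tokenizer, cache and the 512-token reserve stand in for whatever your setup actually uses):

```python
# Hypothetical sanity check: the encoded prompt plus the tokens reserved for
# the response must fit inside the cache, or model.forward will assert.
reserve = 512  # stand-in for the extra tokens mentioned above
ids = tokenizer.encode(prompt)
print(f"prompt tokens: {ids.shape[-1]}, cache size: {cache.max_seq_len}")
if ids.shape[-1] + reserve > cache.max_seq_len:
    print("Prompt + reserve exceeds the cache; raise max_seq_len or shorten the prompt.")
```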
I am facing the same problem as @Rajmehta123.
Code:
class QuizGenerator():

    def generate_template(self, question_type, format_instructions):
        template = f"""
<s>[INST]You are an expert at creating scenario based {question_type} and answers based on the given text and documentation.
Your goal is to make users acquainted with the content of the text through the questions you generate.
Also, for each question, give the correct answer and don't use incomplete sentences as context.
Do not repeat the same question. Do not rephrase/jumble the same question multiple times.
You should also make sure to include the question and solution in one single post and not split them apart.
Make sure not to lose any important information.
Set the difficulty of the questions to {self.difficulty} out of easy/medium/difficult. Make sure all questions are set to the specified difficulty.
Create {self.number_of_questions_per_chunk} questions and return the correct answer. Do not produce more or less questions than asked.
The input text given is:
------------
{self.text}
------------
[/INST]</s>"""
        return template

    def generate_template_and_format(self, ):
        match self.question_type.lower():
            case "mcq":
                response_schema = [
                    ResponseSchema(name="question", description="A multiple choice question generated from input text snippet."),
                    ResponseSchema(name="option_1", description="First option for the multiple choice question. Use this format: 'a) option'"),
                    ResponseSchema(name="option_2", description="Second option for the multiple choice question. Use this format: 'b) option'"),
                    ResponseSchema(name="option_3", description="Third option for the multiple choice question. Use this format: 'c) option'"),
                    ResponseSchema(name="option_4", description="Fourth option for the multiple choice question. Use this format: 'd) option'"),
                    ResponseSchema(name="answer", description="Correct answer for the question. Use this format: 'd) option' or 'b) option', etc.")
                ]
                format_instructions = self.parse_output(response_schema)
                # print(f"format_inst: {format_instructions}")
                template = self.generate_template("Multiple Choice Questions", format_instructions)
        return template

    def create_llm_chain(self, ):
        template, format_instructions = self.generate_template_and_format()
        prompt = PromptTemplate(
            input_variables=["text"],
            template=template,
            partial_variables={"format_instructions": format_instructions},
        )
        llm_chain = LLMChain(llm=self.llm, prompt=prompt)
        # print(f"template: {template}")
        # print(f"format_instructions: {format_instructions}")
        return llm_chain

    # Create questions based on the entire input text when len(input_text) > LLM context window
    def generate_qa(self, file_path=None, text=None):
        if file_path and text:
            print("Input either a file or text data. Not both.")
            return
        elif file_path:
            chunks = self.file_processing(file_path, 5000, 50)
        elif text:
            # print(chunks[0])
            # print(f"text: {text}")
            chunks = self.text_splitter(text, 5000, 50)
        else:
            print("Please provide either a file path or text data.")
            return
        self.number_of_questions_per_chunk = math.ceil(self.number_of_questions / len(chunks))
        print(f"Length of chunks: {len(chunks)}")
        # llm_chain = self.create_llm_chain()
        start = time.time()
        results = []
        for chunk in chunks:
            self.text = chunk
            prompt = self.generate_template_and_format()
            # print(prompt)
            instruction_ids = tokenizer.encode(f"[INST] {prompt} {chunk} [/INST]", add_bos=True)
            context_ids = instruction_ids if generator.sequence_ids is None \
                else torch.cat([generator.sequence_ids, instruction_ids], dim = -1)
            generator.begin_stream(context_ids, gen_settings)
            response = ""
            while True:
                chunk, eos, _ = generator.stream()
                if eos: break
                print(chunk, end = "")
                sys.stdout.flush()
Error:
AssertionError Traceback (most recent call last)
<ipython-input-19-bae211f1c8a9> in <cell line: 2>()
1 qa = QuizGenerator(model, "open-ended", 2, "difficult")
----> 2 text = qa.generate_qa(file_path="./ORCA User Types.pdf", prompt=prompt)
4 frames
<ipython-input-18-a5cf9ef3cacc> in generate_qa(self, file_path, text, prompt)
178 response = ""
179 while True:
--> 180 chunk, eos, _ = generator.stream()
181 if eos: break
182 print(chunk, end = "")
/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py in stream(self)
141 # Generate a single token and append to the sequence
142
--> 143 next_token, eos = self._gen_single_token(self.settings)
144
145 # End immediately if it was a stop token
/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py in _gen_single_token(self, gen_settings, prefix_token)
309 if self.draft_model is None:
310
--> 311 logits = self.model.forward(self.sequence_ids[:, -1:], self.cache, loras = self.active_loras).float().cpu()
312 token, _, eos = ExLlamaV2Sampler.sample(logits, gen_settings, self.sequence_ids, random.random(), self.tokenizer, prefix_token)
313
/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py in decorate_context(*args, **kwargs)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
116
117 return decorate_context
/usr/local/lib/python3.10/dist-packages/exllamav2/model.py in forward(self, input_ids, cache, input_mask, preprocess_only, last_id_only, loras, return_last_state, position_offsets)
555
556 past_len = cache.current_seq_len
--> 557 assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
558
559 # Split sequence
AssertionError: Total sequence length exceeds cache size in model.forward
How do I set the max sequence length? Also, the model is not generating based on the prompt. If I change the prompt to generate true/false questions, it still produces multiple choice questions; if I tell it to produce 2 questions, it still produces 10. When I run it without exllama, it follows my instructions. Is my initialization wrong in any way?
If you're using <s> in the input to tokenizer.encode, you need to call it with encode_special_tokens=True and without add_bos=True. You've also got multiple prompt templates there: first you add the [INST] etc. tags to the prompt string, then you're adding them again when encoding it.
You can set max_seq_len in the config right after config.prepare() but before model.load(). The default value is whatever is specified in the model's config.json.
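Roughly, the ordering looks like this (a sketch with a placeholder model path and an arbitrary 8192-token limit, not values from this thread):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder
config.prepare()                     # reads the defaults from config.json
config.max_seq_len = 8192            # override AFTER prepare(), BEFORE loading

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)  # cache is sized from config.max_seq_len
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# If the prompt text itself contains <s>, encode it as a special token and
# don't also prepend a BOS token:
ids = tokenizer.encode("<s>[INST] ... [/INST]", encode_special_tokens = True)
```

Since the cache takes its size from the config, the override has to happen before the cache and model are created.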
import time
import torch

from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2Sampler, ExLlamaV2StreamingGenerator


class MixtralTextGeneration():

    def __init__(self):
        model_directory = "/data/Prasanthi/text-generation-webui/models/turboderp_Mixtral-8x7B-instruct-exl2_3.0bpw"
        config = ExLlamaV2Config()
        config.model_dir = model_directory
        config.prepare()

        # Initialize model and cache
        self.model = ExLlamaV2(config)
        self.cache = ExLlamaV2Cache(self.model, lazy = True)
        self.model.load_autosplit(self.cache)
        self.tokenizer = ExLlamaV2Tokenizer(config)

        # Initialize generator
        self.generator = ExLlamaV2StreamingGenerator(self.model, self.cache, self.tokenizer)
        self.generator.set_stop_conditions([self.tokenizer.eos_token_id])
        self.gen_settings = ExLlamaV2Sampler.Settings()

    def generate_text(self, prompt):
        prompt = prompt.strip()
        # print(prompt)
        instruction_ids = self.tokenizer.encode(f"[INST] {prompt} [/INST]", add_bos = True)
        context_ids = instruction_ids if self.generator.sequence_ids is None \
            else torch.cat([self.generator.sequence_ids, instruction_ids], dim = -1)
        start_time = time.time()
        self.generator.begin_stream(context_ids, self.gen_settings)
        display_text = ''
        while True:
            chunk, eos, _ = self.generator.stream()
            if eos: break
            # print(chunk, end = "")
            display_text += chunk
            # sys.stdout.flush()
        end_time = time.time()
        print(display_text)
        answer = {}
        answer['response'] = str(display_text).strip()
        answer['time'] = round(end_time - start_time, 2)
        return answer


load_mixtral = MixtralTextGeneration()
This is my code and I am facing the same error:

Traceback (most recent call last):
  File "/data/Prasanthi/myenv/lib/python3.11/site-packages/tornado/web.py", line 1784, in _execute
    result = method(*self.path_args, **self.path_kwargs)
  File "/data/Prasanthi/mixtral_deploy.py", line 76, in post
    response=load_mixtral.generate_text(input)
  File "/data/Prasanthi/mixtral_deploy.py", line 50, in generate_text
    self.generator.begin_stream(context_ids, self.gen_settings)
  File "/data/Prasanthi/myenv/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 88, in begin_stream
    self._gen_begin_reuse(input_ids, gen_settings)
  File "/data/Prasanthi/myenv/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 283, in _gen_begin_reuse
    if reuse < in_tokens.shape[-1]: self._gen_feed_tokens(in_tokens[:, reuse:], gen_settings)
  File "/data/Prasanthi/myenv/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 299, in _gen_feed_tokens
    self.model.forward(self.sequence_ids[:, start : -1], self.cache, preprocess_only = True, loras = self.active_loras)
  File "/data/Prasanthi/myenv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/Prasanthi/myenv/lib/python3.11/site-packages/exllamav2/model.py", line 557, in forward
    assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
AssertionError: Total sequence length exceeds cache size in model.forward

Could you please help me resolve this error?
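One thing that stands out in the snippet above (an observation, not a confirmed diagnosis): context_ids is built by concatenating self.generator.sequence_ids from the previous request, so the context fed to begin_stream grows with every call until it no longer fits in the cache. A hedged sketch of clamping it, where reserve is a placeholder for the number of new tokens expected:

```python
reserve = 512  # placeholder: however many new tokens you expect to generate
max_ctx = self.cache.max_seq_len - reserve
if context_ids.shape[-1] > max_ctx:
    # Keep only the most recent tokens that still fit in the cache.
    context_ids = context_ids[:, -max_ctx:]
self.generator.begin_stream(context_ids, self.gen_settings)
```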
I think this issue may have something to do with an uninitialized config cache. I can reliably reproduce it by loading (in oobabooga, using the 4-bit cache) FluffyKaeloky/Midnight-Miqu-103B-v1.5-exl2-3.0bpw-rpcal as the very first model after webui startup. AFAICT, it reliably happens as soon as the input size reaches 2K; for that model the limit should have been 32764.
However, if I first load and then unload a different model (Dracones/Midnight-Miqu-70B-v1.5_exl2_4.0bpw, also using the 4-bit cache), subsequently loading FluffyKaeloky/Midnight-Miqu-103B-v1.5-exl2-3.0bpw-rpcal works fine. Loading the other model apparently puts the right value in cache.max_seq_len, and everything works up to the full 32K context.
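For anyone who wants to rule out the uninitialized-size theory outside of ooba, the cache size can also be pinned explicitly when the cache is constructed (a sketch, assuming exllamav2's ExLlamaV2Cache_Q4 accepts the same max_seq_len/lazy arguments as ExLlamaV2Cache; this is not an ooba setting):

```python
from exllamav2 import ExLlamaV2Cache_Q4

# Pin the 4-bit cache to the model's full context instead of relying on
# whatever default gets picked up at load time (32764 for the model above).
cache = ExLlamaV2Cache_Q4(model, max_seq_len = config.max_seq_len, lazy = True)
model.load_autosplit(cache)
```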
Can confirm
I am seeing the same issue. Another side effect is that when loading a quantized model in ooba, it always identifies the model as having a 2k truncation_length regardless of the actual model parameters.
This is my code. The input IDs are shorter than the max seq len of 4096.