guidance-ai / guidance

A guidance language for controlling large language models.

[Phi3.5 Mini] Eval and text generation takes several times longer than with Gemma2 2B It #996

Closed Rybens92 closed 2 months ago

Rybens92 commented 2 months ago

The bug
When using Phi3.5 Mini, eval and text generation take several times longer than with Gemma2 2B It, even though they are similarly small models. On top of that, I am seeing UserWarning: Self-consistency check in _cleanup_tokens() failed. I don't know whether this makes a difference.

To Reproduce
Use system(), user(), assistant() blocks with Phi? Not sure. The code I am running:

def run(self, input):

        stop_words = ["<reasoning>", "<code>", "<code_output>", "<step",
                    "</reasoning>", "</code>", "</code_output>", "</step",
                    "<|end|>", "<|eot_id|>", "<|end_of_turn|>", "<|im_end|>",
                    "<end_of_turn>",]

        tools_names = [tool.name for tool in self.tools]
        prompt = self.prompt.format(tools_doc=render_text_description_and_args(self.tools))

        outputs_to_return = []
        try:
            with system():
                llm = self.model + prompt
            with user():
                llm += input
        except Exception:
            # Fallback when the chat template has no system role
            with user():
                llm = self.model + prompt + "\n\nUser: " + input

        for i in range(self.max_steps):
            step = i + 1
            with assistant():
                llm += f"<step{step}>"
                llm += "<reasoning>\n"
                with block("reasoning"):
                    llm += gen(stop=stop_words, max_tokens=300, temperature=0.1)
                    llm += "\nAfter further consideration, the tool I should use now is " + \
                            gen(stop=stop_words + ["\n"], max_tokens=100, temperature=0)
                llm += "</reasoning>\n"

                print("Reasoning: ", llm["reasoning"])

                llm += "<code>\n"
                with block("code"):
                    llm += "# When I use tools, I have to follow strictly the description of their inputs\n"

                    llm += select(tools_names, name="tool_choice")

                    tool_to_use = llm["tool_choice"].strip()
                    tool_to_use = [tool for tool in self.tools if tool.name == tool_to_use][0]
                    tool_args_names = [key for key in tool_to_use.args.keys()]
                    tool_args_types = [value["type"] for value in tool_to_use.args.values()]
                    stops = [",", "\n"]
                    llm += "("

                    # Use a separate index so the outer step counter i is not shadowed
                    for arg_idx, arg_name in enumerate(tool_args_names):
                        llm += f"{arg_name}="
                        if "str" in tool_args_types[arg_idx]:
                            llm += '"' + gen(stop=['"', "\n"], max_tokens=100, temperature=0) + '"'
                        else:
                            llm += gen(stop=stops, max_tokens=100, temperature=0.1)

                        if arg_idx < len(tool_args_names) - 1:
                            llm += ", "

                    llm += ")\n"

                llm += "</code>\n"
                llm += f"</step{step}>\n"

                print("Code: ", llm["code"])

            if llm["tool_choice"].strip() == "final_answer":
                break

            with user():
                llm += "<tool_output>\n"
                output = self.python_env.run_code(llm["code"])["output"]
                outputs_to_return.append(output)
                llm += output
                llm += "</tool_output>\n"

                print("Output: ", output)

        return outputs_to_return

# Imports assumed by the snippets above (guidance chat/grammar helpers and
# llama-cpp-python's prompt-lookup speculative decoding)
import guidance
from guidance import system, user, assistant, block, gen, select
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

chat_phi3 = guidance.chat.Phi3MiniChatTemplate

#model_name = "gemma-2-2b-it-Q8_0.gguf"; chat_tmplt = chat_gemma
model_name = "Phi-3.5-mini-instruct-Q6_K.gguf"; chat_tmplt = chat_phi3
model = guidance.models.LlamaCpp(model_path, echo=False,  # model_path points at the local .gguf file
                         n_ctx=8000, chat_template=chat_tmplt, verbose=True,
                         draft_model=LlamaPromptLookupDecoding(num_pred_tokens=2),
                         repeat_penalty=1.0,
                         )

Answering the same question takes:
Gemma2 2B It - 162.68 seconds
Phi3.5 Mini - 887.36 seconds
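For reference, the timings above are wall-clock measurements around the whole agent call. A minimal sketch of how such a number can be taken (the agent object here is only illustrative, it stands in for whatever class holds the run() method above):

import time

start = time.perf_counter()
outputs = agent.run("the same question posed to both models")  # agent is hypothetical
print(f"Elapsed: {time.perf_counter() - start:.2f} seconds")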

Is guidance reprocessing the whole prompt again whenever the self-consistency check warning occurs?

System info (please complete the following information):

Harsha-Nori commented 2 months ago

Hi @Rybens92, thank you so much for reporting this! You're exactly right -- a failed self-consistency check means we're re-tokenizing the input (and re-filling the KV cache) as we template text. Phi-3's tokenizer is incredibly odd in its handling of whitespace characters: re-tokenizing the same text (text -> tokens -> text -> tokens) produces different tokenizations each time. I thought we patched this, but it might not have caught all the edge cases.
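As a quick illustration of that round-trip instability, the text -> tokens -> text -> tokens cycle can be checked directly. This is only a minimal sketch assuming the Hugging Face tokenizer for microsoft/Phi-3.5-mini-instruct; it is not guidance's internal _cleanup_tokens() check:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

text = "<reasoning>\n After further consideration, the tool I should use now is "
ids_first = tok.encode(text, add_special_tokens=False)
decoded = tok.decode(ids_first)
ids_second = tok.encode(decoded, add_special_tokens=False)

# A round-trip-stable tokenizer would produce identical token IDs both times;
# with Phi-3's whitespace handling the two lists can differ.
print("round-trip stable:", ids_first == ids_second)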

Could I ask a few questions?

1) Does this still happen on the release candidate of guidance with our new parser? You can install this with:

pip install guidance --pre -U

and you should see guidance version 0.2.0rc1 being installed.
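For example, to confirm which version actually gets imported (assuming a standard install that exposes the package's __version__ attribute):

python -c "import guidance; print(guidance.__version__)"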

2) Where did you pull the .gguf files for Phi-3.5-mini and Gemma 2 2B-it? I'd like to make sure I'm using the same GGUFs when debugging on our side.

Rybens92 commented 2 months ago

Thank you! After installing the pre-release version, the generation time is 332.99 seconds. That is much better, and the UserWarning no longer occurs!

As for your second question, I use quants from @bartowski, and for small models I go with Q8 or Q6.

Thanks again for your advice and for your help!