jdwh08 opened 1 week ago
Hello there, @jdwh08! I'm here to help you with any bugs, questions, or contributions you might have. Let's tackle this together!
The Llama3 Cookbook for Llama3-8B-Instruct can indeed specify that Instruct is a chat model and include the configuration needed to avoid prompt overruns and missing end-of-turn tokens. The provided code includes the functions `messages_to_prompt_v3_instruct` and `completion_to_prompt_v3_instruct`, which convert chat messages and completion instructions to the Llama 3 Instruct format, ensuring proper formatting and inclusion of end-of-turn tokens.
Here is the relevant code:
from typing import List, Optional, Sequence

from llama_index.core.llms import ChatMessage, MessageRole

# DEFAULT_SYSTEM_PROMPT is assumed to be defined earlier in the cookbook.

HEADER_SYS = "<|start_header_id|>system<|end_header_id|>\n\n"
HEADER_USER = "<|start_header_id|>user<|end_header_id|>\n\n"
HEADER_ASSIST = "<|start_header_id|>assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>\n"


def messages_to_prompt_v3_instruct(
    messages: Sequence[ChatMessage], system_prompt: Optional[str] = None
) -> str:
    """
    Convert a sequence of chat messages to Llama 3 Instruct format.

    Reference: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
    Note: `<|begin_of_text|>` is not needed, as Llama.cpp appears to add it already.
    """
    string_messages: List[str] = []
    if messages and messages[0].role == MessageRole.SYSTEM:
        # pull out the system message (if it exists in messages)
        system_message_str = messages[0].content or ""
        messages = messages[1:]
    else:
        system_message_str = system_prompt or DEFAULT_SYSTEM_PROMPT

    # make sure the system prompt is included at the start
    string_messages.append(f"{HEADER_SYS}{system_message_str.strip()}{EOT}")

    for i in range(0, len(messages), 2):
        # the first message of each pair should always be from the user
        user_message = messages[i]
        assert user_message.role == MessageRole.USER

        # include user message content
        str_message = f"{HEADER_USER}{user_message.content}{EOT}"

        if len(messages) > (i + 1):
            # if an assistant message exists, add it to str_message
            assistant_message = messages[i + 1]
            assert assistant_message.role == MessageRole.ASSISTANT
            str_message += f"{HEADER_ASSIST}{assistant_message.content}{EOT}"

        string_messages.append(str_message)

    # prompt the LLM to begin its response
    string_messages.append(HEADER_ASSIST)
    return "".join(string_messages)


def completion_to_prompt_v3_instruct(
    completion: str, system_prompt: Optional[str] = None
) -> str:
    """
    Convert a completion instruction string to Llama 3 Instruct format.

    Reference: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
    Note: `<|begin_of_text|>` is not needed, as Llama.cpp appears to add it already.
    """
    system_prompt_str = system_prompt or DEFAULT_SYSTEM_PROMPT
    return (
        f"{HEADER_SYS}{system_prompt_str.strip()}{EOT}"
        f"{HEADER_USER}{completion.strip()}{EOT}"
        f"{HEADER_ASSIST}"
    )
This code ensures that the system prompt, user messages, and assistant messages are properly formatted and include the necessary end-of-turn token (`<|eot_id|>`) [1].
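As a quick sanity check, the helpers above can be exercised without a model. The sketch below re-implements the completion path as a standalone function (same header and end-of-turn constants as the cookbook snippet, but no llama_index dependency, and with the system prompt passed explicitly) and prints the prompt it produces:

```python
# Standalone sketch of the completion-to-prompt conversion; the header/token
# strings mirror the constants defined in the cookbook snippet above.
HEADER_SYS = "<|start_header_id|>system<|end_header_id|>\n\n"
HEADER_USER = "<|start_header_id|>user<|end_header_id|>\n\n"
HEADER_ASSIST = "<|start_header_id|>assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>\n"


def completion_to_prompt(completion: str, system_prompt: str) -> str:
    """Wrap a bare completion instruction in the Llama 3 Instruct layout."""
    return (
        f"{HEADER_SYS}{system_prompt.strip()}{EOT}"
        f"{HEADER_USER}{completion.strip()}{EOT}"
        f"{HEADER_ASSIST}"
    )


prompt = completion_to_prompt("What is Y Combinator?", "You are a helpful assistant.")
print(prompt)
```

The resulting string ends with the assistant header, which cues the model to begin its turn, and every completed turn is closed by `<|eot_id|>`.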
Documentation Issue Description
Hello world!
The Llama3 Cookbook for Llama3-8b-Instruct should specify that Instruct is a chat model; this is particularly important for ResponseSynthesizer/QueryEngine use cases to avoid missing end-of-turn tokens and long prompt overruns.
Currently, we have:
We can see the issues of not specifying the chat format when we run
which outputs
Notice how this results in a huge prompt overrun in the output, where the model starts generating new questions that are less relevant to the initial user prompt (What is Y Combinator, What are some of the companies that Y Combinator has backed?) and continues answering beyond the original prompt.
Why is this the case? Recall that Llama3-8B-Instruct has a special prompt format (https://github.com/meta-llama/llama-recipes), of the form
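For reference, that layout can be reconstructed from the special tokens in Meta's model card (the message contents below are hypothetical placeholders):

```python
# Reconstruction of the Llama 3 Instruct prompt layout, assembled from the
# special tokens documented in Meta's model card. Message contents are
# placeholders for illustration only.
LLAMA3_PROMPT_TEMPLATE = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\n"
    "{system_prompt}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "{user_message}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

print(LLAMA3_PROMPT_TEMPLATE.format(
    system_prompt="You are a helpful assistant.",
    user_message="What is Y Combinator?",
))
```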
This special format strongly suggests we should use the chat format from the model's tokenizer (i.e., `apply_chat_template`).

Why does this matter? Llama3-8B-Instruct is currently an important model because it fills the niche of being high-quality, small, reputable, and instruction-tuned.
When we use Llama3-8B-Instruct in a ResponseSynthesizer as part of a QueryEngine, e.g., Refine (https://github.com/run-llama/llama_index/blob/ff235407bd32a6ae82e55a9edf73a099cb132459/llama-index-core/llama_index/core/response_synthesizers/refine.py#L202), it runs `get_program_for_llm`, which I believe eventually gets an `LLMTextCompletionProgram`, which eventually leads to this choice: (https://github.com/run-llama/llama_index/blob/ff235407bd32a6ae82e55a9edf73a099cb132459/llama-index-core/llama_index/core/program/llm_program.py#L90)

We probably want to be sure that LlamaIndex is running any queries through the `.chat` process, and not the `.complete` process, so that the AutoTokenizer chat format is applied and there are no major prompt overruns caused by the `<|eot_id|>` token failing to generate from a badly formatted input. This was a major cause of suboptimal output due to prompt overruns in our RAG system based on LlamaIndex/Llama3-8B-Instruct.
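One way to guard against this is to wire the format converters into the LLM at construction time, so both the chat and completion paths emit correctly delimited Llama 3 prompts. This is a configuration sketch assuming the `LlamaCPP` integration used in the cookbook and the `messages_to_prompt_v3_instruct`/`completion_to_prompt_v3_instruct` helpers defined above; the model path is a hypothetical placeholder:

```python
from llama_index.llms.llama_cpp import LlamaCPP

# Hypothetical local GGUF path; substitute your own model file.
llm = LlamaCPP(
    model_path="./llama-3-8b-instruct.Q4_K_M.gguf",
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)
```

With these hooks in place, even queries routed through `.complete` get wrapped in the Instruct format, so the end-of-turn token is generated and prompt overruns are avoided.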
Thanks!
- jdwh08
Documentation Link
https://docs.llamaindex.ai/en/stable/examples/cookbooks/llama3_cookbook/