run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Documentation]: Llama3-Instruct HuggingFaceLLM docs should specify it is a chat model. #14516

Open jdwh08 opened 1 week ago

jdwh08 commented 1 week ago

Documentation Issue Description

Hello world!

The Llama3 Cookbook for Llama3-8b-Instruct should specify that Instruct is a chat model; this is particularly important for ResponseSynthesizer/QueryEngine use cases to avoid missing end-of-turn tokens and long prompt overruns.

Currently, we have:

llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,  # comment this line and uncomment below to use 4bit
        # "quantization_config": quantization_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
    },
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
    # missing 'is_chat_model'=True
)

We can see the issue with not specifying the chat format when we run

response = llm.complete("Who is Paul Graham?")
print(response)

which outputs

Paul Graham is an American entrepreneur, venture capitalist, and author. He is the co-founder of the venture capital firm Y Combinator, which has backed companies such as Airbnb, Dropbox, and Reddit...
What is Y Combinator? Y Combinator is a venture capital firm that provides seed funding and support to early-stage startups. [continued for several more sections...]

Notice the huge prompt overrun in the output: the model starts generating new questions that are less relevant to the initial user prompt (What is Y Combinator?, What are some of the companies that Y Combinator has backed?) and keeps answering them beyond the original prompt.
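For comparison, routing the same question through the chat interface instead looks roughly like this (a sketch only; it assumes the HuggingFaceLLM chat path applies the tokenizer's chat template, so the response terminates at the <|eot_id|> token rather than running on):

from llama_index.core.llms import ChatMessage

# Same question, but sent as a chat message so the Llama3-Instruct
# header / end-of-turn structure is applied to the prompt.
response = llm.chat([ChatMessage(role="user", content="Who is Paul Graham?")])
print(response)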

Why is this the case? Recall that Llama3-8b-Instruct has a special format (https://github.com/meta-llama/llama-recipes), of the form

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|><|start_header_id|>user<|end_header_id|>

This special format strongly suggests we should use the chat format from the model's tokenizer (i.e., apply_chat_template).
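As a quick sketch of what that looks like with the standard transformers tokenizer (hf_token as in the cookbook):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", token=hf_token
)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who is Paul Graham?"},
]
# Renders the <|start_header_id|>...<|eot_id|> structure shown above and
# ends with an open assistant header so the model knows it is its turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)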

Why does this matter? Llama3-8B-Instruct is currently an important model because it fills the niche of being high quality, small, reputable, and instruction-tuned.

When we use Llama3-8B in a ResponseSynthesizer as part of a QueryEngine, e.g., Refine (https://github.com/run-llama/llama_index/blob/ff235407bd32a6ae82e55a9edf73a099cb132459/llama-index-core/llama_index/core/response_synthesizers/refine.py#L202),

it runs get_program_for_llm, which I believe eventually returns an LLMTextCompletionProgram, which in turn leads to this choice: (https://github.com/run-llama/llama_index/blob/ff235407bd32a6ae82e55a9edf73a099cb132459/llama-index-core/llama_index/core/program/llm_program.py#L90)

We probably want to be sure that LlamaIndex runs any queries through the .chat process, not the .complete process, so that the AutoTokenizer chat format is applied and the <|eot_id|> token is actually generated; otherwise, a badly formatted input leads to major prompt overruns.
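Paraphrasing the branch at that linked line (a sketch for illustration only, not the actual source; see llm_program.py for the real code), the LLM's metadata flag is what decides which path gets taken:

# Paraphrased sketch of the decision inside LLMTextCompletionProgram.
# `prompt` and `prompt_args` are placeholders for the program's template and inputs.
if llm.metadata.is_chat_model:
    # chat path: the tokenizer chat template is applied, <|eot_id|> gets emitted
    output = llm.chat(prompt.format_messages(**prompt_args))
else:
    # completion path: a raw string is sent, and Llama3-Instruct keeps generating
    output = llm.complete(prompt.format(**prompt_args))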

This prompt-overrun behavior was a major cause of suboptimal output in our RAG system built on LlamaIndex and Llama3-8B-Instruct.
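Concretely, the change being requested in the cookbook is roughly the following (a sketch only, assuming HuggingFaceLLM accepts is_chat_model as a constructor argument; everything else is unchanged from the current example):

llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
    },
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
    is_chat_model=True,  # tells downstream components to use the chat path
)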

Thanks! - jdwh08

Documentation Link

https://docs.llamaindex.ai/en/stable/examples/cookbooks/llama3_cookbook/

dosubot[bot] commented 1 week ago

Hello there, @jdwh08! I'm here to help you with any bugs, questions, or contributions you might have. Let's tackle this together!

The Llama3 Cookbook for Llama3-8B-Instruct can indeed specify that Instruct is a chat model and include the necessary configuration to avoid prompt overruns and missing end-of-turn tokens. The provided code includes functions like messages_to_prompt_v3_instruct and completion_to_prompt_v3_instruct that handle the conversion of chat messages and completion instructions to the Llama 3 Instruct format, ensuring proper formatting and inclusion of end-of-turn tokens.

Here is the relevant code:

from typing import List, Optional, Sequence

from llama_index.core.llms import ChatMessage, MessageRole

# Fallback system prompt used when the messages do not include one
# (placeholder text; substitute your own or the library default).
DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

HEADER_SYS = "<|start_header_id|>system<|end_header_id|>\n\n"
HEADER_USER = "<|start_header_id|>user<|end_header_id|>\n\n"
HEADER_ASSIST = "<|start_header_id|>assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>\n"

def messages_to_prompt_v3_instruct(
    messages: Sequence[ChatMessage], system_prompt: Optional[str] = None
) -> str:
    """
    Convert a sequence of chat messages to Llama 3 Instruct format.

    Reference: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

    Note: `<|begin_of_text|>` is not needed as Llama.cpp appears to add it already.
    """
    string_messages: List[str] = []
    if messages[0].role == MessageRole.SYSTEM:
        # pull out the system message (if it exists in messages)
        system_message_str = messages[0].content or ""
        messages = messages[1:]
    else:
        system_message_str = system_prompt or DEFAULT_SYSTEM_PROMPT

    # make sure system prompt is included at the start
    string_messages.append(f"{HEADER_SYS}{system_message_str.strip()}{EOT}")

    for i in range(0, len(messages), 2):
        # first message should always be a user
        user_message = messages[i]
        assert user_message.role == MessageRole.USER
        # include user message content
        str_message = f"{HEADER_USER}{user_message.content}{EOT}"

        if len(messages) > (i + 1):
            # if assistant message exists, add to str_message
            assistant_message = messages[i + 1]
            assert assistant_message.role == MessageRole.ASSISTANT
            str_message += f"{HEADER_ASSIST}{assistant_message.content}{EOT}"

        string_messages.append(str_message)

    # prompt the LLM to begin its response
    string_messages.append(HEADER_ASSIST)

    return "".join(string_messages)

def completion_to_prompt_v3_instruct(
    completion: str, system_prompt: Optional[str] = None
) -> str:
    """
    Convert completion instruction string to Llama 3 Instruct format.

    Reference: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

    Note: `<|begin_of_text|>` is not needed as Llama.cpp appears to add it already.
    """
    system_prompt_str = system_prompt or DEFAULT_SYSTEM_PROMPT

    return (
        f"{HEADER_SYS}{system_prompt_str.strip()}{EOT}"
        f"{HEADER_USER}{completion.strip()}{EOT}"
        f"{HEADER_ASSIST}"
    )

This code ensures that the system prompt, user messages, and assistant messages are properly formatted and include the necessary end-of-turn tokens (<|eot_id|>) [1].
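You could wire these helpers into the cookbook's HuggingFaceLLM setup roughly like this (a sketch; it assumes the messages_to_prompt / completion_to_prompt hooks accept callables with the signatures above):

llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={"token": hf_token, "torch_dtype": torch.bfloat16},
    tokenizer_kwargs={"token": hf_token},
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
    is_chat_model=True,
    stopping_ids=stopping_ids,
)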

To continue talking to Dosu, mention @dosu.