run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Documentation]: Llama3-Instruct HuggingFaceLLM docs should specify it is a chat model. #14516

Open jdwh08 opened 1 week ago

jdwh08 commented 1 week ago

Documentation Issue Description

Hello world!

The Llama3 Cookbook for Llama3-8b-Instruct should specify that Instruct is a chat model; this is particularly important for ResponseSynthesizer/QueryEngine use cases to avoid missing end-of-turn tokens and long prompt overruns.

Currently, we have:

llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,  # comment this line and uncomment below to use 4bit
        # "quantization_config": quantization_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
    },
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
    # missing 'is_chat_model'=True
)

We can see the issue with not specifying the chat format when we run

response = llm.complete("Who is Paul Graham?")
print(response)

which outputs

Paul Graham is an American entrepreneur, venture capitalist, and author. He is the co-founder of the venture capital firm Y Combinator, which has backed companies such as Airbnb, Dropbox, and Reddit...
What is Y Combinator? Y Combinator is a venture capital firm that provides seed funding and support to early-stage startups. [continued for several more sections...]

Notice the huge prompt overrun in the output: the model starts generating new questions that are less relevant to the initial user prompt (What is Y Combinator?, What are some of the companies that Y Combinator has backed?) and keeps answering them beyond the original prompt.
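For comparison, routing the same question through the chat interface instead looks roughly like this (a sketch only; it assumes the HuggingFaceLLM chat path applies the tokenizer's chat template, so the response terminates at the <|eot_id|> token rather than running on):

from llama_index.core.llms import ChatMessage

# Same question, but sent as a chat message so the Llama3-Instruct
# header / end-of-turn structure is applied to the prompt.
response = llm.chat([ChatMessage(role="user", content="Who is Paul Graham?")])
print(response)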

Why is this the case? Recall that Llama3-8b-Instruct has a special format (https://github.com/meta-llama/llama-recipes), of the form

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|><|start_header_id|>user<|end_header_id|>

This special format strongly suggests we should use the chat format from the model's tokenizer (i.e., apply_chat_template).
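As a quick sketch of what that looks like with the standard transformers tokenizer (hf_token as in the cookbook):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", token=hf_token
)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who is Paul Graham?"},
]
# Renders the <|start_header_id|>...<|eot_id|> structure shown above and
# ends with an open assistant header so the model knows it is its turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)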

Why does this matter? Llama3-8B-Instruct is currently an important model because it fills the niche of being high quality, small, reputable, and instruction-tuned.

When we use Llama3-8B in a ResponseSynthesizer as part of a QueryEngine, e.g., Refine (https://github.com/run-llama/llama_index/blob/ff235407bd32a6ae82e55a9edf73a099cb132459/llama-index-core/llama_index/core/response_synthesizers/refine.py#L202),

it runs get_program_for_llm, which I believe eventually returns an LLMTextCompletionProgram, which in turn leads to this choice: (https://github.com/run-llama/llama_index/blob/ff235407bd32a6ae82e55a9edf73a099cb132459/llama-index-core/llama_index/core/program/llm_program.py#L90)

We probably want to be sure that LlamaIndex runs any queries through the .chat process, not the .complete process, so that the AutoTokenizer chat format is applied and the <|eot_id|> token is actually generated; otherwise, a badly formatted input leads to major prompt overruns.
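Paraphrasing the branch at that linked line (a sketch for illustration only, not the actual source; see llm_program.py for the real code), the LLM's metadata flag is what decides which path gets taken:

# Paraphrased sketch of the decision inside LLMTextCompletionProgram.
# `prompt` and `prompt_args` are placeholders for the program's template and inputs.
if llm.metadata.is_chat_model:
    # chat path: the tokenizer chat template is applied, <|eot_id|> gets emitted
    output = llm.chat(prompt.format_messages(**prompt_args))
else:
    # completion path: a raw string is sent, and Llama3-Instruct keeps generating
    output = llm.complete(prompt.format(**prompt_args))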

This prompt-overrun behavior was a major cause of suboptimal output in our RAG system built on LlamaIndex and Llama3-8B-Instruct.
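Concretely, the change being requested in the cookbook is roughly the following (a sketch only, assuming HuggingFaceLLM accepts is_chat_model as a constructor argument; everything else is unchanged from the current example):

llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
    },
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
    is_chat_model=True,  # tells downstream components to use the chat path
)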

Thanks! - jdwh08

Documentation Link

https://docs.llamaindex.ai/en/stable/examples/cookbooks/llama3_cookbook/

dosubot[bot] commented 1 week ago

Hello there, @jdwh08! I'm here to help you with any bugs, questions, or contributions you might have. Let's tackle this together!

The Llama3 Cookbook for Llama3-8B-Instruct can indeed specify that Instruct is a chat model and include the necessary configuration to avoid prompt overruns and missing end-of-turn tokens. The provided code includes functions like messages_to_prompt_v3_instruct and completion_to_prompt_v3_instruct that handle the conversion of chat messages and completion instructions to the Llama 3 Instruct format, ensuring proper formatting and inclusion of end-of-turn tokens.

Here is the relevant code:

from typing import List, Optional, Sequence

from llama_index.core.llms import ChatMessage, MessageRole

# Fallback system prompt used when the messages do not include one
# (placeholder text; substitute your own or the library default).
DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

HEADER_SYS = "<|start_header_id|>system<|end_header_id|>\n\n"
HEADER_USER = "<|start_header_id|>user<|end_header_id|>\n\n"
HEADER_ASSIST = "<|start_header_id|>assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>\n"

def messages_to_prompt_v3_instruct(
    messages: Sequence[ChatMessage], system_prompt: Optional[str] = None
) -> str:
    """
    Convert a sequence of chat messages to Llama 3 Instruct format.

    Reference: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

    Note: `<|begin_of_text|>` is not needed as Llama.cpp appears to add it already.
    """
    string_messages: List[str] = []
    if messages[0].role == MessageRole.SYSTEM:
        # pull out the system message (if it exists in messages)
        system_message_str = messages[0].content or ""
        messages = messages[1:]
    else:
        system_message_str = system_prompt or DEFAULT_SYSTEM_PROMPT

    # make sure system prompt is included at the start
    string_messages.append(f"{HEADER_SYS}{system_message_str.strip()}{EOT}")

    for i in range(0, len(messages), 2):
        # first message should always be a user
        user_message = messages[i]
        assert user_message.role == MessageRole.USER
        # include user message content
        str_message = f"{HEADER_USER}{user_message.content}{EOT}"

        if len(messages) > (i + 1):
            # if assistant message exists, add to str_message
            assistant_message = messages[i + 1]
            assert assistant_message.role == MessageRole.ASSISTANT
            str_message += f"{HEADER_ASSIST}{assistant_message.content}{EOT}"

        string_messages.append(str_message)

    # prompt the LLM to begin its response
    string_messages.append(HEADER_ASSIST)

    return "".join(string_messages)

def completion_to_prompt_v3_instruct(
    completion: str, system_prompt: Optional[str] = None
) -> str:
    """
    Convert completion instruction string to Llama 3 Instruct format.

    Reference: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

    Note: `<|begin_of_text|>` is not needed as Llama.cpp appears to add it already.
    """
    system_prompt_str = system_prompt or DEFAULT_SYSTEM_PROMPT

    return (
        f"{HEADER_SYS}{system_prompt_str.strip()}{EOT}"
        f"{HEADER_USER}{completion.strip()}{EOT}"
        f"{HEADER_ASSIST}"
    )

This code ensures that the system prompt, user messages, and assistant messages are properly formatted and include the necessary end-of-turn tokens (<|eot_id|>) [1].
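You could wire these helpers into the cookbook's HuggingFaceLLM setup roughly like this (a sketch; it assumes the messages_to_prompt / completion_to_prompt hooks accept callables with the signatures above):

llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={"token": hf_token, "torch_dtype": torch.bfloat16},
    tokenizer_kwargs={"token": hf_token},
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
    is_chat_model=True,
    stopping_ids=stopping_ids,
)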

To continue talking to Dosu, mention @dosu.