run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: use multi-step query and the output is weird. #12237

Closed lambda7xx closed 2 months ago

lambda7xx commented 5 months ago

Question Validation

Question

I use the Llama 2 paper dataset. My code snippet is below:

# Imports assume the llama_index 0.10+ package layout; adjust to your version.
import torch

from llama_index.core import PromptTemplate, Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.indices.query.query_transform import StepDecomposeQueryTransform
from llama_index.core.query_engine import MultiStepQueryEngine
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.1",
    query_wrapper_prompt=PromptTemplate("<s>[INST] {query_str} [/INST] </s>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"torch_dtype": torch.bfloat16},
    generate_kwargs={"temperature": 0.2, "top_k": 5, "top_p": 0.95},
    device_map="auto",
)

Settings.llm = llm

step_decompose_transform = StepDecomposeQueryTransform(llm=Settings.llm, verbose=True)

# Other datasets tried: data/llmama2_paper.json, ./data/survery/llm_survery_paper.json
documents = SimpleDirectoryReader("../data/llama2_paper/").load_data()

index = VectorStoreIndex.from_documents(documents=documents)

query_engine = index.as_query_engine()
query_engine = MultiStepQueryEngine(
    query_engine=query_engine,
    query_transform=step_decompose_transform,
)

response = query_engine.query(query)

My original query is: Based on the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models," what are the two primary objectives achieved in this work, and what is the range of parameters for the large language models developed?

I added print statements to llama_index/llms/huggingface/base.py:

    @llm_completion_callback()
    def complete(
        self, prompt: str, formatted: bool = False, **kwargs: Any
    ) -> CompletionResponse:
        """Completion endpoint."""
        full_prompt = prompt
        def getlen(s):
            return len(s.split())

        print(f"1 complete, len(prompt): {getlen(prompt)}, \n prompt:{prompt} ")
        import time 
        start = time.time()
        if not formatted:
            if self.query_wrapper_prompt:
                full_prompt = self.query_wrapper_prompt.format(query_str=prompt)
            if self.system_prompt:
                full_prompt = f"{self.system_prompt} {full_prompt}"

        inputs = self._tokenizer(full_prompt, return_tensors="pt")
        inputs = inputs.to(self._model.device)
        end = time.time()
        print("2 complete, the time of tokenizer: ", end-start)
        # remove keys from the tokenizer if needed, to avoid HF errors
        for key in self.tokenizer_outputs_to_remove:
            if key in inputs:
                inputs.pop(key, None)
        start = time.time()
        tokens = self._model.generate(
            **inputs,
            max_new_tokens=self.max_new_tokens,
            stopping_criteria=self._stopping_criteria,
            **self.generate_kwargs,
        )  # TODO: insert torch.cuda.nvtx range markers here for profiling
        end = time.time()
        print("3 complete, the time of model.generate: ", end-start)
        completion_tokens = tokens[0][inputs["input_ids"].size(1) :]
        completion = self._tokenizer.decode(completion_tokens, skip_special_tokens=True)
        print("4 complete, len(completion): ", getlen(completion), " and type(completion): ", type(completion),"\n the complete: ", completion, "\n\n\n")
        print("************finish complete function************\n\n")

        return CompletionResponse(text=completion, raw={"model_output": tokens})

Then I found there are 7 LLM calls: 3 weird LLM calls and 4 normal LLM calls. The log of a weird LLM call is below:

1 complete, len(prompt): 388, 
 prompt:<s>[INST] The original question is as follows: Based on the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models," what are the two primary objectives achieved in this work, and what is the range of parameters for the large language models developed?
We have an opportunity to answer some, or all of the question from a knowledge source. Context information for the knowledge source is provided below, as well as previous reasoning steps.
Given the context and previous reasoning, return a question that can be answered from the context. This question can be the same as the original question, or this question can represent a subcomponent of the overall question.It should not be irrelevant to the original question.
If we cannot extract more information from the context, provide 'None' as the answer. Some examples are given below: 

Question: How many Grand Slam titles does the winner of the 2020 Australian Open have?
Knowledge source context: Provides names of the winners of the 2020 Australian Open
Previous reasoning: None
Next question: Who was the winner of the 2020 Australian Open? 

Question: Who was the winner of the 2020 Australian Open?
Knowledge source context: Provides names of the winners of the 2020 Australian Open
Previous reasoning: None.
New question: Who was the winner of the 2020 Australian Open? 

Question: How many Grand Slam titles does the winner of the 2020 Australian Open have?
Knowledge source context: Provides information about the winners of the 2020 Australian Open
Previous reasoning:
- Who was the winner of the 2020 Australian Open? 
- The winner of the 2020 Australian Open was Novak Djokovic.
New question: None

Question: How many Grand Slam titles does the winner of the 2020 Australian Open have?
Knowledge source context: Provides information about the winners of the 2020 Australian Open - includes biographical information for each winner
Previous reasoning:
- Who was the winner of the 2020 Australian Open? 
- The winner of the 2020 Australian Open was Novak Djokovic.
New question: How many Grand Slam titles does Novak Djokovic have? 

Question: Based on the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models," what are the two primary objectives achieved in this work, and what is the range of parameters for the large language models developed?
Knowledge source context: None
Previous reasoning: None
New question:  [/INST] </s>

2 complete, the time of tokenizer:  0.001664876937866211
3 complete, the time of model.generate:  1.5412354469299316
4 complete, len(completion):  24  and type(completion):  <class 'str'> 
 the complete:  
What are the two primary objectives achieved in the development of the large language models in "Llama 2: Open Foundation and Fine-Tuned Chat Models"? 

For the weird LLM call, it seems the prompt is not related to the original query.

For the normal LLM calls, the log looks as expected and the prompt is related to the original query; it retrieves the data related to the original query. The log of a normal LLM call is below:

1 complete, len(prompt): 971, 
 prompt:<s>[INST] Context information is below.
---------------------
file_path: /home/ubuntu/uw-llama/multi-step-query/../data/llama2_paper/llama2_paper.json

The development and release of a collection of pretrained and fine-tuned large language models (LLMs) specifically optimized for dialogue use cases.\n2. The demonstration that these fine-tuned LLMs, referred to as Llama 2-Chat, outperform open-source chat models on most benchmarks tested and may be a suitable substitute for closed-source models, particularly in terms of helpfulness and safety based on human evaluations.\n\nThe range of parameters for the large language models developed in this work is from 7 billion to 70 billion parameters.",
            "reference_answer_by": {
                "model_name": "gpt-4-1106-preview",
                "type": "ai"
            }
        },
        {
            "query": "According to the context information provided, who are the corresponding authors for the paper on Llama 2, and what is the significance of their contribution to the project?",
            "query_by": {
                "model_name": "gpt-4-1106-preview",
                "type": "ai"
            },
            "reference_contexts": [
                "Llama 2 : Open Foundation and Fine-Tuned Chat Models\nHugo Touvron\u2217Louis Martin\u2020Kevin Stone\u2020\nPeter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\nPrajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\nGuillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\nCynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\nHakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev\nPunit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich\nYinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra\nIgor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi\nAlan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang\nRoss Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang\nAngela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic\nSergey Edunov Thomas Scialom\u2217\nGenAI, Meta\nAbstract\nIn this work, we develop and release Llama 2, a collection of pretrained and fine-tuned\nlarge language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.\nOur fine-tuned LLMs, called Llama 2-Chat , are optimized for dialogue use cases. Our\nmodels outperform open-source chat models on most benchmarks we tested, and based on\nourhumanevaluationsforhelpfulnessandsafety,maybeasuitablesubstituteforclosed-\nsource models. We provide a detailed description of our approach to fine-tuning and safety\nimprovements of Llama 2-Chat in order to enable the community to build on our work and\ncontribute to the responsible development of LLMs.\n\u2217Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com\n\u2020Second author\nContributions for all the authors can be found in Section A.1.arXiv:2307.09288v2  [cs.CL]  19 Jul 2023"
            ],
            "reference_answer": "According to the context information provided, the corresponding authors for the paper on Llama 2 are listed as Thomas Scialom and Hugo Touvron, with their emails provided in braces: {tscialom, htouvron}@meta.com. The significance of their contribution as corresponding authors typically includes leading the research project, overseeing the development of the work, coordinating the contributions of other authors, and handling the correspondence during the paper's submission, peer review, and publication process. They are often the point of contact for any questions or communications related to the paper.",
            "reference_answer_by": {
                "model_name": "gpt-4-1106-preview",
                "type": "ai"
            }
        },
        {
            "query": "Discuss the significance of the \"Reinforcement Learning with Human Feedback (RLHF)\" method described in section 3.2 of the \"llama2.pdf\" document.

file_path: /home/ubuntu/uw-llama/multi-step-query/../data/llama2_paper/llama2_paper.json

{
    "examples": [
        {
            "query": "Based on the abstract of \"Llama 2: Open Foundation and Fine-Tuned Chat Models,\" what are the two primary objectives achieved in this work, and what is the range of parameters for the large language models developed?",
            "query_by": {
                "model_name": "gpt-4-1106-preview",
                "type": "ai"
            },
            "reference_contexts": [
                "Llama 2 : Open Foundation and Fine-Tuned Chat Models\nHugo Touvron\u2217Louis Martin\u2020Kevin Stone\u2020\nPeter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\nPrajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\nGuillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\nCynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\nHakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev\nPunit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich\nYinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra\nIgor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi\nAlan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang\nRoss Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang\nAngela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic\nSergey Edunov Thomas Scialom\u2217\nGenAI, Meta\nAbstract\nIn this work, we develop and release Llama 2, a collection of pretrained and fine-tuned\nlarge language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.\nOur fine-tuned LLMs, called Llama 2-Chat , are optimized for dialogue use cases. Our\nmodels outperform open-source chat models on most benchmarks we tested, and based on\nourhumanevaluationsforhelpfulnessandsafety,maybeasuitablesubstituteforclosed-\nsource models. We provide a detailed description of our approach to fine-tuning and safety\nimprovements of Llama 2-Chat in order to enable the community to build on our work and\ncontribute to the responsible development of LLMs.\n\u2217Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com\n\u2020Second author\nContributions for all the authors can be found in Section A.1.arXiv:2307.09288v2  [cs.CL]  19 Jul 2023"
            ],
            "reference_answer": "The two primary objectives achieved in the work described in the abstract of \"Llama 2: Open Foundation and Fine-Tuned Chat Models\" are:\n\n1. The development and release of a collection of pretrained and fine-tuned large language models (LLMs) specifically optimized for dialogue use cases.\n2. The demonstration that these fine-tuned LLMs, referred to as Llama 2-Chat, outperform open-source chat models on most benchmarks tested and may be a suitable substitute for closed-source models, particularly in terms of helpfulness and safety based on human evaluations.\n\nThe range of parameters for the large language models developed in this work is from 7 billion to 70 billion parameters.",
            "reference_answer_by": {
                "model_name": "gpt-4-1106-preview",
                "type": "ai"
            }
        },
        {
            "query": "According to the context information provided, who are the corresponding authors for the paper on Llama 2, and what is the significance of their contribution to the project?
---------------------
Given the context information and not prior knowledge, answer the query.
Query: 
What are the two primary objectives achieved in the development of the large language models in "Llama 2: Open Foundation and Fine-Tuned Chat Models"?
Answer:  [/INST] </s>

2 complete, the time of tokenizer:  0.004996776580810547
3 complete, the time of model.generate:  8.022477865219116
4 complete, len(completion):  106 Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 and type(completion):  <class 'str'> 
 the complete:  
The two primary objectives achieved in the development of the large language models in "Llama 2: Open Foundation and Fine-Tuned Chat Models" are:

1. The development and release of a collection of pretrained and fine-tuned large language models (LLMs) specifically optimized for dialogue use cases.
2. The demonstration that these fine-tuned LLMs, referred to as Llama 2-Chat, outperform open-source chat models on most benchmarks tested and may be a suitable substitute for closed-source models, particularly in terms of helpfulness and safety based on human evaluations.

The range of parameters for the large language models developed in this work is from 7 billion to 70 billion parameters. 
lambda7xx commented 5 months ago

Why is there the weird LLM call?

dosubot[bot] commented 5 months ago

Hey @lambda7xx! Great to see you diving deep into the mysteries of LlamaIndex again. Hope you're doing well!

Based on the information you've provided, it seems like the "weird" LLM calls you're seeing are part of the multi-step query process. The LlamaIndex framework uses a MultiStepQueryEngine to handle complex queries that require multiple steps of reasoning or decomposition. This engine works by iteratively decomposing and processing each step of the query until a stopping condition is met.
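For reference, here is a minimal sketch of that wiring. The index_summary and num_steps keyword arguments are assumptions about the current MultiStepQueryEngine constructor (index_summary appears to default to the literal string "None", which is what shows up after "Knowledge source context:" in your logged prompt), and the import paths assume the 0.10+ package layout; please verify against your installed version.

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.indices.query.query_transform import StepDecomposeQueryTransform
from llama_index.core.query_engine import MultiStepQueryEngine

# Base single-step engine over the Llama 2 paper, as in your snippet.
documents = SimpleDirectoryReader("../data/llama2_paper/").load_data()
index = VectorStoreIndex.from_documents(documents=documents)
base_engine = index.as_query_engine()

# This transform is what generates the decompose prompts you logged.
step_decompose_transform = StepDecomposeQueryTransform(llm=Settings.llm, verbose=True)

# index_summary fills the "Knowledge source context:" slot of the decompose
# prompt, and num_steps caps the number of decompose/answer rounds.
multi_step_engine = MultiStepQueryEngine(
    query_engine=base_engine,
    query_transform=step_decompose_transform,
    index_summary="Abstract, author list, and metadata of the Llama 2 paper",
    num_steps=3,
)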

The "weird" prompts you're seeing are likely generated by the query_transform function, which is responsible for transforming an original query into a new query that can be answered with the given context and, if applicable, previous reasoning steps. This transformation process is crucial for generating prompts for Large Language Model (LLM) calls in a way that optimizes the use of available information and facilitates the extraction of relevant answers.

For example, the prompt you provided:

"The original question is as follows: Based on the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models," what are the two primary objectives achieved in this work, and what is the range of parameters for the large language models developed? We have an opportunity to answer some, or all of the question from a knowledge source. Context information for the knowledge source is provided below, as well as previous reasoning steps. Given the context and previous reasoning, return a question that can be answered from the context. This question can be the same as the original question, or this question can represent a subcomponent of the overall question. It should not be irrelevant to the original question. If we cannot extract more information from the context, provide 'None' as the answer. Some examples are given below: ..."

This is likely a result of the query_transform function decomposing the original query into a new query that can be answered with the given context and previous reasoning steps. The function is designed to iteratively refine queries based on the context and previous reasoning, enabling the generation of LLM prompts that are more likely to yield relevant and accurate answers.
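If you want to see that transformation in isolation, a rough sketch is below: it calls the transform directly and prints the sub-question it generates. The metadata keys "index_summary" and "prev_reasoning", and the callable signature taking a QueryBundle, are assumptions about how MultiStepQueryEngine invokes the transform; double-check them against your installed version.

from llama_index.core import Settings
from llama_index.core.indices.query.query_transform import StepDecomposeQueryTransform
from llama_index.core.schema import QueryBundle

transform = StepDecomposeQueryTransform(llm=Settings.llm, verbose=True)

original_query = (
    'Based on the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models," '
    "what are the two primary objectives achieved in this work, and what is the "
    "range of parameters for the large language models developed?"
)

# Assumed call signature: the transform is callable on a QueryBundle plus a
# metadata dict carrying the index summary and previous reasoning steps.
new_query = transform(
    QueryBundle(original_query),
    metadata={
        "index_summary": "Abstract and metadata of the Llama 2 paper",
        "prev_reasoning": "None",
    },
)
print(new_query.query_str)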

The "normal" LLM calls you're seeing are likely the result of the complete function in the base.py file of the LlamaIndex codebase. This function constructs a payload with the provided prompt, model details, and additional options. It then makes a POST request to the Ollama API's /api/generate endpoint with this payload. Upon receiving a successful response, it extracts the text from the response and returns a CompletionResponse object containing the text, the raw response, and any additional keyword arguments excluding "response".

If you're finding that the "weird" LLM calls are causing issues with your multi-step queries, you might consider reviewing and possibly refining the logic in the messages_to_prompt function to ensure it accurately constructs the context for multi-step queries. You could also experiment with different context_window sizes to find an optimal balance that allows the model to maintain context without being overwhelmed by irrelevant information. Lastly, you could evaluate the training data and capabilities of the "mistralai/Mistral-7B-Instruct-v0.1" model to ensure it aligns with the types of queries and the domain of the dataset you are working with.
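As a rough illustration of those last suggestions (a sketch, not a confirmed fix), the snippet below passes a custom messages_to_prompt callable and a smaller context_window to HuggingFaceLLM. The messages_to_prompt parameter and the Mistral-instruct style formatting shown here are assumptions; adapt them to your model and verify the parameter exists in your installed version.

import torch
from llama_index.core import PromptTemplate, Settings
from llama_index.llms.huggingface import HuggingFaceLLM

def messages_to_prompt(messages):
    # Example Mistral-instruct style formatting; adjust to your model's chat template.
    joined = "\n".join(str(m.content or "") for m in messages)
    return f"<s>[INST] {joined} [/INST]"

llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.1",
    query_wrapper_prompt=PromptTemplate("<s>[INST] {query_str} [/INST] </s>\n"),
    messages_to_prompt=messages_to_prompt,
    context_window=3072,  # experiment with smaller/larger windows
    max_new_tokens=256,
    model_kwargs={"torch_dtype": torch.bfloat16},
    generate_kwargs={"temperature": 0.2, "top_k": 5, "top_p": 0.95},
    device_map="auto",
)
Settings.llm = llm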

I hope this helps clarify the behavior you're seeing. If you have any more questions or need further clarification, please don't hesitate to ask.

lambda7xx commented 5 months ago

I think the text below is not related to my query:

Question: How many Grand Slam titles does the winner of the 2020 Australian Open have?
Knowledge source context: Provides names of the winners of the 2020 Australian Open
Previous reasoning: None
Next question: Who was the winner of the 2020 Australian Open? 

Question: Who was the winner of the 2020 Australian Open?
Knowledge source context: Provides names of the winners of the 2020 Australian Open
Previous reasoning: None.
New question: Who was the winner of the 2020 Australian Open? 

Question: How many Grand Slam titles does the winner of the 2020 Australian Open have?
Knowledge source context: Provides information about the winners of the 2020 Australian Open
Previous reasoning:
- Who was the winner of the 2020 Australian Open? 
- The winner of the 2020 Australian Open was Novak Djokovic.
New question: None

Question: How many Grand Slam titles does the winner of the 2020 Australian Open have?
Knowledge source context: Provides information about the winners of the 2020 Australian Open - includes biographical information for each winner
Previous reasoning:
- Who was the winner of the 2020 Australian Open? 
- The winner of the 2020 Australian Open was Novak Djokovic.
New question: How many Grand Slam titles does Novak Djokovic have? 

Question: Based on the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models," what are the two primary objectives achieved in this work, and what is the range of parameters for the large language models developed?
Knowledge source context: None
Previous reasoning: None
New question:  [/INST] </s>
logan-markewich commented 5 months ago

It seems like your LLM just barfed while generating sub-queries (this "odd" query is a refine step, but the input to the refine step is part of the prompt for generating sub-queries?)

It's hard to know exactly where in your pipeline this happened without stepping through the code in a debugger.
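One way to avoid a debugger is to attach LlamaIndex's debug callback handler and dump every LLM call after the run. A rough sketch follows; the get_llm_inputs_outputs helper and the payload layout are worth double-checking against your installed version.

from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

# Attach the debug handler before building the index / query engine so it
# propagates to every component that makes LLM calls.
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
Settings.callback_manager = CallbackManager([llama_debug])

# ... build the index and MultiStepQueryEngine as in the snippet above,
#     then run: response = query_engine.query(query) ...

# Afterwards, dump each recorded LLM call (prompt in, completion out).
for start_event, end_event in llama_debug.get_llm_inputs_outputs():
    print(start_event.payload)
    print(end_event.payload)
    print("----")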

lambda7xx commented 5 months ago

It seems like your LLM just barfed while generating sub-queries (this "odd" query is a refine step, but the input to the refine step is part of the prompt for generating sub-queries?)

It's hard to know exactly where in your pipeline this happened without stepping through the code in a debugger.

Thank you so much.