run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Need help: Retrieval and generation not working well #15075

Open galvangoh opened 2 months ago

galvangoh commented 2 months ago

Question

Hello there, I have a use case for my company in which I set up a naive RAG + reranker application that extracts table line items from documents such as sales invoices into a structured format (a list of dictionaries, where each dictionary is one extracted line item). Note that the end-to-end process has zero human interaction, so some assumptions have to be made sensibly along the way, such as the top_k for similarity search and the n for reranking. When this application goes live, it is expected to process a wide variety of invoices from companies all over the world.

My issue is with the generation step. During querying, retrieval reaches all the correct nodes, yet the LLM responds with only partial results and I'm not sure why. In my reproducible code below, 4 out of 6 nodes contain table line items. The query is supposed to pull the line items from each of those 4 nodes, but the response only contains line items from 1, sometimes 2 or 3, of them, and never from all 4.

My question is: why does the final response only contain a subset of the line items when retrieval has already reached all the relevant nodes?

The end-to-end process in summary:

  1. LlamaParse parses the PDF document into raw markdown (a minimal sketch of this step is shown right after this list).
  2. Preprocess the markdown result (remove repeated parsing output) to reduce the chance of the LLM hallucinating.
  3. Chunk the preprocessed result, build an index, and create a query engine (with reranker) on top of the index.
  4. Send a user query that asks to extract all line items from the tables.
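
For context, step 1 might look roughly like the snippet below. This is only a sketch: the invoice.pdf path is a placeholder, since in reality the PDFs are read from Azure Blob Storage, and the LlamaParse API key is assumed to be available via the environment.

from llama_parse import LlamaParse

# placeholder path; in the real pipeline the PDF comes from Azure Blob Storage
pdf_path = 'invoice.pdf'

parser = LlamaParse(result_type='markdown')  # have LlamaParse return raw markdown
documents = parser.load_data(pdf_path)       # list of Document objects, one per file

# documents[0].text is the raw markdown that then gets preprocessed in step 2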

Sample files: preprocessed_markdown.txt raw_markdown.txt

Reproducible code:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
from llama_index.core import Settings
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_KEY')

embed_model = OpenAIEmbedding(model='text-embedding-3-large', api_key=OPENAI_API_KEY)
llm = OpenAI(model='gpt-4-turbo-preview', temperature=0, api_key=OPENAI_API_KEY, timeout=300)
reranker = FlagEmbeddingReranker(top_n=2, model='BAAI/bge-reranker-large') # keep top_n as default

Settings.llm = llm
Settings.embed_model = embed_model

# load in data
# in actuality, I'm reading pdfs from Azure blob storage and parsing with LlamaParse
# and my own functions to preprocess llama parse result.
reader = SimpleDirectoryReader(input_files=['preprocessed_markdown.txt'])
documents = reader.load_data()  # load_data returns a list of Document objects

# chunk data and build index
node_parser = MarkdownElementNodeParser(num_workers=8, show_progress=False)
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
index = VectorStoreIndex(nodes=base_nodes + objects)

# create query engine
num_pages = 4 # assuming all pages contains table
# add 2 to num_pages because I need to also reach out to context which does not exist in the table
recursive_query_engine = index.as_query_engine(similarity_top_k=num_pages + 2, node_postprocessors=[reranker])

# user query
table_query = """..."""
# my prompt is quite long, but it's about extracting all table line items into a structured format like [{},{},{}]; example below
# [{'OrderNumber': '2572814',
# 'Description': 'Transport charge',
# 'DeliveryDate': '2024-07-09',
# 'AmountTotalExclTax': 98.9}...]

# send query
response = recursive_query_engine.query(table_query)

Output when sending the query; it shows retrieval hitting all the correct nodes. Since the content of each of these nodes is just a table, I'm also wondering whether reranking is even needed here.

Retrieval entering 10a50dd8-84d1-436c-87b7-e8614f80f20f: TextNode
Retrieval entering dc092b9d-5fee-403d-a277-52cc3a11dcaf: TextNode
Retrieval entering bc321be0-4cf3-49ca-9a58-c2b3930fee23: TextNode
Retrieval entering 5bcba882-f7a1-48cf-9635-d4c14869522d: TextNode

I print the sources to check whether the query is retrieving from the nodes that contain tables (see the truncated output below). Indeed it is. But when I check the final response, it only contains context from Source Node 1. Why is that?!

from llama_index.core.response.pprint_utils import pprint_response  # import needed for pprint_response
pprint_response(response, show_source=True)

Final Response: [ { "OrderNumber": "2572814", "Description": "Transport charge", "DeliveryDate": "2024-08-07", "AmountTotalExclTax": 98.90 },...


Source Node 1/6 Node ID: 10a50dd8-84d1-436c-87b7-e8614f80f20f Similarity: 0.46272718967789855 Text: The table details shipping orders, including origin and...


Source Node 2/6 Node ID: 761dd3fa-5ee9-4eaa-8bef-aeca785b3f4b Similarity: 0.4034459885044025 Text: DOCUMENT TYPE - Invoice LAYOUT TYPE - Freight DOCUMENT...


Source Node 3/6 Node ID: dc092b9d-5fee-403d-a277-52cc3a11dcaf Similarity: 0.3986738573151639 Text: This table provides a detailed breakdown of charges for a...


Source Node 4/6 Node ID: bc321be0-4cf3-49ca-9a58-c2b3930fee23 Similarity: 0.3914106907395891 Text: The table details shipping orders from ARGUS POWERBEND to...


Source Node 5/6 Node ID: 5bcba882-f7a1-48cf-9635-d4c14869522d Similarity: 0.38712992189260836 Text: The table lists shipping orders from Argus Powerbend to various...


Source Node 6/6 Node ID: 12fccda9-1554-49c5-8e56-2c0d1a3780cf Similarity: 0.25209234246916096 Text: TOTAL AMOUNT EXCLUDING TAX - 1,847.30 TOTAL TAX AMOUNT -...

dosubot[bot] commented 2 months ago

Your LLM is likely responding with partial results when querying for table line items from multiple nodes because it processes nodes in batches and reranks them based on relevance. The reranking process is necessary to ensure that the most relevant nodes are returned.

The LLMRerank class in the provided code reranks nodes using an LLM. It processes nodes in batches (choice_batch_size) and uses a prompt to select the most relevant nodes. The reranking ensures that only the top N nodes (as specified by top_n) are returned.

Here is the relevant part of the code that explains this behavior:

def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        if query_bundle is None:
            raise ValueError("Query bundle must be provided.")
        if len(nodes) == 0:
            return []

        initial_results: List[NodeWithScore] = []
        for idx in range(0, len(nodes), self.choice_batch_size):
            nodes_batch = [
                node.node for node in nodes[idx : idx + self.choice_batch_size]
            ]

            query_str = query_bundle.query_str
            fmt_batch_str = self._format_node_batch_fn(nodes_batch)
            # call each batch independently
            raw_response = self.service_context.llm.predict(
                self.choice_select_prompt,
                context_str=fmt_batch_str,
                query_str=query_str,
            )

            raw_choices, relevances = self._parse_choice_select_answer_fn(
                raw_response, len(nodes_batch)
            )
            choice_idxs = [int(choice) - 1 for choice in raw_choices]
            choice_nodes = [nodes_batch[idx] for idx in choice_idxs]
            relevances = relevances or [1.0 for _ in choice_nodes]
            initial_results.extend(
                [
                    NodeWithScore(node=node, score=relevance)
                    for node, relevance in zip(choice_nodes, relevances)
                ]
            )

        return sorted(initial_results, key=lambda x: x.score or 0.0, reverse=True)[
            : self.top_n
        ]

This function processes nodes in batches, uses the LLM to predict the most relevant nodes, and then sorts and returns the top N nodes based on their scores. This reranking step is crucial to ensure that the most relevant results are returned, which is why it is necessary in your pipeline [1][2].
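
As a concrete illustration (a sketch, not a definitive fix): if a reranker with top_n=2 sits in the pipeline, only two nodes ever reach the LLM for synthesis, so at most two tables can contribute to the answer. Raising top_n to cover all table-bearing nodes, and passing the reranker via the node_postprocessors keyword (note the plural), might look like this; the value 6 is just an assumption matching similarity_top_k in the original snippet:

# sketch: let the reranker keep as many nodes as the retriever returns
reranker = FlagEmbeddingReranker(top_n=6, model='BAAI/bge-reranker-large')

recursive_query_engine = index.as_query_engine(
    similarity_top_k=num_pages + 2,
    node_postprocessors=[reranker],  # plural keyword, takes a list of postprocessors
)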

Additionally, using UpTrain with LlamaIndex can help evaluate and improve the accuracy of your RAG (Retrieval-Augmented Generation) pipeline. UpTrain provides various evaluations such as Context Relevance, Factual Accuracy, and Response Completeness to ensure the quality of the retrieved context and generated responses. These evaluations can help identify and resolve issues in your pipeline, ensuring that the reranking process effectively improves the system's accuracy [3][4].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 2 months ago

Try a different response mode, try modifying the prompt, use a different llm

All valid options
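
For example, switching the response mode is a one-line change on the query engine. A sketch (tree_summarize is just one candidate mode; the other arguments mirror the snippet above):

query_engine = index.as_query_engine(
    similarity_top_k=num_pages + 2,
    node_postprocessors=[reranker],
    response_mode="tree_summarize",  # e.g. instead of the default "compact"
)
response = query_engine.query(table_query)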

galvangoh commented 2 months ago

@logan-markewich Thanks for your reply. And from an architecture point of view, are there any pitfalls in how the application is set up?

kuruvachankgeorge commented 1 month ago

The max_tokens parameter can be set to a large enough number, like max_tokens=1024, when you initialise the LLM, e.g. llm = OpenAI(model='gpt-4-turbo-preview', temperature=0, api_key=OPENAI_API_KEY, timeout=300, max_tokens=1024)

alouiadel commented 1 month ago

@logan-markewich I have a similar issue where the LLM only responds based on the first page of a PDF. Could you provide some key points to check in order to pinpoint the problem?

galvangoh commented 1 month ago

The variable max_tokens can be set to a large enough number, like max_tokens=1024, when you initialise the llm. Eg: llm = OpenAI(model='gpt-4-turbo-preview', temperature=0, api_key=OPENAI_API_KEY, timeout=300, max_tokens=1024)

Increasing max_tokens helps but I guess my prompts are too long for the LLM to understand.

alouiadel commented 1 month ago

For me, the fix was to make sure I was saving all the objects returned by LlamaParse.
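
In the setup above, that roughly corresponds to making sure both the base text nodes and the table objects from MarkdownElementNodeParser end up in the index. A sketch reusing the names from the original snippet:

nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

# index the text nodes and the table/object nodes together so no table content is dropped
index = VectorStoreIndex(nodes=base_nodes + objects)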