run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: is there a way to return top k chunks in the order they appear in original document #14212

Open vkaul11 opened 4 months ago

vkaul11 commented 4 months ago

Question Validation

Question

I want to use one document, get the positions of the retrieved chunks within it, and pass the top k chunks to the LLM in the order they appear in the original document rather than by similarity score. Is there a way to do so? It would be really helpful for the LLM.

dosubot[bot] commented 4 months ago

Yes, there is a way to return the top k chunks in the order they appear in the original document rather than by similarity score. You can achieve this by leveraging the SummaryIndex class, which stores nodes in a sequence and iterates through them during query time.

Here's a method to do this:

  1. Retrieve the top k chunks based on similarity scores.
  2. Extract the positions of these chunks in the original document.
  3. Sort these chunks based on their original positions.

Below is a sample implementation:

from typing import List

from llama_index.core import QueryBundle, SummaryIndex
from llama_index.core.indices.list.retrievers import SummaryIndexEmbeddingRetriever
from llama_index.core.schema import BaseNode

def get_top_k_chunks_in_original_order(summary_index: SummaryIndex, top_k: int, query: str) -> List[BaseNode]:
    # Step 1: Retrieve the top k chunks based on similarity scores
    retriever = SummaryIndexEmbeddingRetriever(index=summary_index, similarity_top_k=top_k)
    query_bundle = QueryBundle(query_str=query)
    top_k_chunks_with_scores = retriever.retrieve(query_bundle)

    # Step 2: Look up each chunk's position in the original document
    # (summary_index.index_struct.nodes holds node IDs in document order)
    chunk_positions = [
        (chunk.node, summary_index.index_struct.nodes.index(chunk.node.node_id))
        for chunk in top_k_chunks_with_scores
    ]

    # Step 3: Sort the chunks by their original positions
    sorted_chunks = sorted(chunk_positions, key=lambda x: x[1])

    # Return only the chunks, dropping the positions
    return [chunk for chunk, position in sorted_chunks]

# Example usage
summary_index = SummaryIndex(nodes=your_nodes)
top_k_chunks = get_top_k_chunks_in_original_order(summary_index, top_k=5, query="your query")

This code assumes that the retrieve method of the SummaryIndexEmbeddingRetriever returns a list of NodeWithScore objects and that summary_index.index_struct.nodes contains the node IDs in their original order. Adjust the implementation as needed based on the actual methods and data structures in your environment [1][2][3].
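The three-step idea above (retrieve by score, look up document positions, sort) does not depend on any particular index class. Here is a dependency-free sketch with mock chunks; the `Chunk` dataclass and `sort_by_document_order` helper are hypothetical illustrations, not LlamaIndex APIs:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    node_id: str
    text: str
    score: float

def sort_by_document_order(retrieved: List[Chunk], document_order: List[str]) -> List[Chunk]:
    """Reorder retrieved chunks by their position in the original document."""
    # Map each node ID to its index in the document's original chunk sequence
    position = {node_id: i for i, node_id in enumerate(document_order)}
    return sorted(retrieved, key=lambda c: position[c.node_id])

# Mock: the document splits into chunks a, b, c, d; retrieval returned three of them by score.
doc_order = ["a", "b", "c", "d"]
retrieved = [
    Chunk("c", "third chunk", 0.9),
    Chunk("a", "first chunk", 0.8),
    Chunk("d", "fourth chunk", 0.7),
]
ordered = sort_by_document_order(retrieved, doc_order)
print([c.node_id for c in ordered])  # ['a', 'c', 'd']
```

The similarity scores are only used to pick which chunks to keep; once selected, document position alone decides the order they are presented to the LLM.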

To continue talking to Dosu, mention @dosu.

tonypeng1 commented 3 months ago

You can use the "start_char_idx" in a TextNode to arrange the retrieved nodes in the order they appear in the original document. A node that appears earlier in the document has a smaller value of "start_char_idx".

First, define a LlamaIndex custom node postprocessor class (instructions can be found at https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/) that sorts the nodes by the value of "start_char_idx". The code is as follows:

from typing import List, Optional

from llama_index.core import QueryBundle
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore

class SortNodePostprocessor(BaseNodePostprocessor):
    def _postprocess_nodes(
            self, 
            nodes: List[NodeWithScore], 
            query_bundle: Optional[QueryBundle]
            ) -> List[NodeWithScore]:

        # Custom post-processor: Order nodes based on the order they appear in a document (using "start_char_idx")

        # Create a new list of node dictionaries
        _nodes_dic = [{"start_char_idx": node.node.start_char_idx, "node": node} for node in nodes]

        # Sort based on "start_char_idx"
        sorted_nodes_dic = sorted(_nodes_dic, key=lambda x: x["start_char_idx"])

        # Get the new sorted list of nodes
        sorted_new_nodes = [node["node"] for node in sorted_nodes_dic]

        return sorted_new_nodes

Second, create a query engine with the new SortNodePostprocessor as one of its "node_postprocessors". The code is as follows. (You can replace the fusion retriever below with another type of retriever.)

from llama_index.core.query_engine import RetrieverQueryEngine

sort_engine = RetrieverQueryEngine.from_args(
    retriever=fusion_retriever,
    node_postprocessors=[SortNodePostprocessor()],
    )

When this engine is used on a query, the original retrieved nodes will be sorted by the order they appear in the original document before being fed to the LLM.

sort_response = sort_engine.query(query_str)

You can check the newly ordered nodes by typing,

print(sort_response.get_formatted_sources(length=2000))

I found this custom node postprocessor helpful in improving the answer to the question "What are the things that happen in New York?" over the document "paul_graham_essay.pdf" (the author was in New York in different years). Sorting the 5 nodes retrieved by the fusion retriever, which were originally not ordered as they appear in the document, made the answer follow the time sequence of the events in New York more faithfully.

(PS: This approach works with SentenceSplitter(), where "start_char_idx" increases sequentially through the document. However, it does not work with HierarchicalNodeParser(), where "start_char_idx" appears to be reassigned in some nodes.)
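One more caveat worth guarding against: "start_char_idx" is an optional field on TextNode, so nodes produced without character offsets carry None, and comparing None with an int would make the sort raise a TypeError. A defensive variant of the sort key is sketched below; MockNode is a hypothetical stand-in for the inner node of a NodeWithScore, not a LlamaIndex class:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MockNode:
    start_char_idx: Optional[int]  # TextNode.start_char_idx may be None
    text: str

def sort_key(node: MockNode) -> float:
    # Push nodes without a character offset to the end instead of raising TypeError
    return node.start_char_idx if node.start_char_idx is not None else float("inf")

nodes = [MockNode(200, "b"), MockNode(None, "c"), MockNode(5, "a")]
ordered = sorted(nodes, key=sort_key)
print([n.text for n in ordered])  # ['a', 'b', 'c']
```

Dropping offset-less nodes instead of appending them is an equally valid choice; the point is simply to decide the policy explicitly rather than let the sort fail at query time.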

dosubot[bot] commented 1 day ago

Hi, @vkaul11. I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale.

Your issue raised the need for a method to retrieve the top k chunks from a document while preserving their original order. In response, solutions were provided using the SummaryIndex class and the start_char_idx attribute in a custom node postprocessor, both of which effectively address your request.

Could you please let us know if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, feel free to comment here. Otherwise, you can close the issue yourself, or it will be automatically closed in 7 days. Thank you!