Open vkaul11 opened 4 months ago
Yes, there is a way to return the top k chunks in the order they appear in the original document rather than by similarity score. You can achieve this by leveraging the SummaryIndex class, which stores nodes in sequence and iterates through them at query time.
Here's a sample implementation:
from typing import List

from llama_index.core import QueryBundle, SummaryIndex
from llama_index.core.indices.list.retrievers import SummaryIndexEmbeddingRetriever
from llama_index.core.schema import BaseNode

def get_top_k_chunks_in_original_order(summary_index: SummaryIndex, top_k: int, query: str) -> List[BaseNode]:
    # Step 1: Retrieve the top k chunks ranked by similarity score
    retriever = SummaryIndexEmbeddingRetriever(index=summary_index, similarity_top_k=top_k)
    query_bundle = QueryBundle(query_str=query)
    top_k_chunks_with_scores = retriever.retrieve(query_bundle)

    # Step 2: Look up each chunk's position in the original document
    # (index_struct.nodes holds the node IDs in their original order)
    chunk_positions = [
        (chunk.node, summary_index.index_struct.nodes.index(chunk.node.node_id))
        for chunk in top_k_chunks_with_scores
    ]

    # Step 3: Sort the chunks by their original position
    sorted_chunks = sorted(chunk_positions, key=lambda x: x[1])

    # Return only the chunks, discarding the positions
    return [chunk for chunk, position in sorted_chunks]
# Example usage
summary_index = SummaryIndex(nodes=your_nodes)
top_k_chunks = get_top_k_chunks_in_original_order(summary_index, top_k=5, query="your query")
This code assumes that the retrieve method of the SummaryIndexEmbeddingRetriever returns a list of NodeWithScore objects and that summary_index.index_struct.nodes contains the node IDs in their original order. Adjust the implementation as needed based on the actual methods and data structures in your environment [1][2][3].
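The re-ordering step itself is independent of LlamaIndex. Here is a minimal, library-free sketch of the same idea, using hypothetical node IDs and scores in place of real NodeWithScore objects and index_struct.nodes:

```python
# Hypothetical stand-in for summary_index.index_struct.nodes:
# node IDs in the order they appear in the original document.
document_order = ["n0", "n1", "n2", "n3", "n4"]

# Hypothetical top-3 retrieval result, ranked by similarity score,
# not by document position.
retrieved = [("n3", 0.91), ("n0", 0.88), ("n2", 0.75)]

# Sort the retrieved chunks by their position in the original document.
in_doc_order = sorted(retrieved, key=lambda pair: document_order.index(pair[0]))

print([node_id for node_id, _ in in_doc_order])  # ['n0', 'n2', 'n3']
```

Note that list.index is O(n) per lookup; for large documents you may want to precompute a node_id-to-position dict instead.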
You can use the "start_char_idx" in a TextNode to arrange the retrieved nodes in the order they appear in the original document. A node that appears earlier in the document has a smaller value of "start_char_idx".
First, define a LlamaIndex "Custom Node PostProcessor" class (instructions can be found at https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/) that sorts the nodes by the value of "start_char_idx". The code is as follows:
from typing import List, Optional

from llama_index.core import QueryBundle
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore

class SortNodePostprocessor(BaseNodePostprocessor):
    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        # Custom postprocessor: order nodes by where they appear in the
        # document, using each node's "start_char_idx"
        _nodes_dic = [
            {"start_char_idx": node.node.start_char_idx, "node": node}
            for node in nodes
        ]

        # Sort on "start_char_idx"
        sorted_nodes_dic = sorted(_nodes_dic, key=lambda x: x["start_char_idx"])

        # Return the nodes in their new order
        return [d["node"] for d in sorted_nodes_dic]
Second, create a query engine that passes the new SortNodePostprocessor() as one of its "node_postprocessors". The code is as follows. (You can replace the fusion retriever below with another type of retriever.)
from llama_index.core.query_engine import RetrieverQueryEngine

sort_engine = RetrieverQueryEngine.from_args(
    retriever=fusion_retriever,
    node_postprocessors=[SortNodePostprocessor()],
)
When this engine runs a query, the retrieved nodes will be sorted into the order they appear in the original document before being passed to the LLM.
sort_response = sort_engine.query(query_str)
You can inspect the reordered nodes with:
print(sort_response.get_formatted_sources(length=2000))
I found this custom node postprocessor helpful in improving the answer to the question "What are the things that happen in New York?" in the document "paul_graham_essay.pdf" (since the author was in New York in different years). I used it to sort the 5 nodes retrieved by the fusion retriever (which were originally not in document order), making the answer more reasonable with respect to the time sequence of the events in New York.
(PS: This approach works with SentenceSplitter(), where "start_char_idx" increases sequentially through the document. However, it does not work with HierarchicalNodeParser(), where "start_char_idx" appears to be reassigned in some nodes.)
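The sorting logic at the heart of the postprocessor can be illustrated without LlamaIndex. The tuples below are hypothetical stand-ins for NodeWithScore objects; as long as start_char_idx grows with document position (as with SentenceSplitter), sorting on it recovers document order:

```python
# Hypothetical retrieved nodes as (start_char_idx, text) pairs,
# ranked by similarity rather than by document position.
retrieved = [(1200, "chunk C"), (0, "chunk A"), (450, "chunk B")]

# Sorting on start_char_idx restores the original document order.
sorted_nodes = sorted(retrieved, key=lambda n: n[0])

print([text for _, text in sorted_nodes])  # ['chunk A', 'chunk B', 'chunk C']
```

This also shows why the approach breaks under HierarchicalNodeParser: if start_char_idx values are reassigned and no longer reflect document position, the sort key no longer corresponds to the original order.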
Hi, @vkaul11. I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale.
Your issue raised the need for a method to retrieve the top k chunks from a document while preserving their original order. In response, solutions were provided using the SummaryIndex class and the start_char_idx attribute in a custom node postprocessor, both of which address your request.
Could you please let us know if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, feel free to comment here. Otherwise, you can close the issue yourself, or it will be automatically closed in 7 days. Thank you!
Question Validation
Question
I want to use one document, get the positions of the chunks within it, and pass the top k chunks to the LLM in the order they appear in the original document rather than by similarity score. Is there a way to do so? It would be really helpful for the LLM.