
[Question]: Issue with Retrieving Top K Base Nodes Using RecursiveRetriever #15487

Open SavasAli opened 4 weeks ago

SavasAli commented 4 weeks ago

Question

Summary

I'm encountering an issue with retrieving the top K base nodes using the RecursiveRetriever from LlamaIndex. When I try to retrieve the top K base nodes, it returns at most K nodes, and usually fewer. The base retriever retrieves K nodes, but the RecursiveRetriever then selects the base nodes from these, so several reference nodes that resolve to the same base node collapse into a single result.

Steps to Reproduce

I've followed the notebook example but modified it for my use case. Below is a minimal example to reproduce the issue.

import copy
from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import IndexNode, TextNode
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.llms.openai import OpenAI
from llama_index.core.embeddings import resolve_embed_model

import os

import nest_asyncio

nest_asyncio.apply()

def main():

    embed_model = resolve_embed_model("local:BAAI/bge-small-en")

    introduction_llama2 = """Introduction Large Language Models (LLMs) have shown great promise as highly capable AI assistants that excel in
    complex reasoning tasks requiring expert knowledge across a wide range of fields, including in specialized
    domains such as programming and creative writing. They enable interaction with humans through intuitive
    chat interfaces, which has led to rapid and widespread adoption among the general public.
    The capabilities of LLMs are remarkable considering the seemingly straightforward nature of the training
    methodology. Auto-regressive transformers are pretrained on an extensive corpus of self-supervised data,
    followed by alignment with human preferences via techniques such as Reinforcement Learning with Human
    Feedback (RLHF). Although the training methodology is simple, high computational requirements have
    limited the development of LLMs to a few players. There have been public releases of pretrained LLMs
    (such as BLOOM (Scao et al., 2022), LLaMa-1 (Touvron et al., 2023), and Falcon (Penedo et al., 2023)) that
    match the performance of closed pretrained competitors like GPT-3 (Brown et al., 2020) and Chinchilla
    (Hoffmann et al., 2022), but none of these models are suitable substitutes for closed “product” LLMs, such
    as ChatGPT, BARD, and Claude. These closed product LLMs are heavily fine-tuned to align with human
    preferences, which greatly enhances their usability and safety. This step can require significant costs in
    compute and human annotation, and is often not transparent or easily reproducible, limiting progress within
    the community to advance AI alignment research.
    In this work, we develop and release Llama 2, a family of pretrained and fine-tuned LLMs, Llama 2 and
    Llama 2-Chat, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested,
    Llama 2-Chat models generally perform better than existing open-source models. They also appear to
    be on par with some of the closed-source models, at least on the human evaluations we performed (see
    Figures 1 and 3). We have taken measures to increase the safety of these models, using safety-specific data
    annotation and tuning, as well as conducting red-teaming and employing iterative evaluations. Additionally,
    this paper contributes a thorough description of our fine-tuning methodology and approach to improving
    LLM safety. We hope that this openness will enable the community to reproduce fine-tuned LLMs and
    continue to improve the safety of those models, paving the way for more responsible development of LLMs.
    We also share novel observations we made during the development of Llama 2 and Llama 2-Chat, such as
    the emergence of tool usage and temporal organization of knowledge."""

    # Make sure to set your OpenAI API key
    # os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

    docs = [Document(text=introduction_llama2)]

    node_parser = SentenceSplitter(chunk_size=256)
    base_nodes = node_parser.get_nodes_from_documents(docs)

    # set node ids to be a constant
    for idx, node in enumerate(base_nodes):
        node.id_ = f"node-{idx}"

    # extractors from the original notebook; they are defined here for
    # reference, but the metadata is mocked below instead of running them
    extractors = [
        SummaryExtractor(summaries=["self"], show_progress=True),
        QuestionsAnsweredExtractor(questions=5, show_progress=True),
    ]

    all_nodes = copy.deepcopy(base_nodes)

    # mock extractor output: one summary and one question string per base node
    node_to_metadata = {
        node.id_: {"summary": "Summary of Llama2", "questions": "What is Llama2?"}
        for node in base_nodes
    }
    for node_id, metadata in node_to_metadata.items():
        for val in metadata.values():
            all_nodes.append(IndexNode(text=val, index_id=node_id))

    # pass the resolved local embed model explicitly; otherwise the default
    # (OpenAI) embedding model would be used
    vector_index_metadata = VectorStoreIndex(all_nodes, embed_model=embed_model)

    vector_retriever_metadata = vector_index_metadata.as_retriever(
        similarity_top_k=4
    )

    # Filtering function to get top K unique base nodes.
    # Note: retriever.retrieve() returns NodeWithScore objects, so the
    # underlying node lives at node_with_score.node; also, IndexNode
    # subclasses TextNode, so it has to be excluded explicitly.
    def retrieve_base_nodes(retriever, query, top_k):
        retrieved_nodes = retriever.retrieve(query)

        unique_base_nodes = []
        seen_base_node_ids = set()

        for node_with_score in retrieved_nodes:
            node = node_with_score.node
            if (
                isinstance(node, TextNode)
                and not isinstance(node, IndexNode)
                and node.node_id not in seen_base_node_ids
            ):
                unique_base_nodes.append(node_with_score)
                seen_base_node_ids.add(node.node_id)

            if len(unique_base_nodes) >= top_k:
                break

        return unique_base_nodes

    retrieved_base_nodes = retrieve_base_nodes(
        vector_retriever_metadata, "What is the purpose of this paper?", top_k=4
    )

    return retrieved_base_nodes

if __name__ == "__main__":
    main()
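
For reference, the notebook I followed wires the same vector retriever into a RecursiveRetriever roughly like this (a sketch reusing the variable names from the script above; not part of the minimal repro):

from llama_index.core.retrievers import RecursiveRetriever

# every node keyed by id, so IndexNode.index_id references can be resolved
all_nodes_dict = {n.node_id: n for n in all_nodes}

recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_metadata},
    node_dict=all_nodes_dict,
    verbose=True,
)
nodes = recursive_retriever.retrieve("What is the purpose of this paper?")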

Questions

  1. How can I retrieve the top K (unique) base nodes effectively?
  2. Should all nodes be IndexNode or should the base nodes be TextNode and the reference nodes IndexNode?

Expected vs. Actual Results

dosubot[bot] commented 4 weeks ago

Hello, @SavasAli! I'm here to help you with any bugs, questions, or becoming a contributor. Let's work together to solve this issue.

To retrieve the top K unique base nodes effectively with the RecursiveRetriever from LlamaIndex, you can subclass it: the inherited _retrieve_rec method already handles the recursive retrieval and removes duplicates based on node hashes, so the subclass only needs to deduplicate by node id, sort by score, and truncate to k. Here is an example of how you might implement this:

from typing import Dict, List, Optional

from llama_index.core.base.base_query_engine import BaseQueryEngine
from llama_index.core.retrievers import BaseRetriever, RecursiveRetriever
from llama_index.core.schema import BaseNode, NodeWithScore, QueryType


class TopKRecursiveRetriever(RecursiveRetriever):
    """RecursiveRetriever that returns only the top k unique nodes by score."""

    def __init__(
        self,
        root_id: str,
        retriever_dict: Dict[str, BaseRetriever],
        node_dict: Optional[Dict[str, BaseNode]] = None,
        query_engine_dict: Optional[Dict[str, BaseQueryEngine]] = None,
        k: int = 10,
        **kwargs,
    ) -> None:
        super().__init__(
            root_id,
            retriever_dict,
            node_dict=node_dict,
            query_engine_dict=query_engine_dict,
            **kwargs,
        )
        self.k = k

    def _remove_duplicates(self, nodes: List[NodeWithScore]) -> List[NodeWithScore]:
        seen = set()
        unique_nodes = []
        for node in nodes:
            # NodeWithScore wraps the node; the id lives at node.node.node_id
            if node.node.node_id not in seen:
                seen.add(node.node.node_id)
                unique_nodes.append(node)
        return unique_nodes

    def _top_k(self, nodes: List[NodeWithScore]) -> List[NodeWithScore]:
        unique_nodes = self._remove_duplicates(nodes)
        # scores can be None, so fall back to 0.0 when sorting
        return sorted(unique_nodes, key=lambda n: n.score or 0.0, reverse=True)[: self.k]

    def retrieve(self, str_or_query_bundle: QueryType) -> List[NodeWithScore]:
        # the inherited retrieve() runs _retrieve_rec under the hood
        return self._top_k(super().retrieve(str_or_query_bundle))

    async def aretrieve(self, str_or_query_bundle: QueryType) -> List[NodeWithScore]:
        return self._top_k(await super().aretrieve(str_or_query_bundle))

In this implementation, _remove_duplicates keeps the first occurrence of each node id, while retrieve and aretrieve run the normal recursive retrieval, deduplicate the results, sort them by score, and return at most k nodes.
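
Assuming the variables from your reproduction script (vector_retriever_metadata and all_nodes), usage would look roughly like this:

retriever = TopKRecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_metadata},
    node_dict={n.node_id: n for n in all_nodes},
    k=4,
)
top_nodes = retriever.retrieve("What is the purpose of this paper?")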

Additionally, you can refer to the LlamaIndex documentation for more detailed examples and usage scenarios, such as handling chunk references and metadata references to enhance the retrieval process [1][2][3][4][5].

Regarding your second question: the base nodes should be TextNode and the reference nodes should be IndexNode, whose index_id points back at a base node. This distinction is what lets the retriever resolve a retrieved reference to its underlying content.
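
A minimal sketch (illustrative ids and text, mirroring your script):

from llama_index.core.schema import IndexNode, TextNode

# the base node holds the actual content
base = TextNode(
    text="Llama 2 is a family of pretrained and fine-tuned LLMs.", id_="node-0"
)

# reference nodes (summaries, generated questions, smaller chunks) are
# IndexNodes whose index_id points back at the base node
refs = [
    IndexNode(text="Summary: introduces Llama 2.", index_id=base.node_id),
    IndexNode(text="What is Llama 2?", index_id=base.node_id),
]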


logan-markewich commented 4 weeks ago

Do you have duplicate nodes/data? Retrieved nodes are deduplicated if you have duplicates.
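
To illustrate what "duplicate" means here: if the deduplication is based on node content hashes, as the answer above suggests, two nodes with identical text collapse into one result even though their ids differ. A minimal illustration:

from llama_index.core.schema import TextNode

a = TextNode(text="identical content")
b = TextNode(text="identical content")

# ids are random and distinct, but the content hashes match
assert a.node_id != b.node_id
assert a.hash == b.hash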

SavasAli commented 4 weeks ago

No, I don't have duplicate nodes.

logan-markewich commented 4 weeks ago

If you can provide some code to replicate this issue, like in a Google Colab, I'm happy to dive deeper.

SavasAli commented 3 weeks ago

Hi @logan-markewich,

I have made a Google Colab notebook with public data to try to replicate the issue, but I haven't managed to reproduce it yet, unfortunately. When it's finished, I will share it with you.

Can I share it with you by email?

Best, Savas