Open SavasAli opened 4 weeks ago
Hello, @SavasAli! I'm here to help you with any bugs, questions, or becoming a contributor. Let's work together to solve this issue.
To retrieve the top K unique base nodes effectively using the RecursiveRetriever from LlamaIndex, you can utilize the _retrieve_rec method, which already includes logic to handle recursive retrieval and remove duplicates based on node hashes. Here is an example of how you might implement this:
    from typing import Dict, List, Optional

    # Import paths assume llama-index >= 0.10 (the llama_index.core namespace).
    from llama_index.core.base.base_query_engine import BaseQueryEngine
    from llama_index.core.base.base_retriever import BaseRetriever
    from llama_index.core.schema import IndexNode, NodeWithScore, QueryBundle, QueryType


    class RecursiveRetriever(BaseRetriever):
        # NOTE: _retrieve_rec is not defined in this snippet; it is assumed to be
        # carried over from LlamaIndex's own RecursiveRetriever.
        def __init__(self, root_id: str, retriever_dict: Dict[str, BaseRetriever], node_dict: Dict[str, IndexNode], query_engine_dict: Optional[Dict[str, BaseQueryEngine]] = None, k: int = 10, **kwargs):
            super().__init__(**kwargs)
            self._root_id = root_id
            self._retriever_dict = retriever_dict
            self._node_dict = node_dict
            self._query_engine_dict = query_engine_dict or {}
            self.k = k  # maximum number of unique nodes to return

        def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
            retrieved_nodes, _ = self._retrieve_rec(query_bundle, query_id=None)
            return retrieved_nodes

        def retrieve(self, str_or_query_bundle: QueryType) -> List[NodeWithScore]:
            self._check_callback_manager()
            if isinstance(str_or_query_bundle, str):
                query_bundle = QueryBundle(str_or_query_bundle)
            else:
                query_bundle = str_or_query_bundle
            nodes = self._retrieve(query_bundle)
            unique_nodes = self._remove_duplicates(nodes)
            # Scores may be None, so treat a missing score as 0.0 when sorting.
            top_k_nodes = sorted(unique_nodes, key=lambda n: n.score or 0.0, reverse=True)[:self.k]
            return top_k_nodes

        def _remove_duplicates(self, nodes: List[NodeWithScore]) -> List[NodeWithScore]:
            # Keep the first occurrence of each node ID.
            seen = set()
            unique_nodes = []
            for node in nodes:
                if node.node.node_id not in seen:
                    seen.add(node.node.node_id)
                    unique_nodes.append(node)
            return unique_nodes

        async def _aretrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
            # _retrieve_rec is assumed to be a regular (non-async) method, so it is not awaited.
            retrieved_nodes, _ = self._retrieve_rec(query_bundle, query_id=None)
            return retrieved_nodes

        async def aretrieve(self, str_or_query_bundle: QueryType) -> List[NodeWithScore]:
            self._check_callback_manager()
            if isinstance(str_or_query_bundle, str):
                query_bundle = QueryBundle(str_or_query_bundle)
            else:
                query_bundle = str_or_query_bundle
            nodes = await self._aretrieve(query_bundle)
            unique_nodes = self._remove_duplicates(nodes)
            top_k_nodes = sorted(unique_nodes, key=lambda n: n.score or 0.0, reverse=True)[:self.k]
            return top_k_nodes
In this implementation:
- The RecursiveRetriever class inherits from BaseRetriever.
- The retrieve and aretrieve methods are overridden to include sorting and selecting the top K unique nodes based on their scores.
- The _retrieve_rec method is used to handle the recursive retrieval.
- The _remove_duplicates method ensures that only unique nodes are considered.

Additionally, you can refer to the LlamaIndex documentation for more detailed examples and usage scenarios, such as handling chunk references and metadata references to enhance the retrieval process [1][2][3][4][5].
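The deduplicate-then-truncate step described above can also be sketched without any LlamaIndex dependency. This is only a minimal illustration, not the library's implementation: ScoredNode and top_k_unique are hypothetical names standing in for NodeWithScore and the retriever's post-processing, and this variant keeps the best-scoring copy of each node rather than the first one seen.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScoredNode:
    """Hypothetical stand-in for LlamaIndex's NodeWithScore."""
    node_id: str
    score: Optional[float] = None


def top_k_unique(nodes: List[ScoredNode], k: int) -> List[ScoredNode]:
    """Keep the best-scoring entry per node_id, then return the k highest."""
    best: Dict[str, ScoredNode] = {}
    for n in nodes:
        current = best.get(n.node_id)
        # A missing score is treated as 0.0 so comparisons never hit None.
        if current is None or (n.score or 0.0) > (current.score or 0.0):
            best[n.node_id] = n
    return sorted(best.values(), key=lambda n: n.score or 0.0, reverse=True)[:k]


hits = [
    ScoredNode("a", 0.9),
    ScoredNode("b", 0.7),
    ScoredNode("a", 0.95),  # duplicate of "a" with a higher score
    ScoredNode("c", None),  # node without a score
]
top = top_k_unique(hits, k=2)
print([(n.node_id, n.score) for n in top])
```

Note that if fewer than k unique nodes exist, the function simply returns all of them, so the result can be shorter than k.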
Regarding your second question, the base nodes should be TextNode and the reference nodes should be IndexNode. This distinction helps in managing and retrieving nodes effectively.
Do you have duplicate nodes/data? Retrieved nodes are deduplicated if you have duplicates.
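As a rough illustration of how deduplication can shrink a result set, here is one possible scenario. It assumes, which is not confirmed anywhere in this thread, that several of the K retrieved reference nodes point at the same base node. The IDs, the ref_to_base mapping, and resolve_to_base are all hypothetical and only model the idea in plain Python.

```python
from typing import Dict, List, Tuple


def resolve_to_base(
    hits: List[Tuple[str, float]], ref_to_base: Dict[str, str]
) -> List[Tuple[str, float]]:
    """Map (reference_id, score) hits to base nodes, keeping the best score per base node."""
    best: Dict[str, float] = {}
    for ref_id, score in hits:
        base_id = ref_to_base[ref_id]
        if base_id not in best or score > best[base_id]:
            best[base_id] = score
    # Highest-scoring base nodes first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: two reference nodes resolve to the same base node.
ref_to_base = {"ref-1": "base-A", "ref-2": "base-A", "ref-3": "base-B"}
hits = [("ref-1", 0.9), ("ref-2", 0.8), ("ref-3", 0.7)]  # K = 3 reference hits

base_nodes = resolve_to_base(hits, ref_to_base)
print(len(hits), len(base_nodes))
```

Under that assumption, three reference hits collapse into two base nodes even though the underlying data contains no duplicated documents.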
No, I don't have duplicate nodes.
If you can provide some code to replicate this issue, like in a Google Colab, I'm happy to dive deeper then.
Hi @logan-markewich,
I have made a Google Colab notebook with public data to try to replicate the issue, but unfortunately I haven't managed to replicate it yet. When it's finished, I will share it with you.
Can I share it using our live email?
Best, Savas
Question Validation
Question
Summary
I'm encountering an issue with retrieving the top K base nodes using the RecursiveRetriever from LlamaIndex. When I try to retrieve the top K base nodes, it returns K or fewer nodes. The base retriever retrieves K nodes, but the RecursiveRetriever selects the base nodes from these.
Steps to Reproduce
I've followed the notebook example but modified it for my use case. Below is a minimum example to reproduce the issue.
Questions
Expected vs. Actual Results