run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Feature Request]: AutoMergingRetriever using two Supabase VectorStoreIndex #10696

Open rahulkrprajapati opened 4 months ago

rahulkrprajapati commented 4 months ago

Feature Description

Since Supabase is not supported as a docstore, I tried implementing an AutoMergingRetriever using two Supabase-backed VectorStoreIndexes.

Changes I made:

added a get_document method to SupabaseVectorStore:

def get_document(self, doc_id: str, **kwargs: Any) -> Any:
    """Get document by doc id.

    Args:
        doc_id (str): document id
    """
    filters = {"doc_id": {"$eq": doc_id}}

    results = self._collection.query(
        data=None,
        filters=filters,
        include_value=True,
        include_metadata=True,
        **kwargs,
    )

    if len(results) > 0:
        id_, distance, metadata = results[0]
        text = metadata.pop("text", None)

        try:
            node = metadata_dict_to_node(metadata)
        except Exception:
            # NOTE: deprecated legacy logic for backward compatibility
            metadata, node_info, relationships = legacy_metadata_dict_to_node(
                metadata
            )
            node = TextNode(
                id_=id_,
                text=text,
                metadata=metadata,
                start_char_idx=node_info.get("start", None),
                end_char_idx=node_info.get("end", None),
                relationships=relationships,
            )

        return node

    return None
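
With this patch in place, a node can be fetched straight from the Supabase collection by its id, for example (parent_vector_store is constructed in the Usage section below; the id is a placeholder):

# Illustrative only: returns the reconstructed TextNode, or None if nothing matches
node = parent_vector_store.get_document("<some-node-id>")
if node is not None:
    print(node.node_id, node.get_content()[:80])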

changed two lines in auto_merging_retriever.py (both originally fetch nodes via self._storage_context.docstore.get_document):

in function _get_parents_and_merge:

parent_node = self._storage_context.vector_store.get_document(
    parent_node_id
)

and in function _fill_in_nodes:

next_node = self._storage_context.vector_store.get_document(
    cur_node.next_node.node_id
)

define a custom retriever:

from typing import Any, Dict, List, Optional, Tuple, cast

from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.indices.utils import truncate_text
from llama_index.core.indices.vector_store.retrievers.retriever import (
    VectorIndexRetriever,
)
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.schema import BaseNode, IndexNode, NodeWithScore, QueryBundle
from llama_index.core.storage.storage_context import StorageContext

class MyRetriever(AutoMergingRetriever):
    def __init__(
        self,
        node_vector_retriever: VectorIndexRetriever,
        leaf_node_vector_retriever: VectorIndexRetriever,
        storage_context: StorageContext,
        simple_ratio_thresh: float = 0.5,
        verbose: bool = False,
        callback_manager: Optional[CallbackManager] = None,
        object_map: Optional[dict] = None,
        objects: Optional[List[IndexNode]] = None,
    ) -> None:
        super().__init__(
            vector_retriever=node_vector_retriever,
            storage_context=storage_context,
            simple_ratio_thresh=simple_ratio_thresh,
            verbose=verbose,
            callback_manager=callback_manager,
            object_map=object_map,
            objects=objects,
        )
        self._leaf_node_vector_retriever = leaf_node_vector_retriever

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        initial_nodes = self._vector_retriever.retrieve(query_bundle)
        initial_leaf_nodes = self._leaf_node_vector_retriever.retrieve(query_bundle)

        # Merge the initial nodes and leaf nodes
        initial_nodes.extend(initial_leaf_nodes)

        cur_nodes, is_changed = self._try_merging(initial_nodes)
        while is_changed:
            cur_nodes, is_changed = self._try_merging(cur_nodes)

        # sort by similarity
        cur_nodes.sort(key=lambda x: x.get_score(), reverse=True)

        return cur_nodes

Usage:

from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core import VectorStoreIndex 
from llama_index.core.node_parser import get_leaf_nodes, get_root_nodes
 # Some more imports here

parent_vector_store = SupabaseVectorStore(
    postgres_connection_string=DB_CONNECTION,
    collection_name='parent_nodes',
    dimension=768
)
doc_storage_context = StorageContext.from_defaults(vector_store=parent_vector_store)
doc_index = VectorStoreIndex(nodes=nodes, storage_context=doc_storage_context)

leaf_nodes = get_leaf_nodes(nodes)

leaf_vector_store = SupabaseVectorStore(
    postgres_connection_string=DB_CONNECTION,
    collection_name='leaf_nodes',
    dimension=768
)
leaf_storage_context = StorageContext.from_defaults(vector_store=leaf_vector_store)
leaf_index = VectorStoreIndex(nodes=leaf_nodes, storage_context=leaf_storage_context)
leaf_retriever = leaf_index.as_retriever(similarity_top_k=6)
base_retriever = doc_index.as_retriever(similarity_top_k=6)
retriever = AutoMergingRetriever(leaf_retriever, storage_context=doc_storage_context, verbose=True)

my_retriever = MyRetriever(
    node_vector_retriever=base_retriever,
    leaf_node_vector_retriever=leaf_retriever,
    storage_context=doc_storage_context,
    verbose=True,
)

query_str = "What did the author do growing up?"
my_retrieved_nodes = my_retriever.retrieve(query_str)
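
To sanity-check the merging, the returned NodeWithScore objects can be inspected directly:

# Print score, node id, and a short content preview for each retrieved node
for n in my_retrieved_nodes:
    print(n.get_score(), n.node.node_id, n.node.get_content()[:80])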

Reason

AutoMergingRetriever pipeline using Supabase

Value of Feature

A complete Supabase-backed RAG pipeline that can serve as the backend for any frontend framework.
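
For example, the retriever can be wrapped in a standard query engine and exposed from an API endpoint (a minimal sketch assuming the stock RetrieverQueryEngine and an LLM configured globally, e.g. via Settings.llm):

from llama_index.core.query_engine import RetrieverQueryEngine

# Response synthesis uses whatever LLM is configured globally
query_engine = RetrieverQueryEngine.from_args(my_retriever)
response = query_engine.query("What did the author do growing up?")
print(response)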

dosubot[bot] commented 4 months ago

🤖

Hey @rahulkrprajapati, good to see you back! Hope you're doing great. Let's dive into this intriguing issue you've encountered.

Based on the information you've provided, the issue stems from the _retrieve method in the AutoMergingRetriever class. This method retrieves the initial nodes and then attempts to merge them. If any changes occur during merging, the code enters a while loop where it keeps trying to merge the current nodes until no more changes occur.

The loop can run indefinitely because _try_merging, which is called inside the while loop, can keep setting the is_changed variable back to True. This happens whenever the conditions for merging nodes in the _get_parents_and_merge and _fill_in_nodes methods are continuously met.

To fix this issue, you could add a condition to break the loop after a certain number of iterations or when no significant changes are made to the nodes. Here's an example of how you can do this:

MAX_ITERATIONS = 10  # Set a maximum number of iterations

def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    """Retrieve nodes given query.

    Implemented by the user.
    """
    initial_nodes = self._vector_retriever.retrieve(query_bundle)

    cur_nodes, is_changed = self._try_merging(initial_nodes)
    iterations = 0
    while is_changed and iterations < MAX_ITERATIONS:
        prev_nodes = cur_nodes.copy()
        cur_nodes, is_changed = self._try_merging(cur_nodes)
        iterations += 1
        # Break the loop if no significant changes are made to the nodes
        if cur_nodes == prev_nodes:
            break

    # sort by similarity
    cur_nodes.sort(key=lambda x: x.get_score(), reverse=True)

    return cur_nodes

In this code, the while loop will break after MAX_ITERATIONS iterations or when the current nodes are the same as the previous nodes, indicating that no significant changes are being made. This should prevent the infinite loop.

As for the get_document method in the SupabaseVectorStore class, I wasn't able to find specific information about its implementation in the LlamaIndex repository. It's possible that this method is used to retrieve a document from the vector store using its document ID, but without more information or context, it's hard to say for sure. If you could provide more details about this method and how it's used in your code, I might be able to give a more accurate answer.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

rahulkrprajapati commented 4 months ago

There's a slight correction:

while is_changed and iterations < self.MAX_ITERATIONS:
    prev_nodes = cur_nodes.copy()
    cur_nodes, is_changed = self._try_merging(cur_nodes)
    iterations += 1
    # Break the loop if no significant changes are made to the nodes
    if cur_nodes == prev_nodes:
        break
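
Folded back into MyRetriever from above, the guarded loop would look roughly like this (only the changed method is shown; MAX_ITERATIONS is assumed to be a class attribute):

class MyRetriever(AutoMergingRetriever):
    MAX_ITERATIONS = 10  # cap on merge passes so the loop always terminates

    # __init__ is unchanged from the definition above

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        # retrieve from both the parent-node and leaf-node indexes, then merge
        initial_nodes = self._vector_retriever.retrieve(query_bundle)
        initial_nodes.extend(self._leaf_node_vector_retriever.retrieve(query_bundle))

        cur_nodes, is_changed = self._try_merging(initial_nodes)
        iterations = 0
        while is_changed and iterations < self.MAX_ITERATIONS:
            prev_nodes = cur_nodes.copy()
            cur_nodes, is_changed = self._try_merging(cur_nodes)
            iterations += 1
            # stop early if a pass produced no change in the node list
            if cur_nodes == prev_nodes:
                break

        # sort by similarity
        cur_nodes.sort(key=lambda x: x.get_score(), reverse=True)
        return cur_nodes
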
logan-markewich commented 4 months ago

Hmm. Trying to think of a way to make this change generalize across the codebase.

Maybe if get_document() was implemented in the base vector store class, the typing could be updated to satisfy mypy where this function is used
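
Something like this rough stub is what I have in mind (purely illustrative; the real base class and naming would need discussion):

from typing import Any, Optional

from llama_index.core.schema import BaseNode


class BaseVectorStoreWithGetDocument:  # illustrative stand-in for the real base class
    def get_document(self, doc_id: str, **kwargs: Any) -> Optional[BaseNode]:
        """Fetch a single node by id.

        Integrations that can reconstruct nodes from stored metadata (like the
        SupabaseVectorStore patch above) override this; everything else keeps
        the default and raises.
        """
        raise NotImplementedError(
            f"{type(self).__name__} does not implement get_document()."
        )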

logan-markewich commented 4 months ago

the sad part about that is, we would need to invest considerable effort to implement that method for (most) vector dbs

rahulkrprajapati commented 4 months ago

Would love to help out.

I think having just three options for the docstore is a bummer 😔. Would love to have a few more options available. And Supabase just seemed like a good alternative given my use cases. It'll be awesome if I'm able to make it work for more vector dbs.

logan-markewich commented 4 months ago

@rahulkrprajapati yea totally agree, would like to have more docstore options (right now we have simple, redis, mongodb, postgres, firestore, and dynamodb)

c-goosen commented 2 months ago

Willing to help. Working with qdrant and stuck at this point as well.