run-llama / llama_index

[Question]: Recursive Retriever is not working with Storage Context #12413

Closed aliozts closed 3 months ago

aliozts commented 7 months ago

Question Validation

Question

Following the example from the documentation, I wanted to create a RecursiveRetriever using Qdrant as the vector DB and Redis as the docstore. This is what I'm doing:

# imports below assume llama-index >= 0.10 with the Qdrant and Redis integrations installed
from qdrant_client import QdrantClient
from redis import Redis

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.schema import IndexNode
from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.storage.kvstore.redis import RedisKVStore
from llama_index.vector_stores.qdrant import QdrantVectorStore


def create_sub_nodes(base_nodes):
    sub_chunk_sizes = [256, 512, 1024]
    sub_node_parsers = [SentenceSplitter(chunk_size=c) for c in sub_chunk_sizes]

    all_nodes = []

    for base_node in base_nodes:
        for parser in sub_node_parsers:
            sub_nodes = parser.get_nodes_from_documents([base_node])
            # wrap each sub-chunk as an IndexNode that points back to its base node
            sub_inodes = [
                IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
            ]
            all_nodes.extend(sub_inodes)

        # also add the original node itself as an IndexNode pointing to its own id
        original_node = IndexNode.from_text_node(base_node, base_node.node_id)
        all_nodes.append(original_node)

    all_nodes_dict = {n.node_id: n for n in all_nodes}
    return all_nodes, all_nodes_dict

redis_client = Redis(
    host="localhost",
    port=6379,
    decode_responses=True,
)
redis_kvstore = RedisKVStore.from_redis_client(redis_client=redis_client)
docstore = RedisDocumentStore(
    redis_kvstore=redis_kvstore, namespace=collection_name
)
client = QdrantClient(
    location=":memory:"
)
vector_store = QdrantVectorStore(
    collection_name=collection_name,
    client=client,
    enable_hybrid=False,
    batch_size=16,
)
nodes, nodes_dict = create_sub_nodes(base_nodes=base_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)

vector_index_chunk = VectorStoreIndex(
    nodes=nodes,
    embed_model=embed_model,
    insert_batch_size=128,
    storage_context=storage_context,
    show_progress=True,
)
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=10)

retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=nodes_dict, # will be replaced with docstore.docs
    verbose=True,
)

When I perform retrieval with this retriever_chunk, I cannot retrieve any documents. If I don't use the StorageContext, it works properly. It also works if I create the VectorStoreIndex this way and then rebuild it with VectorStoreIndex.from_vector_store (though I'm not sure that is the correct approach). May I ask if this can be checked? Thank you.

dosubot[bot] commented 7 months ago

Hey there @aliozts! Great to see you diving into another challenge 🚀

Based on the information you've provided, it seems like you're experiencing an issue with the RecursiveRetriever not working as expected when used with a StorageContext. This issue might be related to the way the StorageContext interacts with the VectorStoreIndex and RecursiveRetriever.

From the context provided, it appears that there are similar issues reported in the LlamaIndex repository. One such issue is "Failing to retrieve documents because vector store is empty when storage context is loaded", and another is "[Bug]: cannot get document node". The suggested solution for these issues is to ensure that the document ID exists within the correct collection, that the namespace and collection suffix are correctly configured, and that the key-value store is functioning properly.
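
For instance, a minimal sanity check along those lines (a sketch reusing the redis_client, docstore, collection_name, and Qdrant client from your snippet; the exact Redis key layout is an assumption) could confirm that both stores were actually populated:

# hedged sanity checks, reusing names from the snippet above
print(len(docstore.docs))  # nodes persisted in the Redis-backed docstore
print(redis_client.keys(f"{collection_name}*"))  # raw Redis keys under the docstore namespace
print(client.count(collection_name=collection_name))  # points written to the Qdrant collection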

In the LlamaIndex codebase, the StorageContext is used to manage different types of stores involved in storing and retrieving data. It encapsulates the document store (docstore), index store (index_store), vector stores (vector_stores), and graph store (graph_store). These components are essential for the retrieval process in the LlamaIndex system, enabling efficient storage, indexing, and retrieval of documents, vectors, and graph data.
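
As a rough sketch (reusing the stores from your snippet; anything not passed explicitly falls back to an in-memory default), the bundled components can be inspected directly:

from llama_index.core import StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
print(type(storage_context.docstore))      # the RedisDocumentStore you passed in
print(type(storage_context.vector_store))  # the QdrantVectorStore you passed in
print(type(storage_context.index_store))   # an in-memory default, since none was supplied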

The VectorStoreIndex class in the LlamaIndex codebase is designed to work on top of an existing vector store, facilitating the indexing and retrieval of vectorized representations of data. Its primary role is to manage the process of converting data nodes into their vector embeddings, storing these embeddings, and then allowing for efficient retrieval based on these vector representations.
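
For example (a sketch reusing the names from your snippet), building the index from nodes embeds them and writes them through the configured stores, while from_vector_store only wraps a store that is already populated:

# build: embeds `nodes` and writes them via the storage context
index = VectorStoreIndex(
    nodes=nodes, embed_model=embed_model, storage_context=storage_context
)

# wrap: attaches to an existing, already-populated vector store without re-embedding
index_from_store = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, embed_model=embed_model
)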

The RecursiveRetriever class in the LlamaIndex codebase is designed to handle the retrieval of nodes in a recursive manner, particularly focusing on exploring links from nodes to other retrievers or query engines. Its primary role in the retrieval process is to navigate through the nodes, identify if they are IndexNodes, and if so, query the linked retriever or query engine associated with those nodes.
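
Concretely (a sketch using the retriever_chunk from your snippet, with a placeholder query string), a retrieve call first hits the "vector" retriever and then resolves any IndexNode hits through node_dict:

retrieved = retriever_chunk.retrieve("example query")  # placeholder query
for node_with_score in retrieved:
    # each result is a NodeWithScore whose IndexNode link has already been resolved
    print(node_with_score.node.node_id, node_with_score.score)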

Given this information, it's possible that the issue you're experiencing is due to a misconfiguration or misunderstanding of how these components interact. I would recommend checking the configuration of your StorageContext and ensuring that your VectorStoreIndex and RecursiveRetriever are set up correctly. If you're still experiencing issues, it would be helpful to provide more information about the specific error messages you're seeing or the behavior that's not as expected.

aliozts commented 6 months ago

The way I used it is to first create the index with the storage_context, so that the vector store and docstore get populated, and then rebuild the index from the vector store:

# first create the index with vector store and docstore so that they are populated
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)

vector_index_chunk = VectorStoreIndex(
    nodes=nodes,
    embed_model=embeddings_creator,
    insert_batch_size=128,
    storage_context=storage_context,
    show_progress=True,
)

# then rebuild the index on top of the now-populated vector store
vector_index_chunk = VectorStoreIndex.from_vector_store(
    embed_model=embeddings_creator,
    vector_store=vector_store,
)

vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=10)

retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=docstore.docs,
    verbose=True,
)

This worked for me, but I'd prefer not to need this extra step and to use the vector_retriever_chunk from the original index directly.