run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Opensearch does not work with TextNode class #15736

Open mdciri opened 2 months ago

mdciri commented 2 months ago

Bug Description

I am trying to store my vector store index in OpenSearch, but it does not accept the nodes I created for the auto-merging retriever strategy. The leaf nodes produced by HierarchicalNodeParser() are of type TextNode, which does not have a get_doc_id() method.

Version

0.10.68

Steps to Reproduce

import os

from llama_index.core import StorageContext, ServiceContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.vector_stores.opensearch import OpensearchVectorStore, OpensearchVectorClient

from src.modules.vector_database import create_documentation

node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128], 
    chunk_overlap=20
)

merging_context = ServiceContext.from_defaults(
    llm=model,
    embed_model=embed_model,
    node_parser=node_parser
)

documents = create_documentation(documentation_folder)
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

index_name = os.getenv("OPENSEARCH_INDEX", "vector-index-test")
endpoint = os.getenv("OPENSEARCH_ENDPOINT", "http://localhost:9200")
text_field = "content"
embedding_field = "embedding"

client = OpensearchVectorClient(
    endpoint, index_name, 1024, embedding_field=embedding_field, text_field=text_field,
)

vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

automerging_index = VectorStoreIndex.from_documents(
    leaf_nodes,
    storage_context=storage_context,
    service_context=merging_context
)

Screenshot from 2024-08-30 09-37-52

Relevant Logs/Tracebacks

WARNING:opensearch:GET http://localhost:9200/vector-index-test [status:404 request:0.035s]
INFO:opensearch:PUT http://localhost:9200/vector-index-test [status:200 request:0.169s]
INFO:opensearch:POST http://localhost:9200/vector-index-test/_refresh [status:200 request:0.007s]

See the screenshot under "Steps to Reproduce" for the error itself.

dosubot[bot] commented 2 months ago

I found a similar closed issue that might help you resolve this problem: [Bug]: AttributeError: 'TextNode' object has no attribute 'get_doc_id' [1]. The issue arises because the TextNode class does not have the get_doc_id() method, which is required by the OpensearchVectorClient class when indexing results.

To work around this, the node objects need to expose a get_doc_id() method. One possible sketch is a thin TextNode subclass (the subclass name below is illustrative, and nodes returned by the parser would still need to be converted to it):

from llama_index.core.schema import TextNode

class TextNodeWithDocId(TextNode):
    # Expose the node's own ID under the method name that Document uses.
    def get_doc_id(self) -> str:
        return self.node_id

With a get_doc_id() method available on the node, the caller can retrieve an ID for it when indexing nodes [2].
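
The failure mode can also be illustrated without LlamaIndex at all. The following is a minimal sketch using toy classes: `ToyDocument`, `ToyTextNode`, `index_from_documents`, and `index_from_nodes` are hypothetical names invented for this example, not LlamaIndex APIs. It only shows how a document-oriented entry point that calls get_doc_id() fails on objects that carry a node ID instead, while a node-oriented entry point accepts them.

```python
import uuid


class ToyDocument:
    """Stand-in for a document object that exposes get_doc_id()."""

    def __init__(self, text):
        self.text = text
        self.id_ = str(uuid.uuid4())

    def get_doc_id(self):
        return self.id_


class ToyTextNode:
    """Stand-in for a parsed chunk: it has an ID but no get_doc_id()."""

    def __init__(self, text):
        self.text = text
        self.node_id = str(uuid.uuid4())


def index_from_documents(documents):
    # Document-oriented entry point: calls get_doc_id() on every item.
    return {doc.get_doc_id(): doc.text for doc in documents}


def index_from_nodes(nodes):
    # Node-oriented entry point: relies only on node_id.
    return {node.node_id: node.text for node in nodes}


leaf_nodes = [ToyTextNode("chunk a"), ToyTextNode("chunk b")]

try:
    index_from_documents(leaf_nodes)
    error_message = None
except AttributeError as exc:
    # The toy nodes have no get_doc_id(), so the document path fails.
    error_message = str(exc)

# The node path works on the same objects.
node_index = index_from_nodes(leaf_nodes)
```

In LlamaIndex terms, this likely corresponds to passing already-parsed nodes to the VectorStoreIndex constructor instead of to VectorStoreIndex.from_documents(), though that mapping is an assumption about the reporter's setup rather than something confirmed here.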

To continue talking to Dosu, mention @dosu.

mdciri commented 2 months ago

I am sorry, I found the error in how I created the index: I was using .from_documents() incorrectly, which was easy to spot. However, now that the index is created, the following does not work:

from llama_index.core.retrievers import AutoMergingRetriever

base_retriever = automerging_index.as_retriever(similarity_top_k=5)
retriever = AutoMergingRetriever(
    vector_retriever=base_retriever,
    storage_context=automerging_index.storage_context, 
    verbose=True
)

query_str = "<my_str_query>"
nodes = retriever.retrieve(query_str)

because I get:

INFO:opensearch:POST http://localhost:9200/vector-index-test/_search [status:200 request:0.052s]
---->  nodes = retriever.retrieve(query_str)
ValueError: doc_id f242477a-6be5-4905-bf20-33125fa0c876 not found.

However, it works if I simply do:

base_retriever.retrieve(query_str)

How is that possible?

dosubot[bot] commented 2 months ago

The AutoMergingRetriever returns a "doc_id not found" error because it relies on a hierarchical structure of nodes stored in a docstore. If the docstore does not contain the necessary nodes or if there is a mismatch in the node IDs, the AutoMergingRetriever will not be able to find the required documents. On the other hand, base_retriever.retrieve(query_str) works correctly because it directly retrieves from the vector store containing the leaf-level nodes, which are correctly indexed [1][2].
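
A minimal sketch of that difference, using plain Python stand-ins rather than the real LlamaIndex classes: `ToyDocstore`, `vector_retrieve`, and `auto_merge_retrieve` are hypothetical names, and the merge step is simplified to always replace leaves with their parents. It shows why a plain vector lookup can succeed while the merging step fails when parent nodes were never added to the docstore.

```python
class ToyDocstore:
    """Maps node IDs to nodes and raises when an ID is missing."""

    def __init__(self):
        self._nodes = {}

    def add(self, nodes):
        for node in nodes:
            self._nodes[node["id"]] = node

    def get(self, node_id):
        if node_id not in self._nodes:
            raise ValueError(f"doc_id {node_id} not found.")
        return self._nodes[node_id]


# A two-level hierarchy: one parent chunk with two leaf chunks.
parent = {"id": "parent-1", "text": "full section", "parent_id": None}
leaves = [
    {"id": "leaf-1", "text": "first half", "parent_id": "parent-1"},
    {"id": "leaf-2", "text": "second half", "parent_id": "parent-1"},
]


def vector_retrieve(vector_store):
    # Toy "search": return every leaf held by the vector store.
    return list(vector_store)


def auto_merge_retrieve(vector_store, docstore):
    # Simplified merge step: look up each retrieved leaf's parent
    # in the docstore and return the parents instead of the leaves.
    hits = vector_retrieve(vector_store)
    parent_ids = sorted({leaf["parent_id"] for leaf in hits})
    return [docstore.get(pid) for pid in parent_ids]


vector_store = leaves  # only leaf nodes were indexed

# Case 1: the docstore was never populated, so the parent lookup fails.
empty_docstore = ToyDocstore()
try:
    auto_merge_retrieve(vector_store, empty_docstore)
    failure = None
except ValueError as exc:
    failure = str(exc)

# Case 2: every node (parents and leaves) is in the docstore.
full_docstore = ToyDocstore()
full_docstore.add([parent] + leaves)
merged = auto_merge_retrieve(vector_store, full_docstore)
```

In LlamaIndex, the analogous step is usually to add all parsed nodes, not just the leaves, to the storage context's docstore (for example with storage_context.docstore.add_documents(nodes)) before building the index over the leaf nodes. Whether that is the missing step in this particular setup is an assumption.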
