run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

Updating nodes in the index with a vector store storage context #16833

Open hz6yc3 opened 1 week ago

hz6yc3 commented 1 week ago


Question

I have an index backed by a Milvus vector store, which is where the embeddings are stored. I built it by loading HTML documents and parsing them into nodes with HTMLNodeParser(), like this: index = VectorStoreIndex(nodes, storage_context=storage_context). I now have updated versions of the raw HTML files, and I am trying to find a way to update the index with the changed content as well as insert new content. From what I have read in the docs and forums, the update_ref_doc and refresh_ref_docs functions take Document objects as input to update existing documents in the index, but I don't see a way to update the corresponding nodes for those documents in the vector store.

dosubot[bot] commented 1 week ago

Hey @hz6yc3! 👋 I'm here to help you with any bugs, questions, or becoming a contributor. I'm Dosu, a bot designed to assist you while you wait for a human maintainer.

I found a similar unsolved discussion that might be relevant to your question. You can check out this discussion for more insights [1].

To continue talking to Dosu, mention @dosu.

hz6yc3 commented 1 week ago

@dosu No, the linked thread does not answer my question and is not relevant to it.

logan-markewich commented 1 week ago

@hz6yc3 assuming you maintain consistent doc_ids for your documents (i.e. the document for a particular webpage always has the same doc_id), that doc_id is what update_ref_doc/refresh_ref_docs use to anchor updates.
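One way to keep doc_ids consistent is to derive them from something about the source that does not change between runs, such as the file's path relative to the crawl root. The following is a minimal sketch in plain Python; `stable_doc_id` and the `/data/site` layout are hypothetical names used for illustration and are not part of LlamaIndex.

```python
from pathlib import PurePosixPath


def stable_doc_id(root: str, file_path: str) -> str:
    """Return an ID that depends only on the file's location under `root`."""
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    return rel.as_posix()


# The same page yields the same ID on every ingestion run,
# regardless of when the file was downloaded or what it contains.
first_run = stable_doc_id("/data/site", "/data/site/docs/intro.html")
second_run = stable_doc_id("/data/site", "/data/site/docs/intro.html")
```

The resulting string would then be assigned as the document's ID when each Document is built, so that a re-ingested page maps onto the entry stored for it earlier.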

They find the existing nodes with the same ref_doc_id and compare hashes to detect changes.

Now the caveat: this only works with a docstore. And by default, using an external DB like Milvus disables the docstore to simplify storage.
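To illustrate why a docstore is needed, here is a rough sketch of hash-based change detection in plain Python. It is not LlamaIndex's actual implementation; `content_hash`, `plan_refresh`, and the dict standing in for the docstore are hypothetical. The point is that the previous hash for each doc_id has to be stored somewhere between runs, otherwise nothing can be compared.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the document text so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(docstore: dict[str, str], incoming: dict[str, str]) -> dict[str, str]:
    """Decide per doc_id whether to insert, update, or skip.

    `docstore` maps doc_id -> last seen hash and is updated in place.
    `incoming` maps doc_id -> current document text.
    """
    actions = {}
    for doc_id, text in incoming.items():
        new_hash = content_hash(text)
        old_hash = docstore.get(doc_id)
        if old_hash is None:
            actions[doc_id] = "insert"  # never seen before
        elif old_hash != new_hash:
            actions[doc_id] = "update"  # same doc_id, content changed
        else:
            actions[doc_id] = "skip"  # unchanged, nothing to re-embed
        docstore[doc_id] = new_hash
    return actions


docstore: dict[str, str] = {}
first = plan_refresh(docstore, {"a.html": "v1", "b.html": "v1"})
second = plan_refresh(docstore, {"a.html": "v2", "b.html": "v1", "c.html": "v1"})
```

On the second run only `a.html` is marked for update and `c.html` for insert, while the unchanged `b.html` is skipped.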

I suggest following this guide tbh. The only change is to attach a vector store to the pipeline alongside the docstore, and stuff will automatically get upserted/inserted as needed (assuming you have consistent doc_ids on input documents).

https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline/

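from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# `docstore` and `vector_store` are assumed to be created beforehand
# (e.g. a document store instance and your Milvus vector store).
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=docstore,  # tracks doc_ids and hashes between runs
    vector_store=vector_store,  # new or changed nodes are upserted here
)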

You can also use a remote docstore like Redis (ignore the cache in that example, you don't need it): https://docs.llamaindex.ai/en/stable/examples/ingestion/redis_ingestion_pipeline/

dosubot[bot] commented 1 week ago

To update nodes in a Milvus vector store index with updated content and insert new content, you can use the update_ref_doc method, which takes a single Document object, and the refresh_ref_docs method, which takes a sequence of Document objects. Here's how you can use these methods:

  1. Update Existing Content: Use the update_ref_doc method to update a document and its corresponding nodes. This method deletes the existing document and inserts the updated one.

    # Assuming `index` is an instance of a class derived from BaseIndex
    # and `updated_document` is a Document object with updated content
    index.update_ref_doc(updated_document)

  2. Insert New Content: If you have new documents to insert, you can use the insert method directly.

    # Assuming `new_document` is a Document object with new content
    index.insert(new_document)

  3. Refresh Content: Use the refresh_ref_docs method to refresh the index with documents that have changed. This method updates documents that have any changes in text or metadata and inserts any documents that were not previously stored.

    # Assuming `documents` is a sequence of Document objects
    index.refresh_ref_docs(documents)

These methods handle the necessary operations to update and insert documents in the index, ensuring that the vector store is kept up-to-date with the latest content. The update_ref_doc method is equivalent to deleting the document and then inserting it again, while refresh_ref_docs optimizes by only updating documents with changes in text or metadata [1][2][3].


hz6yc3 commented 1 week ago

@logan-markewich Thanks for your response! Yes, I have changed the ingestion so that the metadata uses a consistent document ID. So I should set up an external docstore like Redis or MongoDB? Also, because I did not create a docstore for the existing index, I am assuming that I won't be able to update the previously indexed documents/nodes? And how do I create the index after generating the nodes from the ingestion pipeline?