Open hz6yc3 opened 1 week ago
Hey @hz6yc3! 👋 I'm here to help you with any bugs, questions, or becoming a contributor. I'm Dosu, a bot designed to assist you while you wait for a human maintainer.
I found a similar unsolved discussion that might be relevant to your question. You can check out this discussion for more insights [1].
To continue talking to Dosu, mention @dosu.
@dosu No, the provided thread does not answer my question and is not relevant to what I asked above.
@hz6yc3 assuming you maintain consistent doc_ids for your documents (i.e. the document for a particular webpage always has the same doc_id), that is what update/refresh_ref_docs uses to anchor updates.
It finds matching nodes with the same ref_doc_id and compares hashes.
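A rough sketch of that idea, in plain Python rather than LlamaIndex itself: the stored hash for a doc_id is compared with the hash of the incoming content. The `content_hash` and `classify` helpers, the dict used as a docstore, and the file path used as doc_id are all illustrative assumptions, not the library's actual internals.

```python
import hashlib


def content_hash(text: str, metadata: dict) -> str:
    """Hash text and metadata together so a change in either is detected."""
    payload = text + "\x00" + repr(sorted(metadata.items()))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify(docstore: dict, doc_id: str, text: str, metadata: dict) -> str:
    """Return 'new', 'changed', or 'unchanged' and record the latest hash."""
    new_hash = content_hash(text, metadata)
    old_hash = docstore.get(doc_id)
    if old_hash is None:
        status = "new"
    elif old_hash != new_hash:
        status = "changed"
    else:
        status = "unchanged"
    docstore[doc_id] = new_hash
    return status


# Hypothetical docstore: maps a stable doc_id (here, the page path) to a hash.
docstore = {}
first = classify(docstore, "pages/about.html", "<p>v1</p>", {"lang": "en"})
second = classify(docstore, "pages/about.html", "<p>v1</p>", {"lang": "en"})
third = classify(docstore, "pages/about.html", "<p>v2</p>", {"lang": "en"})
print(first, second, third)  # new unchanged changed
```

The key point is that the doc_id has to be derived from something stable (such as the page path or URL), otherwise every re-ingested page looks new.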
Now the caveat: this only works with a docstore, and by default, using an external vector db like Milvus disables the docstore to simplify storage.
I suggest following this guide tbh. The only change is to attach a vector store to the pipeline alongside the docstore, and stuff will automatically get upserted/inserted as needed (assuming you have consistent doc_ids on input documents)
https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline/
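from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# `docstore` and `vector_store` are assumed to be created beforehand
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=docstore,
    vector_store=vector_store,
)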
You can also use a remote docstore like redis (ignore the cache, you don't need it) https://docs.llamaindex.ai/en/stable/examples/ingestion/redis_ingestion_pipeline/
To update nodes in a Milvus vector store index with updated content and insert new content, you can use the update_ref_doc and refresh_ref_docs methods. update_ref_doc takes a single Document object, while refresh_ref_docs takes a sequence of Document objects. Here's how you can use these methods:
Update Existing Content: Use the update_ref_doc method to update a document and its corresponding nodes. This method deletes the existing document and inserts the updated one.
# Assuming `index` is an instance of a class derived from BaseIndex
# and `updated_document` is a Document object with updated content
index.update_ref_doc(updated_document)
Insert New Content: If you have new documents to insert, you can use the insert method directly.
# Assuming `new_document` is a Document object with new content
index.insert(new_document)
Refresh Content: Use the refresh_ref_docs method to refresh the index with documents that have changed. This method updates documents that have any changes in text or metadata and inserts any documents that were not previously stored.
# Assuming `documents` is a sequence of Document objects
index.refresh_ref_docs(documents)
These methods handle the necessary operations to update and insert documents in the index, ensuring that the vector store is kept up-to-date with the latest content. The update_ref_doc method is equivalent to deleting the document and then inserting it again, while refresh_ref_docs optimizes by only updating documents with changes in text or metadata [1][2][3].
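To make the difference concrete, here is a toy model in plain Python, not the LlamaIndex implementation. The `ToyIndex` class, its period-based splitter, and the `(doc_id, text)` tuples are hypothetical stand-ins; the method names only mirror the ones discussed above. It shows an update replacing every node that carries the document's ref_doc_id, and a refresh touching only new or changed documents.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Node:
    ref_doc_id: str  # id of the source document this chunk came from
    text: str


class ToyIndex:
    def __init__(self):
        self.nodes = []       # stands in for the vector store
        self.doc_hashes = {}  # stands in for the docstore

    @staticmethod
    def _hash(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def insert(self, doc_id, text):
        # Naive splitter: one node per sentence.
        chunks = [s.strip() for s in text.split(".") if s.strip()]
        self.nodes.extend(Node(doc_id, c) for c in chunks)
        self.doc_hashes[doc_id] = self._hash(text)

    def delete_ref_doc(self, doc_id):
        # Drop every node that points back to this document.
        self.nodes = [n for n in self.nodes if n.ref_doc_id != doc_id]
        self.doc_hashes.pop(doc_id, None)

    def update_ref_doc(self, doc_id, text):
        # Update = delete the old nodes, then insert the new ones.
        self.delete_ref_doc(doc_id)
        self.insert(doc_id, text)

    def refresh_ref_docs(self, docs):
        # Only new or changed documents are written; returns one flag per doc.
        refreshed = []
        for doc_id, text in docs:
            if doc_id not in self.doc_hashes:
                self.insert(doc_id, text)
                refreshed.append(True)
            elif self.doc_hashes[doc_id] != self._hash(text):
                self.update_ref_doc(doc_id, text)
                refreshed.append(True)
            else:
                refreshed.append(False)
        return refreshed


index = ToyIndex()
index.insert("a.html", "Old intro. Old body.")
index.insert("b.html", "Stable page.")
result = index.refresh_ref_docs([
    ("a.html", "New intro. Old body. Extra."),  # changed
    ("b.html", "Stable page."),                 # unchanged
    ("c.html", "Brand new."),                   # not seen before
])
print(result)  # [True, False, True]
```

Note that the changed document is re-chunked as a whole: individual nodes are never patched in place, which is why a consistent doc_id matters more than node ids.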
@logan-markewich Thanks for your response! Yes, I have changed the ingestion and updated the metadata to use a consistent document id. So I should set up an external docstore like Redis or MongoDB? Also, because I did not create a docstore for the already existing index, I am assuming that I won't be able to update the previously indexed documents/nodes? Also, how do I create the index after generating the nodes from the ingestion pipeline?
Question
I have an index created with the Milvus vector store as the storage context where the embeddings are stored. The index was created by loading HTML documents and parsing them into nodes using HTMLNodeParser(), like this:
index = VectorStoreIndex(nodes, storage_context=storage_context)
I have updated versions of the raw HTML files and am trying to find a way to update the index with the updated content as well as insert new content into the index. Based on what I am reading in the docs and forums, the update_ref_doc and refresh_ref_docs functions seem to take Document objects as input for updating existing documents in the index, but I don't see a way to update the relevant nodes for the existing documents in the vector store.