Open belmmostest opened 2 months ago
Hey @belmmostest! I'm here to help you with any issues or questions you have regarding the repository. I can assist you in solving bugs, answering questions, or becoming a contributor. Let me know how I can assist you!
The issue you're encountering with refresh_ref_docs
and delete_ref_docs
not reflecting changes across multiple vector indexes that share the same docstore is known. The LlamaCloudIndex
class methods refresh_ref_docs
and delete_ref_doc
handle synchronization by interacting with the managed service's pipeline, ensuring document changes are propagated and the pipeline ingestion status is updated accordingly. However, this synchronization might not be fully effective in your case due to the different embedding dimensions.
Here is the relevant code for refresh_ref_docs
:
def refresh_ref_docs(
self, documents: Sequence[Document], **update_kwargs: Any
) -> List[bool]:
"""Refresh an index with documents that have changed."""
with self._callback_manager.as_trace("refresh"):
pipeline_id = self._get_pipeline_id()
upserted_documents = self._client.pipelines.upsert_batch_pipeline_documents(
pipeline_id=pipeline_id,
request=[
CloudDocumentCreate(
text=doc.text,
metadata=doc.metadata,
excluded_embed_metadata_keys=doc.excluded_embed_metadata_keys,
excluded_llm_metadata_keys=doc.excluded_llm_metadata_keys,
id=doc.id_,
)
for doc in documents
],
)
doc_ids = [doc.id for doc in upserted_documents]
self._wait_for_documents_ingestion(
doc_ids, verbose=True, raise_on_error=True
)
return [True] * len(doc_ids)
And for delete_ref_doc
:
def delete_ref_doc(
self,
ref_doc_id: str,
delete_from_docstore: bool = False,
verbose: bool = False,
raise_if_not_found: bool = False,
**delete_kwargs: Any,
) -> None:
"""Delete a document and its nodes by using ref_doc_id."""
pipeline_id = self._get_pipeline_id()
try:
self._client.pipelines.delete_pipeline_document(
pipeline_id=pipeline_id, document_id=ref_doc_id
)
except ApiError as e:
if e.status_code == 404 and not raise_if_not_found:
logger.warning(f"ref_doc_id {ref_doc_id} not found, nothing deleted.")
else:
raise
# we have to wait for the pipeline instead of the document, because the document is already deleted
self._wait_for_pipeline_ingestion(
verbose=verbose, raise_on_partial_success=False
)
These methods ensure that document changes are correctly reflected across all vector indexes that share the same docstore by updating the pipeline and waiting for the ingestion process to complete [1].
If this behavior is not as expected, it might be worth considering it as a feature request to enhance the synchronization mechanism to account for different embedding dimensions or other specific requirements.
To continue talking to Dosu, mention @dosu.
Hmm, I see the point, but I don't know what a solution is. How can vector_index_512
know the existence of the completely separate vector store in vector_index_1024
?
It doesn't have to... The user would need to ensure that refresh_ref_docs
is run on all vector indexes that share the docstore. But the refresh_ref_doc
should have another "leg" that checks the following logic (rough outline):
ref_doc_id
exists in docstore but not vector store --> insert to vector store (most of the logic here is already covered by _handle_upserts - This alone would help already. ref_doc_id
exists in both docstore and vector store BUT existing_doc_hash != document.hash
we update. Here the existing_doc_hash
is the hash from node.ref_doc_id
update_ref_docs
and delete_ref_doc
. Another option might be to have refresh_ref_docs
return refreshed_documents, refreshed_doc_ids
which would make it easier to reflect the same changes on other indices.
In both of these options the user could opt in/out of secondary index refresh.
Another option might be having the index_struct
play a role in this. Might be wrong, but the struct would be the only pointer that there are more indices constructed on top of the same docstore.
I see breadcrumbs everywhere that point to this being the desired behavior, like here:
By default, if a document's content has changed and it is updated, we can pass an extra flag to delete_from_docstore. This flag is False by default because indexes can share the docstore. But since we only have one index, removing from the docstore is fine here.
If we kept the option as False, the document information would still be removed from the index_struct, which effectively makes that document invisibile to the index.
Let me know if this makes any sense?
This logic makes sense, but would require the vector store to have a method to lookup per node id
This is recently implemented in the base class (vector_store.get_nodes(node_ids=[...])
), but is only implemented in about 6 or so vector stores (just a heads up I guess)
For 2+3, isn't this the logic already? If the hash from the docstore has changed, it deletes and re-inserts on both the docstore and vector store.
Thanks for the update!
For 2+3 yes the logic is similar but right now existing_doc_hash
is the hash from docstore
- in this case (2) document.hash
would need to be compared to the hash
stored in the node.metadata
, otherwise the doc.hash would coincide with the docstore hash if the docstore was already updated (during the 512 refresh). Obviously not ideal though because it needs to check the entire index store.
Maybe better (as a temp patch) if there was a higher level util refresh_indices
-> takes in index mapping {"v1": 512, "v2":1024} -> calls refresh_ref_docs
on v1 -> runs update_ref_docs (delete_from_docstore = False)
with the same doc list on v2?
Premise of this is that update is essentially delete + insert and does not check against the docstore. Delete checks against the docstore but doesn't prevent deleting the nodes from v2.
I agree that it's a trickier issue, but otherwise docstore's utility is marginal (considering most real applications would expect docstores to be more dynamic). Especially apparent in evaluation scenarios where we frequently run several vector stores with different chunk_sizes, overlaps, etc.
Bug Description
As described here a docstore should allow for efficient document management and creation of multiple indices without document duplication.
However, the features often inhibits efficient management. For example, if we have 1 docstore and construct two vector indexes on top of it (one with 512 and another with 1024 embed_dim), using
refresh_ref_docs
or 'delete_ref_docs` will only reflect document changes on the vector index we use these functions on first.So if I have something like:
I "refresh" the nodes in the 512 index. However, the 1024 index does not get refreshed simply because it shares the docstore with 512. This behavior happens because the
refresh
anddelete
functions only check against the docstore and not, for example, if the ref_doc_ids/hashes match between the vector index and the docstore.This doesn't come across as expected so I'm raising it as a bug. Let me know know if it's intended behavior and we can change it to a feature request.
Version
v0.10.57
Steps to Reproduce
Relevant Logs/Tracbacks
No response