[Bug]: Broken docstores -> broken indices

belmmostest commented 2 months ago

Bug Description

As described here a docstore should allow for efficient document management and creation of multiple indices without document duplication.

However, the features often inhibits efficient management. For example, if we have 1 docstore and construct two vector indexes on top of it (one with 512 and another with 1024 embed_dim), using refresh_ref_docs or 'delete_ref_docs` will only reflect document changes on the vector index we use these functions on first.

So if I have something like:

docs = [some_new_docs]
vector_index_512
vector_index_1024

and then try to run:
vector_index_512 .refresh_ref_docs(docs)
vector_index_1024.refresh_ref_docs(docs)

I "refresh" the nodes in the 512 index. However, the 1024 index does not get refreshed simply because it shares the docstore with 512. This behavior happens because the refresh and delete functions only check against the docstore and not, for example, if the ref_doc_ids/hashes match between the vector index and the docstore.

This doesn't come across as expected so I'm raising it as a bug. Let me know know if it's intended behavior and we can change it to a feature request.

Version

v0.10.57

Steps to Reproduce

docs = [some_new_docs]
vector_index_512
vector_index_1024

and then try to run:
vector_index_512 .refresh_ref_docs(docs)
vector_index_1024.refresh_ref_docs(docs)

Relevant Logs/Tracbacks

No response

dosubot[bot] commented 2 months ago

Hey @belmmostest! I'm here to help you with any issues or questions you have regarding the repository. I can assist you in solving bugs, answering questions, or becoming a contributor. Let me know how I can assist you!

The issue you're encountering with refresh_ref_docs and delete_ref_docs not reflecting changes across multiple vector indexes that share the same docstore is known. The LlamaCloudIndex class methods refresh_ref_docs and delete_ref_doc handle synchronization by interacting with the managed service's pipeline, ensuring document changes are propagated and the pipeline ingestion status is updated accordingly. However, this synchronization might not be fully effective in your case due to the different embedding dimensions.

Here is the relevant code for refresh_ref_docs:

def refresh_ref_docs(
        self, documents: Sequence[Document], **update_kwargs: Any
    ) -> List[bool]:
    """Refresh an index with documents that have changed."""
    with self._callback_manager.as_trace("refresh"):
        pipeline_id = self._get_pipeline_id()
        upserted_documents = self._client.pipelines.upsert_batch_pipeline_documents(
            pipeline_id=pipeline_id,
            request=[
                CloudDocumentCreate(
                    text=doc.text,
                    metadata=doc.metadata,
                    excluded_embed_metadata_keys=doc.excluded_embed_metadata_keys,
                    excluded_llm_metadata_keys=doc.excluded_llm_metadata_keys,
                    id=doc.id_,
                )
                for doc in documents
            ],
        )
        doc_ids = [doc.id for doc in upserted_documents]
        self._wait_for_documents_ingestion(
            doc_ids, verbose=True, raise_on_error=True
        )
        return [True] * len(doc_ids)

And for delete_ref_doc:

def delete_ref_doc(
    self,
    ref_doc_id: str,
    delete_from_docstore: bool = False,
    verbose: bool = False,
    raise_if_not_found: bool = False,
    **delete_kwargs: Any,
) -> None:
    """Delete a document and its nodes by using ref_doc_id."""
    pipeline_id = self._get_pipeline_id()
    try:
        self._client.pipelines.delete_pipeline_document(
            pipeline_id=pipeline_id, document_id=ref_doc_id
        )
    except ApiError as e:
        if e.status_code == 404 and not raise_if_not_found:
            logger.warning(f"ref_doc_id {ref_doc_id} not found, nothing deleted.")
        else:
            raise

    # we have to wait for the pipeline instead of the document, because the document is already deleted
    self._wait_for_pipeline_ingestion(
        verbose=verbose, raise_on_partial_success=False
    )

These methods ensure that document changes are correctly reflected across all vector indexes that share the same docstore by updating the pipeline and waiting for the ingestion process to complete ^[1].

If this behavior is not as expected, it might be worth considering it as a feature request to enhance the synchronization mechanism to account for different embedding dimensions or other specific requirements.

To continue talking to Dosu, mention @dosu.

logan-markewich commented 2 months ago

Hmm, I see the point, but I don't know what a solution is. How can vector_index_512 know the existence of the completely separate vector store in vector_index_1024 ?

belmmostest commented 2 months ago

It doesn't have to... The user would need to ensure that refresh_ref_docs is run on all vector indexes that share the docstore. But the refresh_ref_doc should have another "leg" that checks the following logic (rough outline):

If ref_doc_id exists in docstore but not vector store --> insert to vector store (most of the logic here is already covered by _handle_upserts - This alone would help already.
If ref_doc_id exists in both docstore and vector store BUT existing_doc_hash != document.hash we update. Here the existing_doc_hash is the hash from node.ref_doc_id
Step 2 would call for update_ref_docs and delete_ref_doc.

Another option might be to have refresh_ref_docs return refreshed_documents, refreshed_doc_ids which would make it easier to reflect the same changes on other indices.

In both of these options the user could opt in/out of secondary index refresh.

Another option might be having the index_struct play a role in this. Might be wrong, but the struct would be the only pointer that there are more indices constructed on top of the same docstore.

I see breadcrumbs everywhere that point to this being the desired behavior, like here:

By default, if a document's content has changed and it is updated, we can pass an extra flag to delete_from_docstore. This flag is False by default because indexes can share the docstore. But since we only have one index, removing from the docstore is fine here.

If we kept the option as False, the document information would still be removed from the index_struct, which effectively makes that document invisibile to the index.

Let me know if this makes any sense?

logan-markewich commented 2 months ago

This logic makes sense, but would require the vector store to have a method to lookup per node id

This is recently implemented in the base class (vector_store.get_nodes(node_ids=[...])), but is only implemented in about 6 or so vector stores (just a heads up I guess)

For 2+3, isn't this the logic already? If the hash from the docstore has changed, it deletes and re-inserts on both the docstore and vector store.

belmmostest commented 2 months ago

Thanks for the update!

For 2+3 yes the logic is similar but right now existing_doc_hash is the hash from docstore - in this case (2) document.hash would need to be compared to the hash stored in the node.metadata, otherwise the doc.hash would coincide with the docstore hash if the docstore was already updated (during the 512 refresh). Obviously not ideal though because it needs to check the entire index store.

Maybe better (as a temp patch) if there was a higher level util refresh_indices -> takes in index mapping {"v1": 512, "v2":1024} -> calls refresh_ref_docs on v1 -> runs update_ref_docs (delete_from_docstore = False) with the same doc list on v2?

Premise of this is that update is essentially delete + insert and does not check against the docstore. Delete checks against the docstore but doesn't prevent deleting the nodes from v2.

I agree that it's a trickier issue, but otherwise docstore's utility is marginal (considering most real applications would expect docstores to be more dynamic). Especially apparent in evaluation scenarios where we frequently run several vector stores with different chunk_sizes, overlaps, etc.

run-llama / llama_index