qdrant / qdrant-haystack

An integration of Qdrant ANN vector database backend with Haystack
Apache License 2.0

document_store.update_embeddings seems to update embeddings regardless of parameter #39

Open theoky opened 11 months ago

theoky commented 11 months ago

I'm using qdrant-haystack 1.0.11 with farm-haystack==1.21.2 and Python 3.10.13 on Windows 10, with Qdrant running in Docker.

When updating the embeddings of a document store, document_store.update_embeddings seems to update all embeddings even when update_existing_embeddings is set to False.

I'm running this code:

import timeit
from haystack import Document
from haystack.nodes import EmbeddingRetriever
from qdrant_haystack.document_stores import QdrantDocumentStore

def update_embeddings(existing):
    document_store.update_embeddings(retriever, update_existing_embeddings=existing)

document_store = QdrantDocumentStore(url="localhost", index="test_update_embeddings",
                                    embedding_dim=512, similarity="cosine")

retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="sentence-transformers/distiluse-base-multilingual-cased-v1",
                               use_gpu=False)

docs_to_index = [Document(content=str(i) + " random text"*100) for i in range(0, 50)]

document_store.write_documents(docs_to_index, duplicate_documents="skip")

res_upd = timeit.timeit(stmt='update_embeddings(True)', globals=globals(), number=2) 
res_noupd = timeit.timeit(stmt='update_embeddings(False)', globals=globals(), number=2)

print(f"Execution with update: {res_upd}, with no update: {res_noupd}")

After execution, the Qdrant database contains 50 vectors, as expected.

I would also expect update_embeddings(False) to run significantly faster than update_embeddings(True), but both statements take nearly the same time:

Execution with update: 22.15771689999383, with no update: 20.913242900016485

To me this looks like update_embeddings(..., update_existing_embeddings=False) is updating the embeddings, too.

What am I missing?

theoky commented 11 months ago

I've just found this comment in the relevant source file:

:param update_existing_embeddings: Not used by QdrantDocumentStore, as all the points
                                   must have a corresponding vector in Qdrant.

So for my use case:

using update_embeddings does not work.

So a working use case would be

So update_embeddings is basically only useful when I change the model that generates the embeddings? To me, this seems to run somewhat counter to the intent of having a simple pipeline.
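
Since the docstring says every point in Qdrant must already have a vector, one possible workaround is to compute embeddings on the client side, only for documents that do not carry one yet, before writing them. The following is a minimal sketch of that idea, not the library's behaviour: `Doc`, `embed_missing` and `fake_embed` are hypothetical names introduced for illustration, and `Doc` is only a stand-in for haystack's `Document`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for haystack.Document (only the fields used here)."""
    content: str
    embedding: Optional[Sequence[float]] = None


def embed_missing(
    docs: List[Doc],
    embed_fn: Callable[[List[Doc]], Sequence[Sequence[float]]],
) -> int:
    """Attach embeddings only to documents that do not have one yet.

    Returns the number of documents that were actually embedded.
    """
    todo = [d for d in docs if d.embedding is None]
    if not todo:
        return 0  # nothing to do, so the (expensive) embedder is never called
    vectors = embed_fn(todo)
    for doc, vec in zip(todo, vectors):
        doc.embedding = vec
    return len(todo)


def fake_embed(batch: List[Doc]) -> List[List[float]]:
    """Toy embedder (assumption) so the sketch runs without loading a model."""
    return [[float(len(d.content))] for d in batch]


docs = [Doc("a"), Doc("bb", embedding=[9.0]), Doc("ccc")]
n_embedded = embed_missing(docs, fake_embed)  # only the two docs without a vector
```

In a real pipeline one would presumably pass the retriever's document-embedding method (I believe this is `retriever.embed_documents` in Haystack 1.x, but treat that as an assumption to verify) as `embed_fn`, and then call `document_store.write_documents(docs)` so that only new documents pay the embedding cost.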