qdrant / qdrant-client

Python client for Qdrant vector search engine
https://qdrant.tech
Apache License 2.0

high score with empty document string #542

Closed jdongca2003 closed 8 months ago

jdongca2003 commented 8 months ago

When Qdrant is used for vector embedding indexing, an empty document in the collection obtains a surprisingly high similarity score.

In the example below, the empty document string achieves a cosine similarity score of 0.797. If the document is empty, I assumed a zero vector would be used, so the cosine similarity score should be 0. Can you help?

e.g.

from typing import List
import numpy as np
from qdrant_client import QdrantClient

documents: List[str] = [
    "",
    "email address",
    "placeholder",
    "",
    "wireless customer",
    "He died in 1597 at the age of 57",
    "Maharana Pratap is considered a symbol of Rajput resistance against foreign rule",
    "He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar",
    "total active lines",
    ""
]

client = QdrantClient(":memory:")
client.set_model("BAAI/bge-small-en")
metadata = [{"source": "docs"} for doc in documents]
ids = [idx for idx in range(len(documents))]

client.add(
    collection_name="demo_collection",
    documents=documents,
    metadata=metadata,
    ids=ids
)

query = "Count the number of active residential customer"
search_result = client.query(collection_name="demo_collection", query_text=query, limit=5)
print(search_result)

Results:

[QueryResponse(id=8, embedding=None, metadata={'document': 'total active lines', 'source': 'docs'}, document='total active lines', score=0.8843179222400359),
 QueryResponse(id=4, embedding=None, metadata={'document': 'wireless customer', 'source': 'docs'}, document='wireless customer', score=0.8295176016136243),
 QueryResponse(id=1, embedding=None, metadata={'document': 'email address', 'source': 'docs'}, document='email address', score=0.8228306079924803),
 QueryResponse(id=2, embedding=None, metadata={'document': 'placeholder', 'source': 'docs'}, document='placeholder', score=0.8144248465983718),
 QueryResponse(id=9, embedding=None, metadata={'document': '', 'source': 'docs'}, document='', score=0.7972171966992909)]

joein commented 8 months ago

Hi @jdongca2003

I don't think the vector should be zero when the string is empty; it actually depends on the model you are using.

Could you please check whether you get a similar result with the original BAAI/bge-small-en model? (What I mean is: take the model from Hugging Face, compute the embeddings for your documents manually, and check whether the situation is the same.)

jdongca2003 commented 8 months ago

Thanks @joein for the quick response. I checked the embedding vector of the empty document. It is not a zero vector! But it is still not good behavior.

from typing import List
import numpy as np
from fastembed import TextEmbedding
import json

documents: List[str] = [
    "",
    "email address",
    "placeholder",
    "",
    "wireless customer",
    "He died in 1597 at the age of 57",
    "Maharana Pratap is considered a symbol of Rajput resistance against foreign rule",
    "He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar",
    "total active lines",
    ""
]

embedding_model = TextEmbedding(model_name="BAAI/bge-small-en", max_length=512)

embeddings: List[np.ndarray] = list(
    embedding_model.passage_embed(documents)
)  # notice that we are casting the generator to a list

#print(embeddings[0].shape, len(embeddings))

query = "Count the number of active residential customer"
query_embedding = list(embedding_model.query_embed(query))[0]

def print_top_k(query_embedding, embeddings, documents, k=5):
    # use numpy to calculate the cosine similarity between the query and the documents
    scores = np.dot(embeddings, query_embedding)
    for score, doc in zip(scores, documents):
        print(f'{doc}|score: {score}')
    # sort the scores in descending order
    sorted_scores = np.argsort(scores)[::-1]
    # print the top 5
    #for i in range(k):
    #    print(f"score: {scores[sorted_scores[i]]} Rank {i+1}: {documents[sorted_scores[i]]}")

print_top_k(query_embedding, embeddings, documents, k=5)

I directly calculated the cosine similarity scores:

|score: 0.7972172498703003
email address|score: 0.8228306174278259
placeholder|score: 0.814424991607666
|score: 0.7972172498703003
wireless customer|score: 0.8295177221298218
He died in 1597 at the age of 57|score: 0.7157479524612427
Maharana Pratap is considered a symbol of Rajput resistance against foreign rule|score: 0.7073748111724854
He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar|score: 0.7003960609436035

embedding vector for empty document:

dim = 384 [-2.53819916e-02 -5.44682052e-03 -5.09282853e-03 -1.49776395e-02 -1.08098146e-02 1.19938692e-02 1.92262717e-02 4.08581644e-02 -9.28279664e-03 1.56196468e-02 1.86153606e-03 -4.88135368e-02 6.96400367e-03 3.49483788e-02 3.50163616e-02 4.01080912e-03 3.18448767e-02 1.36998445e-02 -1.56665053e-02 1.64450370e-02 2.16239858e-02 -1.99406147e-02 1.17815230e-02 -1.80905703e-02 4.76054614e-03 2.72297114e-02 -5.90159511e-03 -8.18434451e-03 -4.85137738e-02 -1.91728160e-01 -3.33202034e-02 -1.37138087e-02 3.19078634e-03 -9.87244491e-03 -1.03822276e-02 -9.70588345e-03 -1.62116215e-02 1.38158510e-02 -1.09591316e-02 4.05766815e-02 2.16749441e-02 1.38471741e-02 -1.54241202e-02 -1.06100161e-02 5.69914840e-03 -2.26438437e-02 -1.67865120e-02 -6.69355411e-03 5.80454506e-02 -6.32909359e-03 2.05236953e-03 1.03720073e-02 ...
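Since the empty-string embedding is not zero, and assuming (as is typical for BGE models in fastembed) that the outputs are L2-normalized, the plain dot product in print_top_k above already equals the cosine similarity, so nothing forces the empty document's score toward 0. A minimal sketch of the computation, using toy unit vectors as hypothetical stand-ins for the real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Explicit cosine similarity; reduces to a plain dot product
    # when both vectors are already L2-normalized.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy unit vectors standing in for real 384-dim embeddings:
a = np.array([1.0, 0.0])
b = np.array([np.sqrt(0.5), np.sqrt(0.5)])
print(round(cosine_similarity(a, b), 4))  # 0.7071
```

For unit vectors the score is bounded only by [-1, 1], so a non-zero "empty" embedding can legitimately land near 0.8 against an arbitrary query.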

joein commented 8 months ago

Could you elaborate on "not good behavior"? What do you mean exactly?

jdongca2003 commented 8 months ago

I mean that good scoring behavior would give a low score to an empty document when the query is natural text.

joein commented 8 months ago

Unfortunately, we can't do anything about this. Qdrant provides a way to operate on embeddings; it has no control over the embedding values themselves. Those are determined by the model you've chosen.
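Given that, a practical client-side workaround is to drop empty or whitespace-only documents before calling client.add, so they never enter the collection at all. A minimal sketch; the drop_empty helper is hypothetical, not part of qdrant-client:

```python
from typing import Dict, List, Tuple

def drop_empty(
    documents: List[str],
    metadata: List[dict],
    ids: List[int],
) -> Tuple[List[str], List[dict], List[int]]:
    """Filter out empty/whitespace-only documents, keeping
    metadata and ids aligned with the surviving documents."""
    kept = [
        (doc, meta, idx)
        for doc, meta, idx in zip(documents, metadata, ids)
        if doc.strip()
    ]
    if not kept:
        return [], [], []
    docs, metas, idxs = (list(col) for col in zip(*kept))
    return docs, metas, idxs

docs, metas, idxs = drop_empty(["", "email address", "  "], [{}, {}, {}], [0, 1, 2])
print(docs)  # ['email address']
```

The filtered lists can then be passed to client.add in place of the originals, which removes empty documents from the result set entirely.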

jdongca2003 commented 8 months ago

Thanks @joein. Your clarification makes sense.