Closed jdongca2003 closed 8 months ago
Hi @jdongca2003
I don't think an empty string should necessarily produce a zero vector; it depends on the model you are using.
Could you please check whether you get a similar result with the original BAAI/bge-small-en model? (What I mean is: take the model from Hugging Face, compute the embeddings manually for your documents, and check whether the situation is the same.)
Thanks, Joein, for the quick response. I checked the embedding vector of the empty document. It is not a zero vector! But this is still not good behavior.
from typing import List
import numpy as np
from fastembed import TextEmbedding

documents: List[str] = [
    "",
    "email address",
    "placeholder",
    "",
    "wireless customer",
    "He died in 1597 at the age of 57",
    "Maharana Pratap is considered a symbol of Rajput resistance against foreign rule",
    "He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar",
    "total active lines",
    "",
]
embedding_model = TextEmbedding(model_name="BAAI/bge-small-en", max_length=512)
embeddings: List[np.ndarray] = list(
    embedding_model.passage_embed(documents)
)  # notice that we are casting the generator to a list
#print(embeddings[0].shape, len(embeddings))
query = "Count the number of active residential customer"
query_embedding = list(embedding_model.query_embed(query))[0]

def print_top_k(query_embedding, embeddings, documents, k=5):
    # dot product of each document embedding with the query embedding;
    # this equals the cosine similarity only when the embeddings are unit-length
    scores = np.dot(embeddings, query_embedding)
    for score, doc in zip(scores, documents):
        print(f'{doc}|score: {score}')
    # sort the document indices by score in descending order
    sorted_scores = np.argsort(scores)[::-1]
    # print the top k
    #for i in range(k):
    #    print(f"score: {scores[sorted_scores[i]]} Rank {i+1}: {documents[sorted_scores[i]]}")

print_top_k(query_embedding, embeddings, documents, k=5)
I directly calculated the cosine similarity scores:
|score: 0.7972172498703003
email address|score: 0.8228306174278259
placeholder|score: 0.814424991607666
|score: 0.7972172498703003
wireless customer|score: 0.8295177221298218
He died in 1597 at the age of 57|score: 0.7157479524612427
Maharana Pratap is considered a symbol of Rajput resistance against foreign rule|score: 0.7073748111724854
He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar|score: 0.7003960609436035
embedding vector for empty document:
dim = 384 [-2.53819916e-02 -5.44682052e-03 -5.09282853e-03 -1.49776395e-02 -1.08098146e-02 1.19938692e-02 1.92262717e-02 4.08581644e-02 -9.28279664e-03 1.56196468e-02 1.86153606e-03 -4.88135368e-02 6.96400367e-03 3.49483788e-02 3.50163616e-02 4.01080912e-03 3.18448767e-02 1.36998445e-02 -1.56665053e-02 1.64450370e-02 2.16239858e-02 -1.99406147e-02 1.17815230e-02 -1.80905703e-02 4.76054614e-03 2.72297114e-02 -5.90159511e-03 -8.18434451e-03 -4.85137738e-02 -1.91728160e-01 -3.33202034e-02 -1.37138087e-02 3.19078634e-03 -9.87244491e-03 -1.03822276e-02 -9.70588345e-03 -1.62116215e-02 1.38158510e-02 -1.09591316e-02 4.05766815e-02 2.16749441e-02 1.38471741e-02 -1.54241202e-02 -1.06100161e-02 5.69914840e-03 -2.26438437e-02 -1.67865120e-02 -6.69355411e-03 5.80454506e-02 -6.32909359e-03 2.05236953e-03 1.03720073e-02 ...
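For reference, here is a minimal sketch of why a zero vector would score 0 while any non-zero vector generally does not. It uses only the standard library; the helper name `cosine_similarity` and the toy two-dimensional vectors are illustrative assumptions, not part of fastembed or Qdrant. Returning 0.0 for a zero-norm vector is a convention chosen here, since the cosine is mathematically undefined in that case.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors.

    Returns 0.0 if either vector has zero norm (a convention chosen here,
    because the cosine is undefined for a zero vector).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real model output).
query = [0.6, 0.8]
zero_doc = [0.0, 0.0]      # what a "zero embedding" would look like
nonzero_doc = [0.1, 0.1]   # small magnitude, but still a direction

print(cosine_similarity(query, zero_doc))
print(cosine_similarity(query, nonzero_doc))
```

The second vector has small components, yet its score is high because cosine similarity ignores magnitude and only compares direction. The same reasoning applies to the non-zero embedding the model returns for an empty string.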
Could you elaborate on "not a good behavior"? What do you mean exactly?
I mean that, with good scoring behavior, an empty document would receive a low score when the query is natural text.
Unfortunately, we can't do anything about this. Qdrant provides a way to operate on embeddings, but it can't change the embedding values themselves. Those are determined by the model you've chosen.
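Since the scores come from the model rather than from Qdrant, one possible workaround on the caller's side is to skip empty documents before indexing them. The sketch below is only an illustration under that assumption: `drop_empty_documents` is a hypothetical helper, not a Qdrant or fastembed function, and treating whitespace-only strings as empty is a choice made here.

```python
from typing import List, Sequence, Tuple


def drop_empty_documents(
    documents: Sequence[str], ids: Sequence[int]
) -> Tuple[List[str], List[int]]:
    """Return only the non-empty documents together with their original ids.

    Whitespace-only strings are treated as empty (an assumption of this sketch).
    """
    kept = [(doc, idx) for doc, idx in zip(documents, ids) if doc.strip()]
    kept_docs = [doc for doc, _ in kept]
    kept_ids = [idx for _, idx in kept]
    return kept_docs, kept_ids


documents = ["", "email address", "placeholder", "", "wireless customer"]
ids = list(range(len(documents)))

filtered_docs, filtered_ids = drop_empty_documents(documents, ids)
print(filtered_docs)
print(filtered_ids)
```

The filtered lists could then be passed to the indexing call in place of the originals, so the empty strings never receive an embedding. Keeping the original ids makes it possible to map search results back to the unfiltered list.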
Thanks, joein. Your clarification is very reasonable.
When Qdrant is used for vector embedding indexing, an empty document in the collection obtains a strangely high similarity score.
Here we can observe that the empty document achieves a cosine similarity score of 0.797. If the document is empty, I assumed that a zero vector would be used, so the cosine similarity score should be 0. Can you help?
e.g.
from typing import List
import numpy as np
from qdrant_client import QdrantClient

documents: List[str] = [
    "",
    "email address",
    "placeholder",
    "",
    "wireless customer",
    "He died in 1597 at the age of 57",
    "Maharana Pratap is considered a symbol of Rajput resistance against foreign rule",
    "He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar",
    "total active lines",
    "",
]

client = QdrantClient(":memory:")
client.set_model("BAAI/bge-small-en")
metadata = [{"source": "docs"} for doc in documents]
ids = [idx for idx in range(len(documents))]

client.add(
    collection_name="demo_collection",
    documents=documents,
    metadata=metadata,
    ids=ids,
)

query = "Count the number of active residential customer"
search_result = client.query(
    collection_name="demo_collection",
    query_text=query,
    limit=5,
)
print(search_result)
Results:
[QueryResponse(id=8, embedding=None, metadata={'document': 'total active lines', 'source': 'docs'}, document='total active lines', score=0.8843179222400359),
 QueryResponse(id=4, embedding=None, metadata={'document': 'wireless customer', 'source': 'docs'}, document='wireless customer', score=0.8295176016136243),
 QueryResponse(id=1, embedding=None, metadata={'document': 'email address', 'source': 'docs'}, document='email address', score=0.8228306079924803),
 QueryResponse(id=2, embedding=None, metadata={'document': 'placeholder', 'source': 'docs'}, document='placeholder', score=0.8144248465983718),
 QueryResponse(id=9, embedding=None, metadata={'document': '', 'source': 'docs'}, document='', score=0.7972171966992909)]