xyzgorba opened 3 months ago
Hey @xyzgorba, good to see you back in the trenches of code! Hope you're ready to untangle this mystery together. 🧐
The discrepancy you're observing between the similarity scores returned by `query_engine.query(qry_string)` and those obtained by directly calculating the similarity using embeddings and the `Settings.embed_model.similarity` method could indeed be due to several factors, including the impact of the node's metadata. Here are some potential reasons:
Metadata Filtering: The `query` method supports filtering based on metadata. If your query includes metadata filters, the set of nodes considered for similarity scoring could be different from the set considered when directly calculating similarity scores without such filters. This could lead to discrepancies in the top nodes returned and their similarity scores.
Sparse Vector Handling: The `query` method supports querying with sparse vectors, which are generated based on the text content of the nodes. The handling of sparse vectors, especially in hybrid query modes, involves adjusting the weights of the sparse vector elements and the query embedding vector elements. This adjustment could affect the similarity scores compared to direct calculations that do not consider sparse vector representations.
Namespace and Filtering: The query execution within the `query` method allows for namespace-specific querying and additional filtering parameters. These aspects can limit the scope of the query to a subset of the entire index, potentially leading to differences in the results and similarity scores compared to a direct calculation that does not account for these constraints.
Score Calculation Method: The underlying implementation of the query might use optimizations or approximations for efficiency, which could lead to slight differences in the similarity scores compared to direct calculations using the `Settings.embed_model.similarity` method.
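The first point above can be illustrated with a minimal, framework-free sketch (the node data and `top_k` helper here are hypothetical, not LlamaIndex's actual retrieval code): a metadata filter shrinks the candidate set before scoring, so a different node can land at the top even though no individual score changes.

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical corpus: (node_id, embedding, metadata).
nodes = [
    ("n1", [1.0, 0.0], {"source": "docs"}),
    ("n2", [0.9, 0.1], {"source": "blog"}),
    ("n3", [0.2, 0.8], {"source": "docs"}),
]

query = [1.0, 0.0]

def top_k(candidates, k=1):
    scored = [(nid, cosine(query, emb)) for nid, emb, _ in candidates]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

# Unfiltered search considers every node: n1 wins.
unfiltered_top = top_k(nodes)

# Filtering on metadata first removes n1 from contention entirely,
# so n2 surfaces at the top with its unchanged score.
filtered = [n for n in nodes if n[2]["source"] == "blog"]
filtered_top = top_k(filtered)
```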
You can find more details about these aspects in the `llama-index-integrations/vector_stores/llama-index-vector-stores-pinecone/llama_index/vector_stores/pinecone/base.py` file.
As for how the `query_engine.query(qry_string)` method calculates similarity scores, it's not directly shown in the provided context. However, the context does include a method named `similarity` within the `BaseEmbedding` class, which is likely used for calculating similarity scores between embeddings. This method delegates the actual similarity calculation to a standalone `similarity` function defined outside the class. The calculation depends on the mode specified: Euclidean distance, dot product, or cosine similarity (the default). You can find more details about these calculations in the `llama-index-core/llama_index/core/base/embeddings/base.py` file.
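A pure-Python sketch of those three modes (an approximation of the standalone function's behavior as described above, not the library's exact code):

```python
import math

def similarity(e1, e2, mode="cosine"):
    """Score two embeddings; higher means more similar in every mode."""
    if mode == "euclidean":
        # Distance is negated so that larger still means "more similar".
        return -math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))
    dot = sum(a * b for a, b in zip(e1, e2))
    if mode == "dot_product":
        return dot
    # Default: cosine similarity.
    norm = math.sqrt(sum(a * a for a in e1)) * math.sqrt(sum(b * b for b in e2))
    return dot / norm
```

Note that identical vectors score 1.0 under cosine but 0.0 under negated Euclidean distance, which is one reason scores produced under different modes are not directly comparable.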
Regarding how the `query_engine.query(qry_string)` method handles metadata filtering, I wasn't able to find specific details within the repository. It's possible that the answer may be available elsewhere or I could have missed it. If you could provide more details about the specific metadata filtering you're using or any other relevant information, it would be helpful in providing a more accurate answer.
@xyzgorba what vector db are you using? The default? Something else? Many vector dbs use approximate methods to speed up similarity search, rather than computing exact cosine similarity.
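To make the approximate-search point concrete, here is a toy illustration (not any real vector database's algorithm): compressing stored embeddings, as many approximate-nearest-neighbor indexes effectively do, shifts the reported score away from the exact cosine value.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def quantize(v, step=0.1):
    # Crude stand-in for the lossy compression real ANN indexes apply.
    return [round(x / step) * step for x in v]

stored = [0.123, 0.987]
query = [0.5, 0.5]

exact = cosine(query, stored)            # score against the true vector
approx = cosine(query, quantize(stored)) # score against the compressed one
```

The two scores disagree slightly, and with many close candidates such small shifts are enough to reorder the top-k list.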
@xyzgorba Also, embeddings are calculated with a different text attribute:

```python
embedding = Settings.embed_model.get_text_embedding(node.get_content(metadata_mode="embed"))
```
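The difference matters because `metadata_mode` controls whether metadata text is folded into the string that actually gets embedded. A simplified stand-in for a node (a hypothetical class, not LlamaIndex's `TextNode`) shows the idea:

```python
class Node:
    """Toy node: raw text plus a metadata dict."""

    def __init__(self, text, metadata):
        self.text = text
        self.metadata = metadata

    def get_content(self, metadata_mode="none"):
        if metadata_mode == "none":
            return self.text
        # "embed" mode prepends metadata as key: value lines, so the
        # embedded string differs from the raw text.
        header = "\n".join(f"{k}: {v}" for k, v in self.metadata.items())
        return f"{header}\n\n{self.text}"

node = Node("LlamaIndex is a data framework.", {"file_name": "intro.md"})
content_plain = node.get_content()                       # raw text only
content_embed = node.get_content(metadata_mode="embed")  # metadata + text
```

So a "direct" score computed from `node.get_content()` compares the query against a different string than the one the index embedded, and that alone can reorder the retrieved results.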
@logan-markewich - thank you for the answer. Yes, it was the default vector db. Thank you for pointing out that embeddings are calculated with a different text attribute. In my case the problem is not that the scores are different, but that much less relevant nodes were retrieved as a result, compared to direct cosine similarity.
Question Validation
Question
Hello, I can't understand why the similarity scores for the top nodes returned by `query_engine.query(qry_string)` differ from a direct calculation for some specific node, e.g. using:
What is the possible reason? Could it be an impact from the node's metadata? I cannot find the answer in the documentation.