run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

Similarity score in nodes: how is it calculated #12461

Open xyzgorba opened 3 months ago

xyzgorba commented 3 months ago

Question

Hello, I can't understand why the similarity scores for the top nodes returned by

query_engine.query(qry_string)

differ from a direct calculation for a specific node, e.g.:

from llama_index.core import Settings

embedding = Settings.embed_model.get_text_embedding(node.text)
query_embedding = Settings.embed_model.get_query_embedding(qry_string)
score = Settings.embed_model.similarity(embedding, query_embedding)

What could the reason be? Could it be the impact of the node's metadata? I can't find an answer in the documentation.

dosubot[bot] commented 3 months ago

Hey @xyzgorba, good to see you back in the trenches of code! Hope you're ready to untangle this mystery together. 🧐

The discrepancy between the similarity scores returned by query_engine.query(qry_string) and those obtained by calculating similarity directly with Settings.embed_model.similarity could indeed be due to several factors, including the node's metadata. Here are some potential reasons:

  1. Metadata Filtering: The query method supports filtering based on metadata. If your query includes metadata filters, the set of nodes considered for similarity scoring could be different from the set considered when directly calculating similarity scores without such filters. This could lead to discrepancies in the top nodes returned and their similarity scores.

  2. Sparse Vector Handling: The query method supports querying with sparse vectors generated from the nodes' text content. In hybrid query mode, the weights of the sparse vector elements and the dense query embedding elements are adjusted against each other (see the sketch after this list), which can shift scores away from a direct calculation that does not use sparse representations.

  3. Namespace and Filtering: The query execution within the query method allows for namespace-specific querying and additional filtering parameters. These aspects can limit the scope of the query to a subset of the entire index, potentially leading to differences in the results and similarity scores compared to a direct calculation that does not account for these constraints.

  4. Score Calculation Method: The underlying implementation of the query might use optimizations or approximations for efficiency, which could lead to slight differences in the similarity scores compared to direct calculations using the Settings.embed_model.similarity method.

You can find more details about these aspects in the llama-index-integrations/vector_stores/llama-index-vector-stores-pinecone/llama_index/vector_stores/pinecone/base.py file.
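
To make the sparse-vector point concrete, here is a minimal sketch of the convex-combination weighting commonly used for hybrid search. The function name and exact scheme are illustrative assumptions, not the library's internals:

from typing import Dict, List, Tuple

def hybrid_convex_scale(
    dense: List[float],
    sparse: Dict[int, float],
    alpha: float,
) -> Tuple[List[float], Dict[int, float]]:
    # alpha=1.0 -> pure dense (semantic) ranking; alpha=0.0 -> pure sparse
    # (keyword) ranking; anything in between blends the two, so hybrid
    # scores need not match a raw cosine similarity.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {i: v * (1.0 - alpha) for i, v in sparse.items()}
    return scaled_dense, scaled_sparse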

As for how the query_engine.query(qry_string) method calculates similarity scores, it's not directly shown in the provided context. However, the context does include a method named similarity within the BaseEmbedding class, which is likely used for calculating similarity scores between embeddings. This method delegates the actual similarity calculation to a standalone similarity function defined outside the class. The calculation depends on the mode specified: Euclidean Distance, Dot Product, or Cosine Similarity (Default). You can find more details about these calculations in the llama-index-core/llama_index/core/base/embeddings/base.py file.
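
For illustration, a simplified reimplementation of such a mode-dependent similarity function might look like this (a sketch, not the library's exact code):

import numpy as np

def similarity(embedding1, embedding2, mode="cosine"):
    e1, e2 = np.asarray(embedding1), np.asarray(embedding2)
    if mode == "euclidean":
        # Negated so that a higher score always means "more similar"
        return -float(np.linalg.norm(e1 - e2))
    if mode == "dot_product":
        return float(np.dot(e1, e2))
    # Default: cosine similarity
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))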

Regarding how the query_engine.query(qry_string) method handles metadata filtering, I wasn't able to find specific details within the repository; the answer may be documented elsewhere, or I could have missed it. If you can share the specific metadata filters you're using, along with any other relevant details, that would help in providing a more accurate answer.

logan-markewich commented 3 months ago

@xyzgorba what vector db are you using? The default? Something else? Many vector dbs use approximate nearest-neighbor methods to speed up similarity search rather than computing raw cosine similarity.
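
One quick check is to recompute the similarity yourself for the nodes the retriever actually returns and compare with the reported scores. A minimal sketch, assuming index and qry_string from your setup:

from llama_index.core import Settings

retriever = index.as_retriever(similarity_top_k=5)
query_embedding = Settings.embed_model.get_query_embedding(qry_string)

for result in retriever.retrieve(qry_string):
    emb = Settings.embed_model.get_text_embedding(result.node.text)
    recomputed = Settings.embed_model.similarity(emb, query_embedding)
    # A persistent gap here points at either approximate scoring in the
    # vector db or a mismatch in the text that was actually embedded
    print(f"reported: {result.score:.4f}  recomputed: {recomputed:.4f}")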

logan-markewich commented 3 months ago

@xyzgorba Also, embeddings are calculated from a different text attribute than node.text:

# get_content(metadata_mode="embed") includes embed-visible metadata, unlike node.text
embedding = Settings.embed_model.get_text_embedding(node.get_content(metadata_mode="embed"))
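
A quick way to inspect the difference for a given node (MetadataMode lives in llama_index.core.schema):

from llama_index.core.schema import MetadataMode

# Raw text only -- equivalent to node.text
print(node.get_content(metadata_mode=MetadataMode.NONE))

# Text plus any metadata not listed in node.excluded_embed_metadata_keys;
# this is the string the index actually embeds
print(node.get_content(metadata_mode=MetadataMode.EMBED))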

xyzgorba commented 3 months ago

@logan-markewich - thank you for the answer. Yes, it was the default vector db, and thank you for pointing out that embeddings are calculated from a different text attribute. In my case the problem is not just that the scores differ, but that much less relevant nodes were retrieved as a result, compared to a raw cosine similarity ranking.