zhichenggeng opened 2 months ago
Several earlier issues touch on the relevance score problem: https://github.com/langchain-ai/langchain/issues/9519, https://github.com/langchain-ai/langchain/issues/22209, and https://github.com/langchain-ai/langchain/issues/14948. The normalize_L2 problem, however, has never been discussed. I suspect that's because many embedding models normalize their embeddings, while the Titan embedding model does not.
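As a quick check of that claim, here is a minimal sketch (assumptions: `BedrockEmbeddings` from `langchain_community` with the `amazon.titan-embed-text-v1` model id, since the original embedding client isn't shown here):

```python
import numpy as np
from langchain_community.embeddings import BedrockEmbeddings

# Assumed setup: Titan text embeddings via Bedrock.
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")

vec = np.array(embeddings.embed_query("I like apples"))
# A model that normalizes its outputs would print ~1.0 here;
# Titan returns unnormalized vectors, so the norm is much larger.
print(np.linalg.norm(vec))
```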
Running the code in the Example Code section below gives the following output:

```
[(Document(page_content='I like apples'), 388.789)]
[(Document(page_content='I like apples'), 1.0000002)]
[(Document(page_content='I like apples'), -2.384185791015625e-07)]
```
If this behavior is truly unexpected, I could help submit a PR to solve this.
Checked other resources
Example Code
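The exact original snippet is not preserved here; below is a minimal reconstruction consistent with the three output lines above (assumptions: `BedrockEmbeddings` with the Titan model id, and the text/query taken from the printed results):

```python
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
texts = ["I like apples"]
query = "I like apples"

# 1. Default index (L2 distance on unnormalized Titan vectors):
#    the raw score is large and hard to interpret.
db = FAISS.from_texts(texts, embeddings)
print(db.similarity_search_with_score(query))

# 2. Normalize the vectors and use inner product, i.e. cosine
#    similarity. This combination triggers the warning, even though
#    it is exactly how FAISS documents computing cosine similarity.
db_cos = FAISS.from_texts(
    texts,
    embeddings,
    normalize_L2=True,
    distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT,
)
print(db_cos.similarity_search_with_score(query))

# 3. The relevance score defies intuition: an identical
#    query/document pair comes out near 0 instead of near 1.
print(db_cos.similarity_search_with_relevance_scores(query))
```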
Error Message and Stack Trace (if applicable)
No response
Description
According to FAISS, cosine similarity can be obtained by L2-normalizing the vectors first and then building the index with inner product. However, running the code above with normalize_L2 = True emits a warning that normalizing L2 is not applicable for the chosen metric type.

Moreover, the relevance score is counterintuitive. If we are computing cosine similarity, the relevance score from similarity_search_with_relevance_scores should be identical to the score from similarity_search_with_score. Instead, the implementation yields a smaller relevance score when two vectors are closer: https://github.com/langchain-ai/langchain/blob/fd546196ef0fafa4a4cd7bb7ebb1771ef599f372/libs/core/langchain_core/vectorstores/base.py#L422-L427
System Information
Package Information