langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
88.33k stars 13.86k forks

similarity_score_threshold isn't working for MongoDB Atlas Vector Search #18365

Open Vishnu-add opened 4 months ago

Vishnu-add commented 4 months ago

Example Code

vector_search = MongoDBAtlasVectorSearch.from_connection_string(
    uri,
    DB_NAME + "." + COLLECTION_NAME,
    embeddings,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    relevance_score_fn='cosine',
)

qa_retriever = vector_search.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={'score_threshold': 0.5},
)

Error Message and Stack Trace (if applicable)

UserWarning: No relevant docs were retrieved using the relevance score threshold 0.5
  warnings.warn()

Description

I'm using MongoDBAtlasVectorSearch with search_type set to similarity_score_threshold, but the retriever always returns an empty list. Documents are returned only when score_threshold is set to 0.0.

System Info

System Information

OS: Windows
OS Version: 10.0.22631
Python Version: 3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)]

Package Information

langchain_core: 0.1.22
langchain: 0.1.4
langchain_community: 0.0.19
langsmith: 0.0.87
langchain_cli: 0.0.21
langchain_openai: 0.0.5
langserve: 0.0.41

RemcoGoy commented 3 months ago

I'm experiencing the same issue. I've been digging a bit deeper into this (or at least trying to), and I've also noticed that queries or statements that have nothing to do with the knowledge base consistently score higher than relevant queries. Looking at the code, the 'cosine' distance is normalized by computing 1 - distance, but I suspect that the "distance" returned by MongoDB is already a score and not a distance. The normalization would therefore invert a high score into a very low one.

These are all assumptions and I'm looking for someone from LangChain/MongoDB Atlas to confirm these findings for me/us.
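The suspected inversion can be sketched in a few lines. This is only an illustration of the arithmetic, not the LangChain source: the function names and the sample scores are made up, and the `>=` threshold filter is a simplified assumption about how the retriever applies score_threshold.

```python
def cosine_relevance_from_distance(distance: float) -> float:
    """Normalization described above: treat the input as a distance and return 1 - distance."""
    return 1.0 - distance


def filter_by_threshold(scores: list[float], threshold: float) -> list[float]:
    """Keep only the relevance scores at or above the threshold (simplified assumption)."""
    return [s for s in scores if s >= threshold]


# Hypothetical values returned by Atlas. If these are already similarity
# scores in [0, 1], then higher means more relevant.
atlas_scores = [0.93, 0.88, 0.61]

# Applying the distance normalization to something that is already a score
# flips the ranking and pushes good matches toward 0.
inverted = [cosine_relevance_from_distance(s) for s in atlas_scores]

kept_after_inversion = filter_by_threshold(inverted, 0.5)  # nothing survives
kept_raw = filter_by_threshold(atlas_scores, 0.5)          # all three survive
kept_at_zero = filter_by_threshold(inverted, 0.0)          # only 0.0 lets docs through
```

If this assumption holds, it would match the behaviour in the original report: a threshold of 0.5 filters out every document, while 0.0 returns them all.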

Honda-a commented 2 months ago

I have made a workaround for this that anyone can implement:

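from typing import Any

from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_core.documents import Document


class FixedMongoDBAtlasVectorSearch(MongoDBAtlasVectorSearch):
    def _similarity_search_with_relevance_scores(
        self,
        query: str,
        k: int = 4,
        **kwargs: Any,
    ) -> list[tuple[Document, float]]:
        # Return the scores from MongoDB as-is, skipping the 1 - distance normalization.
        docs_and_scores = self.similarity_search_with_score(query, k, **kwargs)
        return docs_and_scores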

As @RemcoGoy mentioned above, the problem is that VectorStore normalizes the value as if it were a distance, whereas in MongoDB's case it is already a score between 0 and 1. It can be fixed by simply returning the score retrieved from MongoDB without doing any normalization on it.

hveigz commented 2 months ago

As per their documentation: score = (1 + cosine/dot_product(v1,v2)) / 2

https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-stage/#atlas-vector-search-score

Does anyone know if we can get un-normalized scores from the source instead of this?
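Assuming the formula quoted above is exactly what Atlas applies for cosine and dot product, the raw similarity could be recovered client-side by inverting it: similarity = 2 * score - 1. A minimal sketch (the helper names are hypothetical, not part of LangChain or the MongoDB driver):

```python
def atlas_normalize(similarity: float) -> float:
    """Formula quoted from the Atlas docs: score = (1 + similarity) / 2."""
    return (1.0 + similarity) / 2.0


def atlas_unnormalize(score: float) -> float:
    """Inverse of the formula above: similarity = 2 * score - 1."""
    return 2.0 * score - 1.0


# A normalized score of 0.75 corresponds to a raw cosine similarity of 0.5.
raw = atlas_unnormalize(0.75)

# Round trip: normalizing and then un-normalizing should give back the input.
round_trip = atlas_unnormalize(atlas_normalize(-0.2))
```

This only rescales the value returned by the search, so it does not change the ranking of the documents. Any threshold would have to be chosen on the same scale as the scores it is compared against.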