`get_relevant_documents` of Chroma retriever uses cosine distance instead of cosine similarity as similarity score

labdmitriy commented 1 year ago

System Info

LangChain version: 0.0.205
Platform: Ubuntu 20.04 LTS
Python version: 3.10.4

Who can help?

No response

Information

[ ] The official example notebooks/scripts
[X] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[ ] Document Loaders
[X] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

Steps to reproduce

Reproduce the "Similarity Score Threshold Retrieval" section of the "Vector store-backed retriever" tutorial, using Chroma instead of FAISS as the vector store. The results are incorrect: only the less relevant documents are returned instead of the most relevant ones.

Possible reason

db.get_relevant_documents() calls db.similarity_search_with_relevance_scores() when search_type="similarity_score_threshold".
The docstring of db.similarity_search_with_relevance_scores() says:

> Return docs and relevance scores, normalized on a scale from 0 to 1. 0 is dissimilar, 1 is most similar.

db.similarity_search_with_relevance_scores() ultimately calls db.similarity_search_with_score(), whose docstring says:

> Run similarity search with Chroma with distance. ... Lower score represents more similarity.

So when score_threshold is applied in db.similarity_search_with_relevance_scores():
```
docs_and_similarities = [
    (doc, similarity)
    for doc, similarity in docs_and_similarities
    if similarity >= score_threshold
]
```
The filter then retains only the less relevant docs rather than the most relevant ones, because the score being compared is a cosine distance (lower means more similar), not a similarity.
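To make the failure mode concrete, here is a minimal sketch in plain Python. The documents and distance values are made up for illustration and are not real Chroma output; the only thing taken from the code above is the `>=` comparison.

```python
# Hypothetical (document, cosine distance) pairs; a lower distance means more similar.
docs_and_scores = [
    ("most relevant", 0.10),
    ("somewhat relevant", 0.45),
    ("barely relevant", 0.90),
]
score_threshold = 0.5

# Same ">=" filter as above, applied to distances that are treated as similarities.
kept = [doc for doc, score in docs_and_scores if score >= score_threshold]
print(kept)  # ['barely relevant']
```

Only the document farthest from the query survives the filter, which matches the behavior described in this issue.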

Related issues

4517
6046

Expected behavior

Cosine similarity instead of cosine distance must be used as similarity score.
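A rough sketch of the expected behavior, assuming the store returns cosine distance defined as 1 - cosine similarity. The helper name `cosine_distance_to_similarity` and the sample values are hypothetical and not part of LangChain's API.

```python
def cosine_distance_to_similarity(distance: float) -> float:
    """Convert a cosine distance (1 - cosine similarity) back to a cosine similarity."""
    return 1.0 - distance


# Hypothetical (document, cosine distance) pairs, as a distance-based search might return them.
docs_and_distances = [
    ("doc A", 0.10),
    ("doc B", 0.45),
    ("doc C", 0.90),
]
score_threshold = 0.5

# Convert to similarities first, then apply the threshold filter.
docs_and_similarities = [
    (doc, cosine_distance_to_similarity(distance))
    for doc, distance in docs_and_distances
]
filtered = [
    (doc, similarity)
    for doc, similarity in docs_and_similarities
    if similarity >= score_threshold
]
print([doc for doc, _ in filtered])  # ['doc A', 'doc B']
```

With the conversion in place, the threshold keeps the closest documents and drops the distant one.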

batmanscode commented 1 year ago

Great catch! Perhaps there's an opportunity to create a new function to return the least relevant docs

olegshirokikh commented 1 year ago

Similar issue with Pinecone Cosine Similarity, which is converted back to distance - filtering out relevant documents when using score_threshold -- https://github.com/langchain-ai/langchain/issues/8207

dsantiago commented 1 year ago

It seems that for Chroma, you should set the distance metric when creating a collection: https://docs.trychroma.com/usage-guide#changing-the-distance-function

The default distance in Chroma is l2. I use this function in my toy project, so I changed it to use cosine distance:

```
from langchain.vectorstores import Chroma


# Save to Chroma
def save_chroma(db_name, docs, embeddings, path="./chroma_db"):
    db = Chroma.from_documents(
        collection_name=db_name,
        documents=docs,
        embedding=embeddings,
        persist_directory=path,
        # Use cosine distance instead of Chroma's default l2
        collection_metadata={"hnsw:space": "cosine"},
    )
    db.persist()
    return db
```
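As a small illustration of why the distance function matters for thresholds, the sketch below compares the two metrics in plain Python. It assumes Chroma's `l2` space is the squared Euclidean distance and that its `cosine` space is 1 - cosine similarity; the helper functions and vectors are hypothetical and do not call Chroma.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance; unbounded above, depends on vector magnitude."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # same direction as the query, different magnitude

print(squared_l2(query, same_direction))       # 4.0
print(cosine_distance(query, same_direction))  # 0.0
```

The same pair of vectors is "far" under squared L2 and identical under cosine distance, so a score_threshold chosen on a 0 to 1 scale only makes sense once the metric is known.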

dsantiago commented 1 year ago

> @adrienohana
>
> Unfortunately @dsantiago's solution does not work currently (collection_metadata is not used anywhere in the code)

Seriously! Omg.

adrienohana commented 1 year ago

> > @adrienohana
> >
> > Unfortunately @dsantiago's solution does not work currently (collection_metadata is not used anywhere in the code)
>
> Seriously! Omg.

Actually, after digging through the docs for a couple of hours, I realised your solution works! When working with Jupyter notebooks, re-running Chroma.from_documents many times without restarting the kernel often leads to a corrupted database, and I kept getting the same retrieved documents no matter which distance function was selected.

My bad, and thanks for sharing!

dsantiago commented 1 year ago

Ohh ok, but I really didn't dig into it. In my mind, if there was a collection_metadata option, it was being sent to Chroma's collection object.

But if it works, we are all good ;D

dosubot[bot] commented 1 year ago

Hi, @labdmitriy! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the issue you raised is regarding the get_relevant_documents function in the Chroma retriever of LangChain. It seems that the function is currently using cosine distance instead of cosine similarity, resulting in less relevant documents being returned. There has been some discussion in the comments about potential solutions, including creating a new function to return the least relevant documents and a related issue with Pinecone Cosine Similarity.

However, it appears that there has been progress in resolving this issue. One user suggested changing the distance metric when creating a collection in Chroma, and another user pointed out that the solution doesn't currently work. But after further investigation, it was discovered that the solution does work. It is important to note that re-running Chroma.from_documents without restarting the Kernel can lead to a corrupted database.

Now, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!
