langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

`get_relevant_documents` of Chroma retriever uses cosine distance instead of cosine similarity as similarity score #6481

Closed: labdmitriy closed this issue 1 year ago

labdmitriy commented 1 year ago

System Info

LangChain version: 0.0.205
Platform: Ubuntu 20.04 LTS
Python version: 3.10.4

Information

Related Components

Reproduction

Steps to reproduce

Possible reason

Related issues

Expected behavior

Cosine similarity, not cosine distance, must be used as the similarity score.
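To make the difference concrete, here is a minimal sketch in plain Python. It assumes the common definition cosine distance = 1 - cosine similarity; the vectors and the names `query`, `docs`, `by_similarity`, and `by_distance_as_score` are made up for illustration and are not taken from LangChain or Chroma.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance under the assumed definition: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical embeddings, chosen only to illustrate the ranking problem.
query = [1.0, 0.0]
docs = {
    "close": [0.9, 0.1],
    "orthogonal": [0.0, 1.0],
}

# Ranking by similarity (higher is better) puts the closest document first.
by_similarity = sorted(
    docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True
)

# Treating the distance as if it were a similarity score inverts the ranking.
by_distance_as_score = sorted(
    docs, key=lambda name: cosine_distance(query, docs[name]), reverse=True
)
```

If a retriever sorts "highest score first" but the score is really a distance, the least relevant documents come out on top, which matches the behavior described in this issue.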

batmanscode commented 1 year ago

Great catch! Perhaps there's an opportunity to create a new function that returns the least relevant docs.

olegshirokikh commented 1 year ago

There is a similar issue with Pinecone: its cosine similarity is converted back to a distance, which filters out relevant documents when score_threshold is used. See https://github.com/langchain-ai/langchain/issues/8207
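A rough sketch of that failure mode, not LangChain's or Pinecone's actual code: the helper names `distance_to_similarity` and `filter_by_score_threshold` and the sample scores are hypothetical, and the conversion assumes distance = 1 - similarity.

```python
def distance_to_similarity(distance):
    """Convert a cosine distance to a similarity (assumes distance = 1 - similarity)."""
    return 1.0 - distance

def filter_by_score_threshold(results, score_threshold):
    """Keep (doc, distance) pairs whose similarity is at least score_threshold."""
    kept = []
    for doc, distance in results:
        similarity = distance_to_similarity(distance)
        if similarity >= score_threshold:
            kept.append((doc, similarity))
    return kept

# Hypothetical (document, cosine distance) pairs returned by a vector store.
raw_results = [("relevant doc", 0.125), ("unrelated doc", 0.875)]

# Thresholding on similarity keeps the relevant document.
kept = filter_by_score_threshold(raw_results, score_threshold=0.5)

# Thresholding directly on the distance keeps the wrong one.
naive = [(doc, d) for doc, d in raw_results if d >= 0.5]
```

The point is only that the threshold has to be compared against the same quantity the caller has in mind; comparing it against a distance silently drops the best matches.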

dsantiago commented 1 year ago

It seems that for Chroma, you should set the distance metric when creating a collection: https://docs.trychroma.com/usage-guide#changing-the-distance-function

The default distance in Chroma is l2. I use this function in my toy project to switch it to cosine distance:

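# Save documents to a Chroma collection that uses cosine distance
from langchain.vectorstores import Chroma

def save_chroma(db_name, docs, embeddings, path="./chroma_db"):
    db = Chroma.from_documents(
        collection_name=db_name,
        documents=docs,
        embedding=embeddings,
        persist_directory=path,
        # Chroma defaults to "l2"; this selects cosine distance instead
        collection_metadata={"hnsw:space": "cosine"},
    )
    db.persist()  # write the collection to disk
    return db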

dsantiago commented 1 year ago

> @adrienohana: Unfortunately @dsantiago's solution does not work currently (collection_metadata is not used anywhere in the code)

Seriously! Omg.

adrienohana commented 1 year ago

Actually, after digging through the docs for a couple of hours, I realised your solution works! When working in Jupyter notebooks, re-running Chroma.from_documents many times without restarting the kernel often leads to a corrupted database, and I kept getting the same retrieved documents no matter which distance function I selected.

My bad, and thanks for sharing!

dsantiago commented 1 year ago

Oh OK, but I really didn't dig into it. In my mind, if there was a collection_metadata option, it was being passed to Chroma's collection object.

But if it works, we are all good ;D

dosubot[bot] commented 1 year ago

Hi, @labdmitriy! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the issue you raised is regarding the get_relevant_documents function in the Chroma retriever of LangChain. It seems that the function is currently using cosine distance instead of cosine similarity, resulting in less relevant documents being returned. There has been some discussion in the comments about potential solutions, including creating a new function to return the least relevant documents and a related issue with Pinecone Cosine Similarity.

However, it appears that there has been progress in resolving this issue. One user suggested changing the distance metric when creating a collection in Chroma, and another user pointed out that the solution doesn't currently work. But after further investigation, it was discovered that the solution does work. It is important to note that re-running Chroma.from_documents without restarting the Kernel can lead to a corrupted database.

Now, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!