Closed labdmitriy closed 1 year ago
Great catch! Perhaps there's an opportunity to create a new function to return the least relevant docs
There is a similar issue with Pinecone: cosine similarity is converted back to a distance, which filters out relevant documents when using score_threshold. See https://github.com/langchain-ai/langchain/issues/8207
It seems that for Chroma, you should set the distance metric when creating a collection: https://docs.trychroma.com/usage-guide#changing-the-distance-function
The default distance in Chroma is l2. I use this function in my toy project, so I changed it to use cosine distance:
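# Save to Chroma
from langchain.vectorstores import Chroma

def save_chroma(db_name, docs, embeddings, path="./chroma_db"):
    # Create the collection with cosine distance instead of the default l2
    db = Chroma.from_documents(collection_name=db_name,
                               documents=docs,
                               embedding=embeddings,
                               persist_directory=path,
                               collection_metadata={"hnsw:space": "cosine"})
    db.persist()  # write the collection to disk
    return db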
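To make the effect of that setting concrete, here is a rough sketch in plain Python of why the metric matters. It assumes, per my reading of the Chroma docs, that "l2" means squared Euclidean distance and "cosine" means 1 minus cosine similarity; the vectors and helper names are made up for illustration and nothing here calls Chroma itself.

```python
import math

def squared_l2(a, b):
    # Assumed definition of Chroma's "l2" space: sum of squared differences
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    # Assumed definition of Chroma's "cosine" space: 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical 2-d "embeddings"
query = [1.0, 0.0]
same_direction = [3.0, 0.0]   # same direction as the query, larger magnitude
orthogonal = [0.0, 1.0]       # unrelated direction, same magnitude

l2_same = squared_l2(query, same_direction)
l2_orth = squared_l2(query, orthogonal)
cos_same = cosine_distance(query, same_direction)
cos_orth = cosine_distance(query, orthogonal)

print("l2:    ", l2_same, l2_orth)
print("cosine:", cos_same, cos_orth)
```

With these toy vectors, l2 ranks the orthogonal vector as closer to the query, while cosine distance ranks the same-direction vector as closer, so the two spaces can return different documents for unnormalized embeddings.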
@adrienohana
Unfortunately @dsantiago's solution does not work currently (collection_metadata is not used anywhere in the code)
Seriously! Omg.
Actually, after digging through the docs for a couple of hours, I realised your solution works! When working with Jupyter notebooks, re-running Chroma.from_documents many times without restarting the kernel often leads to a corrupted database, and I kept getting the same retrieved documents regardless of the selected function.
My bad, and thanks for sharing!
Ohh ok, but I really didn't dig into it. In my mind, if there was a collection_metadata option, it was being sent to Chroma's collection object.
But if it works, we are all good ;D
Hi, @labdmitriy! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
Based on my understanding, the issue you raised is regarding the get_relevant_documents function in the Chroma retriever of LangChain. It seems that the function is currently using cosine distance instead of cosine similarity, resulting in less relevant documents being returned. There has been some discussion in the comments about potential solutions, including creating a new function to return the least relevant documents and a related issue with Pinecone Cosine Similarity.
However, it appears that there has been progress in resolving this issue. One user suggested changing the distance metric when creating a collection in Chroma, and another user pointed out that the solution doesn't currently work. But after further investigation, it was discovered that the solution does work. It is important to note that re-running Chroma.from_documents without restarting the Kernel can lead to a corrupted database.
Now, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.
Thank you for your understanding and contribution to the LangChain project!
System Info
LangChain version: 0.0.205 Platform: Ubuntu 20.04 LTS Python version: 3.10.4
Who can help?
No response
Information
Related Components
Reproduction
Steps to reproduce
Possible reason
db.get_relevant_documents() calls db.similarity_search_with_relevance_scores() for search_type="similarity_score_threshold".
In db.similarity_search_with_relevance_scores() we can see the following description:
db.similarity_search_with_relevance_scores() finally calls db.similarity_search_with_score(), which has the following description:
So when score_threshold is used in db.similarity_search_with_relevance_scores():
Then the filter will retain only the less relevant docs, not the most relevant ones, because cosine distance is used as the similarity score, which is not correct.
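A minimal sketch of the effect, in plain Python with no LangChain calls. The `filter_by_threshold` helper and the sample scores are hypothetical; the `>=` comparison is an assumption about how the threshold is applied, chosen to match the behaviour described above.

```python
def filter_by_threshold(docs_and_scores, score_threshold):
    # Hypothetical stand-in for the threshold filter: keep scores >= threshold
    return [(doc, score) for doc, score in docs_and_scores
            if score >= score_threshold]

# Made-up cosine distances as a store might return them (smaller = closer)
docs_and_distances = [
    ("most relevant", 0.1),
    ("somewhat relevant", 0.4),
    ("least relevant", 0.9),
]

# Distances are passed through as if they were similarity scores
kept = filter_by_threshold(docs_and_distances, score_threshold=0.5)
kept_docs = [doc for doc, _ in kept]
print(kept_docs)  # ['least relevant']
```

Because a larger distance means a worse match, a "keep everything above the threshold" filter applied to raw distances drops the closest documents and keeps the farthest one.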
Related issues
4517
6046
Expected behavior
Cosine similarity, not cosine distance, must be used as the similarity score.
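One possible shape for this, as a sketch only: convert each distance to a similarity before applying the threshold. It assumes the collection really uses cosine distance defined as 1 minus cosine similarity (for l2 this conversion would not be appropriate). The function names and sample data are hypothetical and are not part of LangChain or Chroma.

```python
def cosine_distance_to_similarity(distance):
    # Assumes distance = 1 - cosine similarity
    return 1.0 - distance

def filter_relevant(docs_and_distances, score_threshold):
    # Convert to similarities first, then keep those at or above the threshold
    docs_and_similarities = [
        (doc, cosine_distance_to_similarity(distance))
        for doc, distance in docs_and_distances
    ]
    return [(doc, sim) for doc, sim in docs_and_similarities
            if sim >= score_threshold]

# Made-up cosine distances (smaller = closer)
sample = [
    ("most relevant", 0.1),
    ("somewhat relevant", 0.4),
    ("least relevant", 0.9),
]

relevant_docs = [doc for doc, _ in filter_relevant(sample, score_threshold=0.5)]
print(relevant_docs)  # ['most relevant', 'somewhat relevant']
```

With the conversion in place, the same threshold of 0.5 keeps the two closest documents and drops the farthest one.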