langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Chroma DB : Cannot return the results in a contiguous 2D array #3665

Closed achammah closed 8 months ago

achammah commented 1 year ago

Issue

Sometimes, when running a similarity search through the Chroma DB wrapper, I run into the following error: RuntimeError('Cannot return the results in a contigious 2D array. Probably ef or M is too small')

Some background info:

ChromaDB is a library for performing similarity search on high-dimensional data. It uses an approximate nearest neighbor (ANN) search algorithm called Hierarchical Navigable Small World (HNSW) to find the most similar items to a given query. The parameters ef and M are related to the HNSW algorithm and have an impact on the search quality and performance.

  1. ef: This parameter controls the size of the dynamic search list used by the HNSW algorithm. A higher value for ef results in a more accurate search but slower search speed. A lower value will result in a faster search but less accurate results.
  2. M: This parameter determines the number of bi-directional links created for each new element during the construction of the HNSW graph. A higher value for M results in a denser graph, leading to higher search accuracy but increased memory consumption and construction time.

The error message indicates that either or both of these parameters are too small for the current dataset: the search could not find the requested number of neighbors, so the results cannot be packed into a contiguous 2D array. To resolve this error, you can try increasing the values of ef and M in the ChromaDB configuration or during the search query.

It's important to note that the optimal values for ef and M can depend on the specific dataset and use case. You may need to experiment with different values to find the best balance between search accuracy, speed, and memory consumption for your application.
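
As a rough sketch of the "increase ef and M" advice: some Chroma releases read HNSW settings from collection metadata keys such as `hnsw:M`, `hnsw:construction_ef` and `hnsw:search_ef`. Whether these keys are honored depends on the installed version, so treat them as an assumption to verify. The helper `hnsw_metadata`, its defaults and the `ef_factor` heuristic are hypothetical and only illustrate keeping the search-time ef well above the number of results requested.

```python
def hnsw_metadata(k, m=32, construction_ef=200, ef_factor=4):
    """Build a collection-metadata dict with HNSW settings (hypothetical helper).

    Assumes a Chroma release that reads the "hnsw:*" metadata keys when the
    collection is created; check this against your installed version.
    """
    if k < 1 or m < 2 or ef_factor < 1:
        raise ValueError("k >= 1, m >= 2 and ef_factor >= 1 are required")
    # Keep the search-time ef comfortably above the number of results requested.
    search_ef = max(10, k * ef_factor)
    return {
        "hnsw:M": m,
        # Build the graph at least as carefully as it will be searched.
        "hnsw:construction_ef": max(construction_ef, search_ef),
        "hnsw:search_ef": search_ef,
    }


metadata = hnsw_metadata(k=20)
print(metadata)

# Possible usage (not executed here; requires chromadb or langchain):
#   client.get_or_create_collection("docs", metadata=metadata)
#   Chroma(collection_name="docs", embedding_function=emb,
#          collection_metadata=metadata)
```

These settings generally have to be chosen when the collection is created, so an existing collection may need to be rebuilt for a new M to take effect.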

My proposal

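3 possibilities:

  1. Add optional parameters for ef and M.
  2. Implement a retry system.
  3. Calculate the optimal values within the search function.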

pseudotensor commented 12 months ago

I see this too. It seems to occur arbitrarily, even with the same document chunks and parameters. So it's not just about optimal choices; there seems to be some bug in Chroma.

https://github.com/h2oai/h2ogpt/issues/505

pseudotensor commented 12 months ago

This is affecting a lot of users of h2oGPT. Is there any way to help figure this out? Thanks.

utrerf commented 11 months ago

Same here. I get why it happens, but it would be great to have chroma handle this gracefully.

dosubot[bot] commented 8 months ago

Hi, @achammah! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding of the issue, the ChromaDB wrapper for similarity search is returning an error message indicating that the results cannot be returned in a contiguous 2D array. Some possible solutions that have been discussed include adding optional parameters for ef and M, implementing a retry system, or calculating the optimal values within the search function. Other users, such as @pseudotensor and @utrerf, have also reported experiencing this issue and are seeking assistance in resolving it.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation. We look forward to hearing from you soon.

ahmedmuzammilAI commented 2 weeks ago

Hi, I am working with ChromaDB for facial recognition and I am facing the same issue. I store multiple face embeddings per person so that the features of a face are captured from multiple angles. I would really appreciate suggestions on how to find the best values for M and ef. Of the 3 possibilities suggested by @achammah, how can I make the second and third work in Python? Any help is appreciated, as I'm new to this approach.

Also, please feel free to suggest a better approach for facial recognition if you have one. Thanks in advance.
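
One way the retry idea discussed in this thread could look in Python is sketched below. It is not part of LangChain or Chroma: `search_with_fallback` is a hypothetical helper that wraps any `(query, k)` search callable and halves k whenever the HNSW error is raised, on the assumption that returning fewer results is acceptable. `fake_search` only stands in for a real index so the sketch runs on its own.

```python
def search_with_fallback(search, query, k, min_k=1):
    """Call search(query, k), halving k while the HNSW array error is raised.

    `search` is any callable taking (query, k), for example a thin wrapper
    around a vector store's similarity search. Hypothetical helper.
    """
    while True:
        try:
            return search(query, k)
        except RuntimeError as exc:
            # Match the message prefix; the library spells it "contigious".
            is_hnsw_error = "Cannot return the results in a contig" in str(exc)
            if not is_hnsw_error or k <= min_k:
                raise  # unrelated error, or nothing left to reduce
            k = max(min_k, k // 2)


# Stand-in for a real index that can only return up to 3 neighbors.
def fake_search(query, k):
    if k > 3:
        raise RuntimeError(
            "Cannot return the results in a contigious 2D array. "
            "Probably ef or M is too small"
        )
    return [f"{query}-hit-{i}" for i in range(k)]


results = search_with_fallback(fake_search, "face", k=10)
print(len(results))  # k goes 10 -> 5 -> 2, so 2 results come back
```

With a LangChain vector store held in a variable such as `db` (an assumption), the call could be `search_with_fallback(lambda q, k: db.similarity_search(q, k=k), "query text", k=10)`. Reducing k only works around the error; raising ef or M addresses the cause.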