Open JINO-ROHIT opened 1 month ago
I found a similar issue: [Question]: responses from chromadb as vector store and qdrant are different, which is closed and was last updated on September 16, 2024.
The suggested solution is: Yes, it is possible for the responses from ChromaDB and Qdrant to be different due to the similarity metric used. The similarity metric affects the ranking of the nodes based on their similarity scores.
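As a rough illustration of why the metric matters, the sketch below ranks the same three toy vectors under cosine similarity and Euclidean distance. It uses plain Python rather than either vector store; the vectors, document names, and helper functions are made up for the example and are not part of LlamaIndex, Qdrant, or Chroma.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Higher is better; ignores vector length.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # Lower is better; sensitive to vector length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank(query, docs, score, higher_is_better):
    # Return document names ordered from best to worst match.
    return sorted(docs, key=lambda name: score(query, docs[name]),
                  reverse=higher_is_better)

# Hypothetical, unnormalized embeddings.
query = [1.0, 0.0]
docs = {
    "a": [10.0, 1.0],  # nearly the same direction as the query, but far away
    "b": [1.0, 1.0],
    "c": [0.0, 1.0],
}

cosine_ranking = rank(query, docs, cosine_similarity, higher_is_better=True)
euclidean_ranking = rank(query, docs, euclidean_distance, higher_is_better=False)

print(cosine_ranking)
print(euclidean_ranking)
```

With unnormalized vectors the two metrics disagree on the best match here, so two stores configured with different metrics can return different top-k nodes for the same data.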
In Qdrant, the similarity metric is the distance function configured on the collection when it is created; it is not set in the `search_batch` and `search` methods of the `QdrantVectorStore` class. The `limit` parameter those methods pass, which is set to `query.similarity_top_k`, only controls how many results are returned.
There isn't a direct parameter to check which similarity metric is being used, but you can infer it from the configuration and the methods being used in the code [1].
Additionally, there are default parameters in the `QdrantVectorStore` and `ChromaDBVectorStore` classes that could lead to different responses. For `QdrantVectorStore`:
- `batch_size` defaults to 64.
- `parallel` defaults to 1.
- `max_retries` defaults to 3.
- `enable_hybrid` defaults to False.

These default parameters can affect the behavior and performance of the vector store, such as the number of points uploaded in a single request, the number of parallel processes used during upload, the number of retries in case of failure, and whether hybrid search using dense and sparse vectors is enabled [2][3].
For `ChromaDBVectorStore`, the default parameters are not explicitly mentioned, but similar configurations like batch size, parallel processing, and retry mechanisms could also influence its behavior.
To continue talking to Dosu, mention @dosu.
@dosu i have changed the similarity metric to match chromadb, now whats the issue?
The issue could be related to the need for a similarity cutoff in the query method. Ensure you are using the `similarity_top_k` parameter correctly. If the results are still not reasonable, check the response methods such as `response.source_nodes` and `response.get_formatted_sources()` to ensure all relevant documents are being considered.
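One way to make "the results differ" concrete is to compare the node IDs each store returned for the same query. The helper below is a hypothetical sketch, not a LlamaIndex function: it takes two plain lists of IDs, which are assumed to have been collected beforehand (for example from the entries of `response.source_nodes`), and reports the overlap.

```python
def compare_retrievals(ids_a, ids_b):
    """Summarize how two ranked lists of node IDs differ."""
    set_a, set_b = set(ids_a), set(ids_b)
    shared = set_a & set_b
    union = set_a | set_b
    return {
        "shared": sorted(shared),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
        # Jaccard overlap: 1.0 means the same set of nodes was retrieved.
        "jaccard": len(shared) / len(union) if union else 1.0,
        # True only if both stores returned the same nodes in the same order.
        "same_order": list(ids_a) == list(ids_b),
    }

# Hypothetical node IDs standing in for real retrieval results.
chroma_ids = ["n1", "n2", "n3"]
qdrant_ids = ["n2", "n1", "n4"]

report = compare_retrievals(chroma_ids, qdrant_ids)
print(report)
```

If the sets match and only the order differs, the gap is probably in scoring or tie-breaking; if the sets themselves differ, the metric, the index, or the ingested data are more likely suspects.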
Additionally, verify that the metadata filters are correctly translated to Chroma-specific specifications. Functions like `_transform_chroma_filter_condition` and `_transform_chroma_filter_operator` handle this translation. Incorrect filter conditions or operators could lead to different responses.
Ensure the collection is properly initialized and that nodes are being added and queried correctly. Methods such as `add`, `delete`, `delete_nodes`, `clear`, `query`, and `_query` are crucial for managing and querying the Chroma collection.
Finally, confirm that the ChromaDB instance is correctly set up and accessible. Follow the setup guide for starting ChromaDB locally using Docker and running tests to ensure the connection is established:
# Start chromadb locally
# cd tests
# docker-compose up
#
# Run tests
# cd tests/vector_stores
# pytest test_chromadb.py
Make sure your ChromaDB instance is running correctly and that the connection parameters are properly configured [1][2][3][4][5].
@dosu im not using any metadata filters or anything
The responses between ChromaDB and Qdrant can still differ even without using any metadata filters due to differences in their internal implementations and configurations. Specifically, Qdrant has options for hybrid search using both dense and sparse vectors, which can affect the results. Additionally, the way Qdrant handles vector storage, indexing, and querying might differ from ChromaDB, leading to variations in the responses [1][2].
For example, Qdrant allows for hybrid search configurations, which can influence the results even if the similarity metric is the same. Ensure that you have accounted for these differences in your setup.
@dosu i dont have a hybrid setup
Even without a hybrid setup, the responses between ChromaDB and Qdrant can still differ due to Qdrant's additional configurations and functionalities. Qdrant supports hybrid search using both dense and sparse vectors, and even if hybrid search is not explicitly enabled, the presence of these configurations can influence how vectors are stored and queried, leading to different results compared to ChromaDB [1].
Additionally, there are known issues in Qdrant where in-memory data is not shared between async and sync clients, which could lead to inconsistencies in search results depending on the type of client used [2][1].
@JINO-ROHIT it's possible. Qdrant and Chroma take different approaches to performing vector search.
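To illustrate how the search approach alone can change results, here is a toy sketch of approximate search over a neighbor graph compared with exact brute-force search. It is not the algorithm used by Qdrant or Chroma; the points, the graph, and the function names are invented for the example, and real indexes are far more elaborate.

```python
import math

# Hypothetical 2D points and a hand-built neighbor graph.
points = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (0.0, 3.0), "D": (3.0, 0.5)}
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}

def exact_nearest(query):
    # Brute force: check every point.
    return min(points, key=lambda name: math.dist(points[name], query))

def greedy_nearest(query, entry):
    # Greedy graph walk: move to the closest neighbor until none is closer.
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best

query = (3.0, 0.0)
exact = exact_nearest(query)
approx = greedy_nearest(query, entry="A")

print(exact, approx)
```

The greedy walk stops at a local optimum because the true nearest point is only reachable through a node that is farther from the query. Two stores that build their indexes differently can therefore return different neighbors even with the same metric and the same data.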
Question Validation
Question
Is it possible for the nodes and responses to be different for chromadb and qdrant with the same similarity metric, chunking and ingestion?