Open LikhithRishi opened 1 week ago
The approach of using all_nodes (nodes + sub_nodes) works for ElasticSearch embeddings but not for ChromaDB embeddings due to differences in how these systems handle the addition and querying of embeddings.
Why does the approach of using all_nodes (nodes + sub_nodes) work for ElasticSearch embeddings but not for ChromaDB embeddings?
all_nodes with ChromaDB.
Why do we get a query ID error when attempting to reuse embeddings with all_nodes in ChromaDB?
MAX_CHUNK_SIZE of 41665, and exceeding this limit can lead to errors. Additionally, if the _collection attribute is not properly initialized or if there are issues with metadata filters, it can result in query ID errors [1][2].
Is there a specific reason we cannot use all_nodes with ChromaDB, and if so, could you provide guidance on how to properly manage this?
@pytest.mark.asyncio()
@pytest.mark.parametrize("use_async", [True, False])
async def test_add_to_chromadb_and_query(
    vector_store: ChromaVectorStore,
    node_embeddings: List[TextNode],
    use_async: bool,
) -> None:
    if use_async:
        await vector_store.async_add(node_embeddings)
        res = await vector_store.aquery(
            VectorStoreQuery(query_embedding=[1.0, 0.0, 0.0], similarity_top_k=1)
        )
    else:
        vector_store.add(node_embeddings)
        res = vector_store.query(
            VectorStoreQuery(query_embedding=[1.0, 0.0, 0.0], similarity_top_k=1)
        )
    assert res.nodes
    assert res.nodes[0].get_content() == "lorem ipsum"
Additionally, ensure that the ChromaVectorStore is correctly initialized with the necessary parameters and that the MAX_CHUNK_SIZE is not exceeded [3][4].
By following these guidelines, you can manage the embeddings correctly and avoid the query ID error with ChromaDB.
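As a rough illustration of staying under the chunk limit, the sketch below splits a list of items into batches no larger than the MAX_CHUNK_SIZE figure quoted earlier in this thread. The helper name iter_batches is hypothetical and is not part of LlamaIndex or ChromaDB; the vector store may already batch internally, in which case this is only needed when calling the underlying collection directly.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Assumption: limit taken from the figure quoted in this thread.
MAX_CHUNK_SIZE = 41665


def iter_batches(
    items: Sequence[T], max_chunk_size: int = MAX_CHUNK_SIZE
) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `max_chunk_size` long."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    for start in range(0, len(items), max_chunk_size):
        yield list(items[start : start + max_chunk_size])


# Example: 100,000 placeholder items are split into three batches.
batches = list(iter_batches(list(range(100_000))))
batch_sizes = [len(batch) for batch in batches]
```

Each batch could then be passed to a separate add call, so that no single call exceeds the limit.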
Question
I encountered an issue with the usage of nodes in VectorStoreIndex when working with ElasticSearch embeddings and ChromaDB embeddings.
Context:
In the case of dense_x retrieval using ElasticSearch embeddings, we use all_nodes (nodes + sub_nodes) in the VectorStoreIndex:
This approach works perfectly; we can reuse embeddings and get the expected responses.
However, when using ChromaDB embeddings, if we use only nodes in the VectorStoreIndex, it works fine:
But, if we attempt to use all_nodes (nodes + sub_nodes) as we do with ElasticSearch, it only works for the first time (i.e., during the creation of embeddings). When we try to reuse those embeddings to fetch a response, we encounter the following error:
ValueError: Query id d28e3de4-1c4f-420a-9184-97bf8556b11b not found in either retriever_dict or query_engine_dict.
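One plausible reading of this error, offered as an assumption rather than a confirmed diagnosis, is that the retrieved sub_nodes carry an id that must be resolved through an in-memory mapping, and that this mapping exists on the first run but is not rebuilt when the embeddings are reloaded. The sketch below is a simplified, self-contained model of such a lookup. The function resolve_reference and the sample ids are hypothetical and do not reproduce the library's code.

```python
from typing import Any, Dict


def resolve_reference(
    query_id: str,
    node_dict: Dict[str, Any],
    retriever_dict: Dict[str, Any],
    query_engine_dict: Dict[str, Any],
) -> Any:
    """Hypothetical stand-in: resolve an id against in-memory registries."""
    for registry in (node_dict, retriever_dict, query_engine_dict):
        if query_id in registry:
            return registry[query_id]
    raise ValueError(
        f"Query id {query_id} not found in either retriever_dict or query_engine_dict."
    )


# First run: the id-to-node mapping was built alongside the embeddings.
node_dict = {"chunk-1": "parent chunk text"}
first_run = resolve_reference("chunk-1", node_dict, {}, {})

# Later run: embeddings are reloaded, but the mapping was never rebuilt.
try:
    resolve_reference("chunk-1", {}, {}, {})
    reuse_error = ""
except ValueError as exc:
    reuse_error = str(exc)
```

If this reading applies, rebuilding the same id-to-node mapping from all_nodes before querying the reloaded index would be the first thing to check.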
Questions: