LikhithRishi opened 3 months ago
To add new document embeddings to the existing embeddings in chroma_db, you can use the add method of the ChromaVectorStore class. This method appends new embeddings to the existing collection. Here is a step-by-step guide with code snippets:
import chromadb
from llama_index.legacy.vector_stores import ChromaVectorStore
# Initialize the Chroma client and collection
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
Assuming you have already processed the new documents and created new nodes, you can attach embeddings to those nodes and add them to the store. Here's an example of how to do this:
from llama_index.legacy.schema import NodeRelationship, RelatedNodeInfo, TextNode
from llama_index.legacy.vector_stores.query import VectorStoreQuery

# Define new nodes with their embeddings
new_nodes = [
    TextNode(
        text="New text 1",
        id_="new-id-1",
        relationships={NodeRelationship.SOURCE: RelatedNodeInfo(node_id="new-test-1")},
        metadata={"author": "New Author 1"},
        embedding=[0.1, 0.2, 0.3],  # Example embedding
    ),
    TextNode(
        text="New text 2",
        id_="new-id-2",
        relationships={NodeRelationship.SOURCE: RelatedNodeInfo(node_id="new-test-2")},
        metadata={"author": "New Author 2"},
        embedding=[0.4, 0.5, 0.6],  # Example embedding
    ),
]

# Add new nodes to the existing vector store
vector_store.add(new_nodes)

# Optionally, you can query to verify the addition
res = vector_store.query(
    VectorStoreQuery(query_embedding=[0.1, 0.2, 0.3], similarity_top_k=1)
)
assert res.nodes
print(res.nodes[0].get_content())  # Should print "New text 1"
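If you also keep a LlamaIndex index on top of this vector store, you can rebuild the index view over the updated collection so retrieval sees both the old and the new embeddings. A minimal sketch, assuming the legacy VectorStoreIndex.from_vector_store helper and a configured embedding model for query-time embedding:

from llama_index.legacy import VectorStoreIndex

# Wrap the updated Chroma collection; no re-embedding happens here,
# the index simply reads from the existing vector store
index = VectorStoreIndex.from_vector_store(vector_store)
retriever = index.as_retriever(similarity_top_k=2)
results = retriever.retrieve("New text 1")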
If you need to handle asynchronous operations, you can use the async_add and aquery methods:
import pytest
from llama_index.legacy.vector_stores.query import VectorStoreQuery

@pytest.mark.asyncio()
async def test_add_to_chromadb_and_query(vector_store: ChromaVectorStore, use_async: bool) -> None:
    new_nodes = [
        TextNode(
            text="New text 1",
            id_="new-id-1",
            relationships={NodeRelationship.SOURCE: RelatedNodeInfo(node_id="new-test-1")},
            metadata={"author": "New Author 1"},
            embedding=[0.1, 0.2, 0.3],
        ),
        TextNode(
            text="New text 2",
            id_="new-id-2",
            relationships={NodeRelationship.SOURCE: RelatedNodeInfo(node_id="new-test-2")},
            metadata={"author": "New Author 2"},
            embedding=[0.4, 0.5, 0.6],
        ),
    ]
    if use_async:
        await vector_store.async_add(new_nodes)
        res = await vector_store.aquery(
            VectorStoreQuery(query_embedding=[0.1, 0.2, 0.3], similarity_top_k=1)
        )
    else:
        vector_store.add(new_nodes)
        res = vector_store.query(
            VectorStoreQuery(query_embedding=[0.1, 0.2, 0.3], similarity_top_k=1)
        )
    assert res.nodes
    print(res.nodes[0].get_content())  # Should print "New text 1"

# Example usage
# vector_store = ...  # Initialize your ChromaVectorStore instance
# await test_add_to_chromadb_and_query(vector_store, use_async=True)
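Outside of a pytest run, the same coroutine can be driven with asyncio directly; a short usage sketch:

import asyncio

# Run the async variant against an already-initialized ChromaVectorStore
asyncio.run(test_add_to_chromadb_and_query(vector_store, use_async=True))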
This approach ensures that both the previous and new embeddings are stored in the chroma_db database [1][2].
@dosu I have a set of new nodes created from newly added documents. Below is the code I used for creating new_nodes for the new documents:
from llama_index.legacy.node_parser import SentenceSplitter
from llama_index.legacy.schema import IndexNode

# Filter out only the newly added documents
new_documents = [doc for doc in documents if doc.metadata['file_name'] in new_files]

# Process new documents
node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=100)
base_nodes = node_parser.get_nodes_from_documents(new_documents)

for idx, node in enumerate(base_nodes):
    node.id_ = f"node-{len(all_nodes_dict) + idx}"

sub_chunk_sizes = [128, 256, 512]
sub_node_parsers = [
    SentenceSplitter(chunk_size=c, chunk_overlap=100) for c in sub_chunk_sizes
]

new_nodes = []
for base_node in base_nodes:
    for n in sub_node_parsers:
        sub_nodes = n.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        new_nodes.extend(sub_inodes)

    # Add the original node
    original_node = IndexNode.from_text_node(base_node, base_node.node_id)
    new_nodes.append(original_node)
I need to create embeddings for these new_nodes. How should I go about generating and handling these embeddings? Specifically, I would like to know:
How to use my embedding model to create embeddings for each new_node so they can be added to an existing ChromaDB collection.
To generate embeddings for your new_nodes and add them to an existing ChromaDB collection, follow these steps:
import chromadb
from llama_index.legacy.vector_stores import ChromaVectorStore
# Initialize the Chroma client and collection
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
Assuming you have an embedding model, you can generate embeddings for each new_node. Here's an example using a hypothetical embed_model:
from llama_index.embeddings.openai import OpenAIEmbedding

# Initialize your embedding model
embed_model = OpenAIEmbedding()

# Generate embeddings for new nodes
for node in new_nodes:
    node.embedding = embed_model.get_text_embedding(node.text)

# Add new nodes to the existing vector store
vector_store.add(new_nodes)
# Optionally, you can query to verify the addition
from llama_index.legacy.vector_stores.query import VectorStoreQuery
query_embedding = embed_model.get_text_embedding("sample query text")
res = vector_store.query(VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=1))
print(res.nodes[0].get_content()) # Should print the content of the most similar node
This approach ensures that the new nodes with their embeddings are added to the existing ChromaDB collection, allowing you to query both the old and new embeddings.
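If new_nodes is large, embedding one node per request can be slow. LlamaIndex embedding models also expose a batch helper; a sketch assuming the same embed_model and its get_text_embedding_batch method:

# Embed all node texts in batches instead of one request per node
texts = [node.get_content() for node in new_nodes]
embeddings = embed_model.get_text_embedding_batch(texts, show_progress=True)
for node, embedding in zip(new_nodes, embeddings):
    node.embedding = embedding

vector_store.add(new_nodes)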
For more details on generating embeddings using the embedding model in the LlamaIndex framework, refer to the resolve_embed_model function [1][2].
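For illustration, resolve_embed_model turns a string spec into a concrete embedding model; a sketch assuming the legacy import path:

from llama_index.legacy.embeddings.utils import resolve_embed_model

# "local:<model_name>" loads a local HuggingFace model;
# "default" resolves to the default OpenAI embedding model
embed_model = resolve_embed_model("local:BAAI/bge-small-en-v1.5")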
@dosu 2. How do we delete the embeddings of a particular document stored in ChromaDB? We are able to delete nodes related to a specific document from the in-memory all_nodes_dict using the following code:
import os

async def delete_embeddings(all_nodes_dict, filepath):
    filename = os.path.basename(filepath)
    print("Current all_nodes_dict:", all_nodes_dict)

    # Identify the node IDs to delete based on the filename
    delete_node_ids = [
        node.node_id
        for node in all_nodes_dict.values()
        if node.metadata.get('file_name') == filename
    ]
    print("Node IDs to delete:", delete_node_ids)

    # Delete nodes from the in-memory dictionary
    for node_id in delete_node_ids:
        del all_nodes_dict[node_id]
However, we are unsure how to reflect this deletion in the stored ChromaDB so that the embeddings of that document are removed from ChromaDB as well.
To delete the embeddings of a particular document from the stored ChromaDB, you can use the delete method of the ChromaVectorStore class. This method allows you to specify the ref_doc_id of the document whose embeddings you want to delete. Here's an example:
def delete_document_embeddings(vector_store: ChromaVectorStore, ref_doc_id: str, query_embedding: list) -> None:
    # Delete all nodes associated with the specified document ID
    vector_store.delete(ref_doc_id=ref_doc_id)

    # Optionally, verify the deletion (Chroma requires a real, non-empty
    # query embedding, so one is passed in rather than an empty list)
    res = vector_store.query(VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=10))
    assert all(node.metadata.get("document_id") != ref_doc_id for node in res.nodes)
In this example, the delete method is used to remove all nodes associated with the document ID specified by ref_doc_id. After deletion, you can query the vector store to verify that no nodes with the specified document_id remain in the database [1][2].
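Alternatively, you can delete through the underlying Chroma collection with a metadata filter, bypassing the LlamaIndex wrapper. A sketch, assuming your nodes carry a file_name key in their stored metadata (the key name depends on how the nodes were built):

# Delete every embedding whose stored metadata matches the file name
chroma_collection.delete(where={"file_name": "example.pdf"})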
@Dosu How do we know ref_doc_id?
Any code examples would be greatly appreciated!
To determine the ref_doc_id for documents stored in ChromaDB, you can use the get_all_ref_doc_info method. This method retrieves a mapping of all document IDs (ref_doc_id) along with their associated metadata. Here's an example of how you can use this method:
def get_all_document_ids(vector_store: ChromaVectorStore):
    # Retrieve all document IDs and their associated metadata
    ref_doc_info = vector_store.get_all_ref_doc_info()
    if ref_doc_info is not None:
        for doc_id, info in ref_doc_info.items():
            print(f"Document ID: {doc_id}, Metadata: {info}")
    else:
        print("No documents found in the database.")

# Example usage
# vector_store = ...  # Initialize your ChromaVectorStore instance
# get_all_document_ids(vector_store)
This code will print out all the document IDs and their metadata, allowing you to identify the ref_doc_id for the documents you are interested in [1][2].
@dosu Failed to delete: 'ChromaVectorStore' object has no attribute 'get_all_ref_doc_info'
To retrieve document IDs from a ChromaVectorStore when the get_all_ref_doc_info method is not available, you can use the _get method. This method returns a VectorStoreQueryResult containing the IDs. Here's how you can do it:
# Assuming you have an instance of ChromaVectorStore called `chroma_vector_store`
result = chroma_vector_store._get(limit=None, where={})
document_ids = result.ids

# Print out the document IDs
for doc_id in document_ids:
    print(f"Document ID: {doc_id}")
This code snippet uses the _get method to fetch all document IDs from the ChromaVectorStore. The where parameter is set to an empty dictionary to retrieve all documents without any filters [1].
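Since _get is a private method and may change between versions, you can also read IDs and metadata straight from the underlying Chroma collection. A sketch, assuming the chroma_collection from the earlier snippets (metadata keys such as document_id or file_name depend on how the nodes were created):

# Fetch all stored IDs together with their metadata
data = chroma_collection.get(include=["metadatas"])
for node_id, meta in zip(data["ids"], data["metadatas"]):
    print(node_id, meta.get("document_id"), meta.get("file_name"))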
Question
I'm working on a project where I have an existing folder chroma_db containing pre-generated embeddings. I want to add new embeddings from recently added documents to this existing database. Here’s what I have:
I initialize the ChromaVectorStore with pre-existing embeddings if the chroma_db folder is present. If not, I create a new VectorStoreIndex with all existing nodes.
I have a set of new documents that I process to create new nodes. These nodes are further divided into sub-chunks, and both the sub-chunks and the original chunks are added to the all_nodes_dict.
Here is the code I am using to process the new documents and update the all_nodes_dict (the node-creation snippet quoted earlier in this thread):
What I need help with:
I want to create embeddings for these new nodes and update the existing embeddings in chroma_db so that the database contains both the previous and the new embeddings. How can I achieve this?
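For reference, the load-or-create initialization described above might look like the following sketch (assuming the legacy StorageContext and VectorStoreIndex APIs, and that all_nodes holds the initial nodes):

import os
import chromadb
from llama_index.legacy import StorageContext, VectorStoreIndex
from llama_index.legacy.vector_stores import ChromaVectorStore

db_exists = os.path.exists("./chroma_db")
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

if db_exists:
    # Reuse the pre-generated embeddings already stored in chroma_db
    index = VectorStoreIndex.from_vector_store(vector_store)
else:
    # First run: embed and persist all existing nodes
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex(all_nodes, storage_context=storage_context)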