run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

How to add new document embeddings to existing embeddings in chromadb #15082

Open · LikhithRishi opened this issue 3 months ago

LikhithRishi commented 3 months ago

Question

if os.path.exists('./chroma_db'):
    print("Utilizing pre-generated embeddings from the chroma_db folder")
    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    chroma_collection = chroma_client.get_or_create_collection("quickstart")
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    vector_index_chunk = VectorStoreIndex.from_vector_store(
        vector_store, embed_model=embed_model, show_progress=True
    )
else:
    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    chroma_collection = chroma_client.get_or_create_collection("quickstart")
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    vector_index_chunk = VectorStoreIndex(
        all_nodes, embed_model=embed_model, storage_context=storage_context, show_progress=True
    )

vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=5)
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)

I'm working on a project where I have an existing folder chroma_db containing pre-generated embeddings. I want to add new embeddings from recently added documents to this existing database. Here’s what I have:

I initialize the ChromaVectorStore with pre-existing embeddings if the chroma_db folder is present. If not, I create a new VectorStoreIndex with all existing nodes.

I have a set of new documents that I process to create new nodes. These nodes are further divided into sub-chunks, and both the sub-chunks and the original chunks are added to the all_nodes_dict.

Here is the code I am using to process the new documents and update the all_nodes_dict:

new_documents = [doc for doc in documents if doc.metadata['file_name'] in new_files]

# Process new documents
node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=100)
base_nodes = node_parser.get_nodes_from_documents(new_documents)

for idx, node in enumerate(base_nodes):
    node.id_ = f"node-{len(all_nodes_dict) + idx}"

sub_chunk_sizes = [128, 256, 512]
sub_node_parsers = [
    SentenceSplitter(chunk_size=c, chunk_overlap=100) for c in sub_chunk_sizes
]

new_nodes = []
for base_node in base_nodes:
    for n in sub_node_parsers:
        sub_nodes = n.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        new_nodes.extend(sub_inodes)

    # Add the original node
    original_node = IndexNode.from_text_node(base_node, base_node.node_id)
    new_nodes.append(original_node)

# Update all_nodes_dict with new nodes
for n in new_nodes:
    all_nodes_dict[n.node_id] = n

# Save the updated dictionary
with open(file_path, 'wb') as f:
    pickle.dump(all_nodes_dict, f)

What I need help with:

I want to create embeddings for these new nodes and update the existing embeddings in chroma_db so that the database contains both the previous and the new embeddings. How can I achieve this?

dosubot[bot] commented 3 months ago

To add new document embeddings to the existing embeddings in chroma_db, you can use the add method of the ChromaVectorStore class. This method allows you to add new embeddings to the existing collection. Here’s how you can achieve this:

  1. Initialize the ChromaVectorStore with the existing collection.
  2. Create embeddings for the new nodes.
  3. Add the new nodes to the existing vector store.

Here is a step-by-step guide with code snippets:

Step 1: Initialize the ChromaVectorStore

import chromadb
from llama_index.legacy.vector_stores import ChromaVectorStore

# Initialize the Chroma client and collection
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

Step 2: Create Embeddings for the New Nodes

Assuming you have already processed the new documents and created new nodes, you can now generate embeddings for these nodes. Here’s an example of how to do this:

from llama_index.legacy.schema import NodeRelationship, RelatedNodeInfo, TextNode

# Define new nodes with their embeddings
new_nodes = [
    TextNode(
        text="New text 1",
        id_="new-id-1",
        relationships={NodeRelationship.SOURCE: RelatedNodeInfo(node_id="new-test-1")},
        metadata={"author": "New Author 1"},
        embedding=[0.1, 0.2, 0.3],  # Example embedding
    ),
    TextNode(
        text="New text 2",
        id_="new-id-2",
        relationships={NodeRelationship.SOURCE: RelatedNodeInfo(node_id="new-test-2")},
        metadata={"author": "New Author 2"},
        embedding=[0.4, 0.5, 0.6],  # Example embedding
    ),
]
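In practice you would compute these vectors with your embedding model rather than hard-coding them. A minimal sketch, reusing the embed_model from your own snippet:

# Sketch: embed each node's text with the same model used for the index
for node in new_nodes:
    node.embedding = embed_model.get_text_embedding(node.get_content())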

Step 3: Add the New Nodes to the Existing Vector Store

# Add new nodes to the existing vector store
vector_store.add(new_nodes)

# Optionally, you can query to verify the addition
from llama_index.legacy.vector_stores.query import VectorStoreQuery

res = vector_store.query(
    VectorStoreQuery(query_embedding=[0.1, 0.2, 0.3], similarity_top_k=1)
)
assert res.nodes
print(res.nodes[0].get_content())  # Should print "New text 1"

Asynchronous Operations (Optional)

If you need to handle asynchronous operations, you can use the async_add and aquery methods:

import pytest
from llama_index.legacy.vector_stores.query import VectorStoreQuery

@pytest.mark.asyncio()
async def test_add_to_chromadb_and_query(vector_store: ChromaVectorStore, use_async: bool) -> None:
    new_nodes = [
        TextNode(
            text="New text 1",
            id_="new-id-1",
            relationships={NodeRelationship.SOURCE: RelatedNodeInfo(node_id="new-test-1")},
            metadata={"author": "New Author 1"},
            embedding=[0.1, 0.2, 0.3],
        ),
        TextNode(
            text="New text 2",
            id_="new-id-2",
            relationships={NodeRelationship.SOURCE: RelatedNodeInfo(node_id="new-test-2")},
            metadata={"author": "New Author 2"},
            embedding=[0.4, 0.5, 0.6],
        ),
    ]

    if use_async:
        await vector_store.async_add(new_nodes)
        res = await vector_store.aquery(
            VectorStoreQuery(query_embedding=[0.1, 0.2, 0.3], similarity_top_k=1)
        )
    else:
        vector_store.add(new_nodes)
        res = vector_store.query(
            VectorStoreQuery(query_embedding=[0.1, 0.2, 0.3], similarity_top_k=1)
        )

    assert res.nodes
    print(res.nodes[0].get_content())  # Should print "New text 1"

# Example usage
# vector_store = ...  # Initialize your ChromaVectorStore instance
# await test_add_to_chromadb_and_query(vector_store, use_async=True)

This approach ensures that both the previous and new embeddings are stored in the chroma_db database [1][2].

To continue talking to Dosu, mention @dosu.

LikhithRishi commented 2 months ago

@dosu Below is the code I used to create new_nodes from the newly added documents:

new_documents = [doc for doc in documents if doc.metadata['file_name'] in new_files]
# Process new documents
node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=100)
base_nodes = node_parser.get_nodes_from_documents(new_documents)

for idx, node in enumerate(base_nodes):
    node.id_ = f"node-{len(all_nodes_dict) + idx}"

sub_chunk_sizes = [128, 256, 512]
sub_node_parsers = [
    SentenceSplitter(chunk_size=c, chunk_overlap=100) for c in sub_chunk_sizes
]

new_nodes = []
for base_node in base_nodes:
    for n in sub_node_parsers:
        sub_nodes = n.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        new_nodes.extend(sub_inodes)

    # Add the original node
    original_node = IndexNode.from_text_node(base_node, base_node.node_id)
    new_nodes.append(original_node)

I need to create embeddings for these new_nodes. How should I go about generating and handling these embeddings? Specifically, I would like to know:

How to use my embedding model to create embeddings for each new_node, so they can be added to an existing ChromaDB collection

dosubot[bot] commented 2 months ago

To generate embeddings for your new_nodes and add them to an existing ChromaDB collection, follow these steps:

  1. Initialize the ChromaVectorStore with the existing collection.
  2. Generate embeddings for the new nodes using your embedding model.
  3. Add the new nodes with embeddings to the existing vector store.

Step 1: Initialize the ChromaVectorStore

import chromadb
from llama_index.legacy.vector_stores import ChromaVectorStore

# Initialize the Chroma client and collection
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

Step 2: Generate Embeddings for the New Nodes

Assuming you have an embedding model, you can generate embeddings for each new_node. Here’s an example using a hypothetical embed_model:

from llama_index.embeddings.openai import OpenAIEmbedding

# Initialize your embedding model
embed_model = OpenAIEmbedding()

# Generate embeddings for new nodes
for node in new_nodes:
    node.embedding = embed_model.get_text_embedding(node.text)
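
For larger batches it is usually faster to embed all the texts in one call. A minimal sketch using get_text_embedding_batch, which LlamaIndex embedding models expose via the BaseEmbedding interface:

# Sketch: batch-embed all new nodes instead of looping one at a time
texts = [node.get_content() for node in new_nodes]
embeddings = embed_model.get_text_embedding_batch(texts, show_progress=True)
for node, embedding in zip(new_nodes, embeddings):
    node.embedding = embedding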

Step 3: Add the New Nodes to the Existing Vector Store

# Add new nodes to the existing vector store
vector_store.add(new_nodes)

# Optionally, you can query to verify the addition
from llama_index.legacy.vector_stores.query import VectorStoreQuery

query_embedding = embed_model.get_text_embedding("sample query text")
res = vector_store.query(VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=1))
print(res.nodes[0].get_content())  # Should print the content of the most similar node

This approach ensures that the new nodes with their embeddings are added to the existing ChromaDB collection, allowing you to query both the old and new embeddings.

For more details on generating embeddings using the embedding model in the LlamaIndex framework, refer to the resolve_embed_model function [1][2].
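
As a rough sketch of how resolve_embed_model is used (the import path is an assumption for llama_index.core v0.10+; "local:..." model strings additionally require the HuggingFace embeddings package to be installed):

from llama_index.core.embeddings import resolve_embed_model

# "local:<model-name>" resolves to a local HuggingFace embedding model
embed_model = resolve_embed_model("local:BAAI/bge-small-en-v1.5")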

To continue talking to Dosu, mention @dosu.

LikhithRishi commented 3 weeks ago

@dosu How do we delete the embeddings of a particular document that are stored in ChromaDB? We are able to delete nodes related to a specific document from the in-memory all_nodes_dict using the following code:

async def delete_embeddings(all_nodes_dict, filepath):
    filename = os.path.basename(filepath)
    print("Current all_nodes_dict:", all_nodes_dict)

    # Identify the node IDs to delete based on the filename
    delete_node_ids = [
        node.node_id for node in all_nodes_dict.values() if node.metadata.get('file_name') == filename
    ]
    print("Node IDs to delete:", delete_node_ids)

    # Delete nodes from the in-memory dictionary
    for node_id in delete_node_ids:
        del all_nodes_dict[node_id]

However, we are unsure how to reflect this deletion in ChromaDB itself, so that the document's embeddings are also removed from persistent storage.

dosubot[bot] commented 3 weeks ago

To delete embeddings of a particular document from the stored ChromaDB, you can use the delete method of the ChromaVectorStore class. This method allows you to specify the ref_doc_id of the document whose embeddings you want to delete. Here's an example:

def delete_document_embeddings(vector_store: ChromaVectorStore, ref_doc_id: str) -> None:
    # Delete all nodes associated with the specified document ID
    vector_store.delete(ref_doc_id=ref_doc_id)

    # Optionally, verify the deletion; the query embedding must be a real,
    # non-empty vector, e.g. one produced by your embed_model
    query_embedding = embed_model.get_text_embedding("verification query")
    res = vector_store.query(VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=10))
    assert all(node.metadata.get("document_id") != ref_doc_id for node in res.nodes)

In this example, the delete method is used to remove all nodes associated with the document ID specified by ref_doc_id. After deletion, you can query the vector store to verify that no nodes with the specified document_id remain in the database [1][2].
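
If you want to drive this from the same file-based flow as your delete_embeddings helper, one possible sketch (assuming each node exposes its source document via the ref_doc_id property, as LlamaIndex nodes do, and that filename and vector_store are defined as in your code):

# Collect the source document IDs for a given file, then delete their
# embeddings from ChromaDB as well as from the in-memory dictionary
ref_doc_ids = {
    node.ref_doc_id
    for node in all_nodes_dict.values()
    if node.metadata.get('file_name') == filename and node.ref_doc_id
}
for ref_doc_id in ref_doc_ids:
    vector_store.delete(ref_doc_id=ref_doc_id)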

To continue talking to Dosu, mention @dosu.

LikhithRishi commented 3 weeks ago

@dosu How do we determine the ref_doc_id?

Any code examples would be greatly appreciated!

dosubot[bot] commented 3 weeks ago

To determine the ref_doc_id for documents stored in ChromaDB, you can use the get_all_ref_doc_info method. This method retrieves a mapping of all document IDs (ref_doc_id) along with their associated metadata. Here's an example of how you can use this method:

def get_all_document_ids(vector_store: ChromaVectorStore):
    # Retrieve all document IDs and their associated metadata
    ref_doc_info = vector_store.get_all_ref_doc_info()

    if ref_doc_info is not None:
        for doc_id, info in ref_doc_info.items():
            print(f"Document ID: {doc_id}, Metadata: {info}")
    else:
        print("No documents found in the database.")

# Example usage
# vector_store = ...  # Initialize your ChromaVectorStore instance
# get_all_document_ids(vector_store)

This code will print out all the document IDs and their metadata, allowing you to identify the ref_doc_id for the documents you are interested in [1][2].

To continue talking to Dosu, mention @dosu.

LikhithRishi commented 3 weeks ago

@dosu Failed to delete: 'ChromaVectorStore' object has no attribute 'get_all_ref_doc_info'

dosubot[bot] commented 3 weeks ago

To retrieve document IDs from a ChromaVectorStore when the get_all_ref_doc_info method is not available, you can use the _get method. This method returns a VectorStoreQueryResult containing the IDs. Here's how you can do it:

# Assuming you have an instance of ChromaVectorStore called `chroma_vector_store`
result = chroma_vector_store._get(limit=None, where={})
document_ids = result.ids

# Print out the document IDs
for doc_id in document_ids:
    print(f"Document ID: {doc_id}")

This code snippet uses the _get method to fetch all document IDs from the ChromaVectorStore. The where parameter is set to an empty dictionary to retrieve all documents without any filters [1].
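
Note that these are node-level IDs rather than source-document IDs. You can also go through the underlying chromadb collection directly; its get() method returns a dict whose "ids" and "metadatas" keys cover every stored entry, and LlamaIndex typically records the source document under the "ref_doc_id" metadata key (a sketch under those assumptions):

# Sketch using chromadb's native API; chroma_collection is the
# collection object created earlier with get_or_create_collection
results = chroma_collection.get()
for node_id, metadata in zip(results["ids"], results["metadatas"]):
    print(f"Node ID: {node_id}, ref_doc_id: {metadata.get('ref_doc_id')}")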

To continue talking to Dosu, mention @dosu.