run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.85k stars 5.28k forks

[Question]: Unable to understand how document storage works in case nodes are deleted #14767

Closed BalasubramanyamEvani closed 3 weeks ago

BalasubramanyamEvani commented 4 months ago

Question

Hello,

My use case is as follows: When I create a vector store index (ChromaVectorStore) and persist the storage context, how should one go about deleting the nodes later? Let's say I want to delete all the nodes.

I can probably get all the IDs in the ChromaDB collection and then delete them, but as far as I understand, the deleted node IDs should be removed from the docstore.json when the storage context is persisted. How should one handle this? Or is there a way LlamaIndex handles it automatically?

Any pointers regarding this would be helpful

Thank you

dosubot[bot] commented 4 months ago

Hello @BalasubramanyamEvani! 👋 I'm here to help you with any bugs, questions, or contributions. Let's tackle this together!

To delete nodes from a ChromaVectorStore, use the delete_nodes method of the ChromaVectorStore class, which deletes nodes by their node_ids. Removing the same IDs from docstore.json is handled at the index level, covered further down.

Here is an example of how to use the delete_nodes method:

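from typing import List

from llama_index.core.schema import TextNode
from llama_index.vector_stores.chroma import ChromaVectorStore


# vector_store and node_embeddings are assumed to be supplied as pytest fixtures.
def test_delete_nodes(
    vector_store: ChromaVectorStore, node_embeddings: List[TextNode]
) -> None:
    vector_store.add(node_embeddings)
    # Delete two of the stored nodes by ID.
    vector_store.delete_nodes(
        node_ids=[
            "c330d77f-90bd-4c51-9ed2-57d8d693b3b0",
            "c3d1e1dd-8fb4-4b8f-b7ea-7fa96038d39d",
        ]
    )
    # Look up the two deleted IDs plus one that was not deleted.
    res = vector_store.get_nodes(
        node_ids=[
            "c330d77f-90bd-4c51-9ed2-57d8d693b3b0",
            "c3d1e1dd-8fb4-4b8f-b7ea-7fa96038d39d",
            "c3ew11cd-8fb4-4b8f-b7ea-7fa96038d39d",
        ]
    )
    # Only the node that was not deleted should come back.
    assert len(res) == 1
    assert res[0].get_content() == "lorem ipsum"
    assert res[0].id_ == "c3ew11cd-8fb4-4b8f-b7ea-7fa96038d39d"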

This method removes the nodes with the given node_ids from the ChromaDB collection. It operates on the vector store only, so it does not by itself remove those node IDs from docstore.json [1][2][3].

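Additionally, the VectorStoreIndex class in the LlamaIndex framework provides its own delete_nodes method. It always removes the nodes from the vector store. It also removes them from the document store, but only when delete_from_docstore=True is passed and the index actually keeps nodes in the docstore (the vector store does not store text, or _store_nodes_override is set):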

def delete_nodes(
    self,
    node_ids: List[str],
    delete_from_docstore: bool = False,
    **delete_kwargs: Any,
) -> None:
    """Delete a list of nodes from the index.

    Args:
        node_ids (List[str]): A list of node_ids from the nodes to delete

    """
    # delete nodes from vector store
    self._vector_store.delete_nodes(node_ids, **delete_kwargs)
    # delete from docstore only if needed
    if (
        not self._vector_store.stores_text or self._store_nodes_override
    ) and delete_from_docstore:
        for node_id in node_ids:
            self._docstore.delete_document(node_id, raise_error=False)

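With delete_from_docstore=True and the condition above met, the nodes are deleted from both the vector store and the document store, so the storage context stays consistent when it is persisted. With the default delete_from_docstore=False, the docstore is left untouched [4][5].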
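
The condition guarding the docstore deletion can be restated as a small standalone predicate, which makes it easier to see which combinations of settings actually touch the docstore. This is only a sketch: docstore_entries_removed is a hypothetical helper and not part of LlamaIndex, and the first two calls assume a vector store that reports stores_text=True, which is my understanding of ChromaVectorStore.

```python
def docstore_entries_removed(
    stores_text: bool,
    store_nodes_override: bool,
    delete_from_docstore: bool,
) -> bool:
    """Restate the condition that guards the docstore deletion in delete_nodes.

    Hypothetical helper for illustration only; it is not a LlamaIndex API.
    """
    return (not stores_text or store_nodes_override) and delete_from_docstore


# Vector store keeps the node text itself (assumed for Chroma):
# the flag alone is not enough, the override must also be set.
print(docstore_entries_removed(True, False, True))   # False
print(docstore_entries_removed(True, True, True))    # True

# Vector store does not keep the text: the flag alone is enough.
print(docstore_entries_removed(False, False, True))  # True

# Default delete_from_docstore=False: the docstore is never touched.
print(docstore_entries_removed(False, True, False))  # False
```

Under that assumption, calling delete_nodes with delete_from_docstore=True on a Chroma-backed index only clears the docstore entries if the index was also created with the store-nodes override enabled.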

Avinash-Raj commented 4 months ago

@dosu but index._store_nodes_override has to be set in order to remove entries from the docstore.