run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: How to upsert nodes to refresh node names, properties, and embeddings in a Neo4j property graph? #16826

Closed justinzyw closed 2 hours ago

justinzyw commented 2 hours ago

Question

I have a Neo4j property graph, and I would like to refresh each node's name, properties, and related embeddings. How can I achieve this by upserting the nodes?

dosubot[bot] commented 2 hours ago

To upsert nodes in a Neo4j property graph and refresh the node's name, properties, and related embeddings, you can use the upsert_nodes method. This method merges nodes based on their IDs and updates their properties and embeddings. Here's a brief overview of how it works:

  1. Separate Nodes by Type: Nodes are categorized into entity_dicts and chunk_dicts based on their type (EntityNode or ChunkNode).

  2. Upsert Chunk Nodes: For each chunk in chunk_dicts, a Cypher query is executed to:

    • Merge the node based on its ID.
    • Set the node's text and properties.
  3. Upsert Entity Nodes: For each entity in entity_dicts, a Cypher query is executed to:

    • Merge the node based on its ID.
    • Set the node's properties and name.
    • Optionally, create a relationship if a triplet_source_id is provided.

This approach ensures that the node's name, properties, and embeddings are refreshed in the Neo4j graph. Here's a code snippet illustrating the method:

# Method of Neo4jPropertyGraphStore; CHUNK_SIZE, BASE_NODE_LABEL, and
# BASE_ENTITY_LABEL are module-level constants in the same file.
def upsert_nodes(self, nodes: List[LabelledNode]) -> None:
    # Lists to hold the nodes separated by type
    entity_dicts: List[dict] = []
    chunk_dicts: List[dict] = []

    # Sort by type
    for item in nodes:
        if isinstance(item, EntityNode):
            entity_dicts.append({**item.dict(), "id": item.id})
        elif isinstance(item, ChunkNode):
            chunk_dicts.append({**item.dict(), "id": item.id})
        else:
            pass

    if chunk_dicts:
        for index in range(0, len(chunk_dicts), CHUNK_SIZE):
            chunked_params = chunk_dicts[index : index + CHUNK_SIZE]
            for param in chunked_params:
                formatted_properties = ", ".join(
                    [
                        f"{key}: {value!r}"
                        for key, value in param["properties"].items()
                    ]
                )
                self.structured_query(
                    f"""
                    MERGE (c:{BASE_NODE_LABEL} {{id: '{param["id"]}'}})
                    SET c.`text` = '{param["text"]}', c:Chunk
                    WITH c
                    SET c += {{{formatted_properties}}}
                    RETURN count(*)
                    """
                )

    if entity_dicts:
        for index in range(0, len(entity_dicts), CHUNK_SIZE):
            chunked_params = entity_dicts[index : index + CHUNK_SIZE]
            for param in chunked_params:
                formatted_properties = ", ".join(
                    [
                        f"{key}: {value!r}"
                        for key, value in param["properties"].items()
                    ]
                )
                self.structured_query(
                    f"""
                    MERGE (e:{BASE_NODE_LABEL} {{id: '{param["id"]}'}})
                    SET e += {{{formatted_properties}}}
                    SET e.name = '{param["name"]}', e:`{BASE_ENTITY_LABEL}`
                    WITH e
                    SET e :{param["label"]}
                    """
                )
                triplet_source_id = param["properties"].get("triplet_source_id")
                if triplet_source_id:
                    self.structured_query(
                        f"""
                        MERGE (e:{BASE_NODE_LABEL} {{id: '{param["id"]}'}})
                        MERGE (c:{BASE_NODE_LABEL} {{id: '{triplet_source_id}'}})
                        MERGE (e)<-[:MENTIONS]-(c)
                        """
                    )

This method ensures that the nodes are updated or created as needed, maintaining the integrity of the graph [1].
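At its core, the Cypher `MERGE ... SET n += {...}` pattern above is update-or-create keyed on the node id: existing properties are kept, and any property supplied in the upsert overwrites its old value. A minimal pure-Python sketch of that semantics (a dict stands in for the graph; this is not LlamaIndex or Neo4j API):

```python
# Stand-in sketch of MERGE-then-SET semantics: the "graph" is a dict keyed
# by node id, and upserting merges new properties into any existing record.
def upsert(graph: dict, node_id: str, properties: dict) -> None:
    record = graph.setdefault(node_id, {"id": node_id})  # MERGE on id
    record.update(properties)                            # SET n += {...}

graph = {}
upsert(graph, "n1", {"name": "Alice", "age": 30})
upsert(graph, "n1", {"name": "Alicia"})  # refreshes the name, keeps age

print(graph["n1"])  # {'id': 'n1', 'name': 'Alicia', 'age': 30}
```

This is why re-upserting a node with a new name or new properties "refreshes" it rather than duplicating it: the merge key is the id, not the name.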

To continue talking to Dosu, mention @dosu.

justinzyw commented 2 hours ago

@dosu, the method above directly manipulates the DB, but it cannot refresh the embeddings to match the new name and properties. It has to invoke the LLM in order to generate embeddings accordingly.

dosubot[bot] commented 2 hours ago

To refresh embeddings based on updated node names and properties, you need to invoke the embedding model to generate new embeddings. You can use the embed_nodes function for text nodes or the embed_image_nodes function for image nodes. These functions check whether a node's embedding is None and generate new embeddings using the provided embedding model. Here's how you can do it:

For text nodes:

# embed_nodes lives in the indices utilities (the exact path may differ by version)
from llama_index.legacy.indices.utils import embed_nodes
from llama_index.legacy.schema import BaseNode

# Assume `nodes` is a sequence of BaseNode and `embed_model` is a concrete
# BaseEmbedding subclass instance (BaseEmbedding itself is abstract and
# cannot be instantiated directly)
nodes = [...]  # your list of nodes
embed_model = ...  # your embedding model instance

# Generate new embeddings for nodes whose embedding is None
id_to_embed_map = embed_nodes(nodes, embed_model, show_progress=True)

For image nodes:

# embed_image_nodes lives in the indices utilities (the exact path may differ by version)
from llama_index.legacy.indices.utils import embed_image_nodes
from llama_index.legacy.schema import ImageNode

# Assume `nodes` is a sequence of ImageNode and `embed_model` is a concrete
# MultiModalEmbedding subclass instance (MultiModalEmbedding itself is abstract)
nodes = [...]  # your list of image nodes
embed_model = ...  # your multi-modal embedding model instance

# Generate new image embeddings
id_to_embed_map = embed_image_nodes(nodes, embed_model, show_progress=True)

These functions return a dictionary mapping node IDs to their corresponding embeddings. You can then update the Neo4j graph with these new embeddings [1][2].
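The write-back step — copying each freshly generated vector from id_to_embed_map onto its node before calling upsert_nodes — is plain glue code. A stand-in sketch with a minimal node class (in practice the objects are LlamaIndex EntityNode/ChunkNode instances, and the refreshed list goes to graph_store.upsert_nodes):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Node:
    """Stand-in for a LlamaIndex node: just an id and an embedding slot."""
    id: str
    embedding: Optional[List[float]] = None

def apply_embeddings(
    nodes: List[Node], id_to_embed_map: Dict[str, List[float]]
) -> List[Node]:
    """Copy newly generated vectors onto their nodes, keyed by node id."""
    for node in nodes:
        if node.id in id_to_embed_map:
            node.embedding = id_to_embed_map[node.id]
    return nodes  # ready to pass to the graph store's upsert

nodes = [Node("n1"), Node("n2")]
refreshed = apply_embeddings(nodes, {"n1": [0.1, 0.2]})
print(refreshed[0].embedding)  # [0.1, 0.2]
print(refreshed[1].embedding)  # None (no new embedding was generated)
```

Nodes absent from the map keep their old (or empty) embedding, so you only overwrite vectors you actually regenerated.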
