Pipeline creates Chunks with duplicate ids when executed multiple times

risafj commented 6 days ago

When I run Pipeline() in a loop over multiple documents, each run creates a Chunk node with an id property of ":1" and an index of 1. This causes problems, since the ids are no longer unique.

For example, when the lexical graph gets created, a Chunk node with an id of ":1" has a NEXT_NODE relation to every Chunk node that has an id of ":2".

After running the pipeline with 4 documents, it looks like this:

The same issue is occurring with FROM_CHUNK, where an entity that's supposed to have a relation like (n:Entity)-[:FROM_CHUNK]->(c:Chunk {id: ":1", index: "1"}) actually has that relation to all documents' chunks with an index of 1.

Is there any workaround for this? I'm guessing this issue would be solved if I could somehow pass a document-specific id_prefix so each chunk gets a unique id?

https://github.com/neo4j/neo4j-graphrag-python/blob/bc6dd9c7b3f8fcfffb9ed360648ea80c6cbb17dc/src/neo4j_graphrag/experimental/components/lexical_graph.py#L78-L79
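
To illustrate the collision, here is a minimal sketch. It assumes the chunk id is built as `<id_prefix>:<index>` with an empty default prefix, which would match the ":1" ids above. The names `chunk_id` and `ids_for_runs` are hypothetical stand-ins, not the library's code, and the 0-based index is also an assumption.

```python
from __future__ import annotations

from collections import Counter


def chunk_id(index: int, id_prefix: str = "") -> str:
    """Hypothetical stand-in for the id scheme: '<prefix>:<index>'."""
    return f"{id_prefix}:{index}"


def ids_for_runs(
    num_docs: int,
    chunks_per_doc: int,
    prefixes: list[str] | None = None,
) -> list[str]:
    """Collect the chunk ids produced by one pipeline run per document."""
    ids = []
    for doc in range(num_docs):
        # Without explicit prefixes, every run falls back to the empty prefix.
        prefix = prefixes[doc] if prefixes else ""
        for index in range(chunks_per_doc):
            ids.append(chunk_id(index, prefix))
    return ids


# Four documents, three chunks each, no prefix: only three distinct ids exist.
colliding = ids_for_runs(num_docs=4, chunks_per_doc=3)
print(Counter(colliding))

# With one prefix per document, all twelve ids are distinct.
unique = ids_for_runs(4, 3, prefixes=["doc0", "doc1", "doc2", "doc3"])
print(len(set(unique)))
```

If the ids really are built this way, any relationship matched on the id alone would attach to every chunk sharing that index, which is consistent with what I'm seeing.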

Additional info: I use v1.2.0. I have a standard pipeline setup that has these components.

    pipe = Pipeline()
    # skipping the config code
    pipe.add_component(text_splitter, "splitter")
    pipe.add_component(embedder, "chunk_embedder")
    pipe.add_component(schema_builder, "schema")
    pipe.add_component(extractor, "extractor")
    pipe.add_component(writer, "writer")
    pipe.add_component(resolver, "resolver")

stellasia commented 6 days ago

Hi @risafj ,

Indeed, this behavior is quite annoying. We'll take a closer look.

In the meantime, you can control this prefix by setting it in a LexicalGraphConfig, which is a run parameter of the entity and relation extractor.

So your code will look like this:

    from neo4j_graphrag.experimental.components.types import LexicalGraphConfig

    config = LexicalGraphConfig(
        id_prefix="myPrefix",
    )

    await pipe.run(data={
        # ...
        "extractor": {
            # ...
            "lexical_graph_config": config,
        }
    })
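
When looping over several documents, the prefix needs to differ on each run. One possible way to derive it is sketched below, assuming each document is identified by a file path. The helper `prefix_for` is hypothetical and not part of the library. Any string that is unique per document would do.

```python
import hashlib
from pathlib import Path


def prefix_for(doc_path: str) -> str:
    """Hypothetical helper: build a stable, document-specific id prefix.

    Combines the file stem (for readability) with a short hash of the
    full path, so two files with the same name in different folders
    still get different prefixes.
    """
    digest = hashlib.sha1(doc_path.encode("utf-8")).hexdigest()[:8]
    return f"{Path(doc_path).stem}-{digest}"


doc_paths = ["docs/a.txt", "docs/b.txt", "archive/a.txt"]
prefixes = {path: prefix_for(path) for path in doc_paths}
print(prefixes)

# Intended use with the pipeline from this thread (not executed here):
#
#     for path in doc_paths:
#         config = LexicalGraphConfig(id_prefix=prefix_for(path))
#         await pipe.run(data={
#             # ...
#             "extractor": {"lexical_graph_config": config},
#         })
```

Hashing the path keeps the prefix the same when a document is re-processed. A random value such as a UUID would also avoid collisions, but it would change on every run.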

Let me know if you need more assistance.

stellasia commented 6 days ago

Are you using a custom entity and relation extractor?

risafj commented 5 days ago

Hi @stellasia ,

Thank you so much for the quick turnaround and helpful response! Your solution worked perfectly!

> Are you using a custom entity and relation extractor?

No, I'm using the one defined in this library:

    from neo4j_graphrag.experimental.components.entity_relation_extractor import (
        LLMEntityRelationExtractor,
        OnError,
    )

    extractor = LLMEntityRelationExtractor(
        llm=llm,
        on_error=OnError.RAISE,
        prompt_template=custom_prompt,
    )

stellasia commented 5 days ago

Thank you for raising the issue and for the information. We will investigate this shortly.

neo4j / neo4j-graphrag-python

Pipeline creates Chunks with duplicate ids when executed multiple times #221