Investigate MERGE and constraints for use in preventing duplicate and stray nodes and relationships

NoisyFlowers commented 3 years ago

The current PBOT.cypher script uses CREATE for everything. If run twice, we would end up with everything duplicated in the db. The ways to prevent this provided by Neo4J are by using MERGE instead of CREATE, by using constraints, or some combination of the two.

Look further into our options here.

NoisyFlowers commented 3 years ago

Regarding the use of constraints:

The constraint type that will be of most use to us in preventing duplicate nodes is the "node key constraint". From cypher documentation:

Node key constraints Node key constraints ensure that, for a given label and set of properties:

All the properties exist on all the nodes with that label. The combination of the property values is unique.

There is nothing similar for relationships. The only relationship constraint is the "relationship property existence constraint", and I'm not sure how useful that will be for us.

Relationship property existence constraints Property existence constraints ensure that a property exists for all relationships with a specific type. All queries that try to create relationships of the specified type, but without this property, will fail. The same is true for queries that try to remove the mandatory property.

So, it seems like we will have to rely on MERGE to prevent the situation described in #15.

In addition, there is another potential problem with using constraints: they are only available in the Enterprise edition.

Node key constraints, node property existence constraints and relationship property existence constraints are only available in Neo4j Enterprise Edition. Databases containing one of these constraint types cannot be opened using Neo4j Community Edition.

What sort of license will this project ultimately be running under? Right now, with Neo4J Desktop, we have a development license for Enterprise that is limited to developmental purposes on the local machine of the registered user. My understanding of this is that we would not be able to use this license to host our project for external consumption. More info here: https://neo4j.com/licensing/

NoisyFlowers commented 3 years ago

Regarding use of MERGE:

Changing from CREATE to MERGE involves pulling all dynamic data out of the clause that describes what you are trying to CREATE and moving it into the ON CREATE clause of MERGE. If you do not do this, you end up with duplicate pieces (like a duplicate ENTERED_BY relationship because of the timestamp, or an entirely new node because of the uuid).

This can be a really big deal. For instance, take a look at the difference in this Character complex:

MATCH (person:Person {given: "Rebecca", surname: "Koll"}), (schema:Schema {title: "Smithsonian, 1999"})
CREATE
    (character:Character {characterID: apoc.create.uuid(), name: "Leaf Organization"})-[:CHARACTER_OF]->(schema),
    (character)-[:ENTERED_BY {timestamp: datetime()}]->(person),
        (state1:State {stateID: apoc.create.uuid(), name: "simple", definition: "consisting of a single lamina"})-[:STATE_OF]->(character),
        (state1)-[:ENTERED_BY {timestamp: datetime()}]->(person),
        (state2:State {stateID: apoc.create.uuid(), name: "palmately compound", definition: "a leaf with separate subunits attached at the apex of a petiole"})-[:STATE_OF]->(character),
        (state2)-[:ENTERED_BY {timestamp: datetime()}]->(person),
            (state21:State {stateID: apoc.create.uuid(), name: "sessile", definition: "subunits attached directly to apex petiole"})-[:STATE_OF]->(state2),
            (state21)-[:ENTERED_BY {timestamp: datetime()}]->(person),
            (state22:State {stateID: apoc.create.uuid(), name: "petiolulate", definition: "subunits have their own petioles, which then attach to the apex petiole"})-[:STATE_OF]->(state2),
            (state22)-[:ENTERED_BY {timestamp: datetime()}]->(person),
        (state3:State {stateID: apoc.create.uuid(), name: "trifoliate", definition: "a compound leaf with three leaflets"})-[:STATE_OF]->(character),
        (state3)-[:ENTERED_BY {timestamp: datetime()}]->(person),
        (state4:State {stateID: apoc.create.uuid(), name: "pinnately compound", definition: "a leaf with leaflets arranged along a rachis"})-[:STATE_OF]->(character),
        (state4)-[:ENTERED_BY {timestamp: datetime()}]->(person),
            (state41:State {stateID: apoc.create.uuid(), name: "odd-pinnate", definition: "rachis contains an odd number of leaflets"})-[:STATE_OF]->(state4),
            (state41)-[:ENTERED_BY {timestamp: datetime()}]->(person),
            (state42:State {stateID: apoc.create.uuid(), name: "even-pinnate", definition: "rachis contains an even number of leaflets"})-[:STATE_OF]->(state4),
            (state42)-[:ENTERED_BY {timestamp: datetime()}]->(person),
        (state5:State {stateID: apoc.create.uuid(), name: "bipinnate", definition: "compound leaf dissected twice with leaflets arranged along rachillae that are attached to the rachis (alternate term is twice pinnatley compound)"})-[:STATE_OF]->(character),
        (state5)-[:ENTERED_BY {timestamp: datetime()}]->(person),
        (state6:State {stateID: apoc.create.uuid(), name: "tripinnate", definition: "a compound leaf with leaflets attached to secondary rachillae that are in turn attached to rachillae, which are borne on the rachis (alternate term is thrice pinnately compound)"})-[:STATE_OF]->(character),
        (state6)-[:ENTERED_BY {timestamp: datetime()}]->(person);

MATCH (person:Person {given: "Rebecca", surname: "Koll"}), (schema:Schema {title: "Smithsonian, 1999"})
MERGE
    (character:Character {name: "Leaf Organization"})-[:CHARACTER_OF]->(schema),
    (character)-[character_entered_by:ENTERED_BY]->(person),
        (state1:State {name: "simple", definition: "consisting of a single lamina"})-[:STATE_OF]->(character),
        (state1)-[state1_entered_by:ENTERED_BY]->(person),
        (state2:State {name: "palmately compound", definition: "a leaf with separate subunits attached at the apex of a petiole"})-[:STATE_OF]->(character),
        (state2)-[state2_entered_by:ENTERED_BY]->(person),
            (state21:State {name: "sessile", definition: "subunits attached directly to apex petiole"})-[:STATE_OF]->(state2),
            (state21)-[state21_entered_by:ENTERED_BY]->(person),
            (state22:State {name: "petiolulate", definition: "subunits have their own petioles, which then attach to the apex petiole"})-[:STATE_OF]->(state2),
            (state22)-[state22_entered_by:ENTERED_BY]->(person),
        (state3:State {name: "trifoliate", definition: "a compound leaf with three leaflets"})-[:STATE_OF]->(character),
        (state3)-[state3_entered_by:ENTERED_BY]->(person),
        (state4:State {name: "pinnately compound", definition: "a leaf with leaflets arranged along a rachis"})-[:STATE_OF]->(character),
        (state4)-[state4_entered_by:ENTERED_BY]->(person),
            (state41:State {name: "odd-pinnate", definition: "rachis contains an odd number of leaflets"})-[:STATE_OF]->(state4),
            (state41)-[state41_entered_by:ENTERED_BY]->(person),
            (state42:State {name: "even-pinnate", definition: "rachis contains an even number of leaflets"})-[:STATE_OF]->(state4),
            (state42)-[state42_entered_by:ENTERED_BY]->(person),
        (state5:State {name: "bipinnate", definition: "compound leaf dissected twice with leaflets arranged along rachillae that are attached to the rachis (alternate term is twice pinnatley compound)"})-[:STATE_OF]->(character),
        (state5)-[state5_entered_by:ENTERED_BY]->(person),
        (state6:State {name: "tripinnate", definition: "a compound leaf with leaflets attached to secondary rachillae that are in turn attached to rachillae, which are borne on the rachis (alternate term is thrice pinnately compound)"})-[:STATE_OF]->(character),
        (state6)-[state6_entered_by:ENTERED_BY]->(person)
        ON CREATE SET
            character.characterID = apoc.create.uuid(),
            character_entered_by.timestamp = datetime(),
            state1.stateID = apoc.create.uuid(),
            state1_entered_by.timestamp = datetime(),
            state2.stateID = apoc.create.uuid(),
            state2_entered_by.timestamp = datetime(),
            state21.stateID = apoc.create.uuid(),
            state21_entered_by.timestamp = datetime(),
            state22.stateID = apoc.create.uuid(),
            state22_entered_by.timestamp = datetime(),
            state3.stateID = apoc.create.uuid(),
            state3_entered_by.timestamp = datetime(),
            state4.stateID = apoc.create.uuid(),
            state4_entered_by.timestamp = datetime(),
            state41.stateID = apoc.create.uuid(),
            state41_entered_by.timestamp = datetime(),
            state42.stateID = apoc.create.uuid(),
            state42_entered_by.timestamp = datetime(),
            state5.stateID = apoc.create.uuid(),
            state5_entered_by.timestamp = datetime(),
            state1.stateID = apoc.create.uuid(),
            state6_entered_by.timestamp = datetime();

Doing this is certainly possible, but the syntax sucks.

In addition, it would preclude using a "node property existence constraint" on the uuid field, which is something I think we should do if we are licensed for it.

I notice that the cypher automatically created when exporting a db uses CREATE rather than MERGE.

All this has me thinking we should use CREATEs for the initial db setup script, and consider using MERGE for the operational cypher used in the API we develop.

aazaff commented 2 years ago

I think this is all worked out no? @NoisyFlowers

paleobot / pbot-dev

Investigate MERGE and constraints for use in preventing duplicate and stray nodes and relationships #17