neo4j-contrib / neo4j-apoc-procedures

Awesome Procedures On Cypher for Neo4j - codenamed "apoc"
https://neo4j.com/labs/apoc
Apache License 2.0

Trigger Replication Problem #2193

Closed conker84 closed 3 years ago

conker84 commented 3 years ago

Pasted from an internal Trello card

Neo4j Enterprise 4.2.9 APOC 4.2.0.2

A customer reported, and I have since confirmed in testing, what seems to be a failure to replicate triggers created via apoc.trigger.add(). I’m aware of a bug in APOC that throws a "Server ... no longer accepts writes" error unless you connect directly via bolt:// to the system db leader and add the trigger there (related cards and GH issue attached). But now that trigger doesn’t appear anywhere else on the cluster, apart from the one core I created it on. Is Raft somehow being selective about which transactions it replicates to members? If I repeat the exercise with a simple create/merge of a node (also having connected directly via bolt:// to a user db), the merged node appears on all cores straightaway. I don’t think this is down to replication of system db updates in general, since other updates to system are replicated just fine.

So the question is: what could be preventing that trigger from appearing on cores other than the one it was created on directly (the system db leader) via bolt://?

Interestingly, the trigger appears just fine on all cores on a 3.5.29 cluster, which adds to my suspicion that system db updates in 4.x are replicated differently from how graph.db updates were handled in 3.5.x.

Attaching logs and relevant files from both test clusters (4.2.9 and 3.5.29). Please let me know if any additional details are required.

Repro steps:

1- Connect via neo4j:// to a 4.x cluster (any 4.x version). Beforehand, make sure the appropriate APOC plugin is in the /plugins directory and whitelist apoc in neo4j.conf as follows:

apoc.trigger.enabled=true
dbms.security.procedures.whitelist=apoc.*
dbms.security.procedures.unrestricted=apoc.*

2- run CALL apoc.trigger.add("trigger1", "MERGE (p:Person {name: 'invalid'})",{phase:'before'})

Unless one happens to be connected to the instance that is the system db leader at the time, the following error is thrown:

Server at mydomainl:7687 no longer accepts writes

This is a known bug in APOC, reported on the attached cards/GH issue.

3- Run CALL dbms.cluster.overview() and identify the leader for the system db. Then connect directly to that instance via bolt:// (either via Browser or cypher-shell).

4- Run CALL apoc.trigger.add("trigger1", "MERGE (p:Person {name: 'invalid'})",{phase:'before'}). This succeeds, and trigger1 then shows up in the output of CALL apoc.trigger.list().

5- Connect via bolt:// to any other core that is NOT the leader for the system db and execute CALL apoc.trigger.list(). Result: trigger1 is NOT listed there, nor on any other cluster member. The trigger appears just fine on a 3.5.x cluster, but not on 4.x.
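
For anyone scripting steps 3 and 5, here is a rough Python sketch of the two checks. It only works on data you have already fetched, so no driver is needed. It assumes each row from dbms.cluster.overview() on 4.x is a dict with an "addresses" list and a "databases" map of database name to role; that row shape, the function names and the host names are assumptions for illustration, not taken from the attached logs.

```python
from typing import Dict, Iterable, List, Optional, Set


def find_system_leader(overview_rows: Iterable[dict]) -> Optional[str]:
    """Return the bolt:// address of the member whose role for 'system' is LEADER.

    Assumes each row looks like:
    {"addresses": ["bolt://host:7687", ...], "databases": {"system": "LEADER", ...}}
    """
    for row in overview_rows:
        if row.get("databases", {}).get("system") == "LEADER":
            for address in row.get("addresses", []):
                if address.startswith("bolt://"):
                    return address
    return None


def missing_triggers(
    leader_triggers: Set[str], per_core: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Map each core to the trigger names the leader has but that core lacks."""
    report = {}
    for core, names in per_core.items():
        missing = sorted(leader_triggers - names)
        if missing:
            report[core] = missing
    return report


if __name__ == "__main__":
    # Hypothetical sample data, standing in for real procedure output.
    rows = [
        {"addresses": ["bolt://core1:7687"], "databases": {"neo4j": "LEADER", "system": "FOLLOWER"}},
        {"addresses": ["bolt://core2:7687"], "databases": {"neo4j": "FOLLOWER", "system": "LEADER"}},
    ]
    print(find_system_leader(rows))
    print(missing_triggers({"trigger1"}, {"core1": set(), "core2": {"trigger1"}}))
```

The trigger names per core would come from running CALL apoc.trigger.list() against each core over bolt://, as in steps 4 and 5. In the failing 4.x case described here, every core other than the system db leader would show up in the report.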

Thanks

conker84 commented 3 years ago

there are two problems here: