neo4j-contrib / neomodel

An Object Graph Mapper (OGM) for the Neo4j graph database.
https://neomodel.readthedocs.io
MIT License

neo4j.exceptions.ClientError: No write operations are allowed directly on this database. Writes must pass through the leader. The role of this server is: FOLLOWER #335

Open robertlagrant opened 6 years ago

robertlagrant commented 6 years ago

A funny one:

We're getting intermittent errors as per the issue title - i.e. it's trying to write to a follower node, and presumably bolt+routing isn't sending the transaction to the leader. Am I missing something? Is it that if the first interaction with the database is a read, that it opens the transaction on a follower node? Can I force it to the leader for every transaction?

robertlagrant commented 5 years ago

We are still getting this issue, even when forcing a write transaction.

I've created a repro case: https://github.com/robertlagrant/neo4j-cluster-failure. Please test.

aanastasiou commented 5 years ago

@robertlagrant Would it be possible to share a little bit more information on your cluster configuration? Is that supposed to be 3 CORE servers? There are some conditions where what you describe might be the intended behaviour at least as far as RAFT is concerned (i.e. see this). I am trying to see how much of this can be dealt with at the level of neomodel and how much of this is external to it.

mvanderkroon commented 5 years ago

Please see https://neo4j.com/docs/ogm-manual/current/reference/ (section 3.14.1.6. Retry mechanisms).

For critical applications, these failures have to be anticipated, and also managed at the architecture or application level. Even if the driver handles some low level retries, it is not always enough in case of instability, as an application may involve complex business logic, and require coarse grained units of work.

In other words, the driver does not deal with higher-level failures (such as cluster disconnects). In our use cases we have worked around this by adding custom retry logic to our business logic. See the very basic example below (adding jitter and exponential backoff is obviously highly recommended).

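import time

_MAX_RETRY_SECONDS = 30  # example value, tune for your application

# `session` and `do_write` come from the application; do_write(tx) takes the
# transaction object that the driver passes in.
sts = time.time()
last_exception = None  # kept outside the loop so it is not reset on every pass
while True:
    cts = time.time()

    # Give up once the retry window has elapsed and re-raise the last failure.
    if last_exception is not None and cts - sts > _MAX_RETRY_SECONDS:
        raise last_exception

    try:
        # Pass the function itself; calling do_write() here would run it
        # outside the managed transaction.
        session.write_transaction(do_write)
        break
    except Exception as e:
        last_exception = e
        time.sleep(1)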
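
The jitter and exponential backoff recommended above could look roughly like the sketch below. It is only an illustration: `retry_with_backoff` and all of its parameters are hypothetical names, not part of neomodel or the Neo4j driver, and in real code the `except` clause should be narrowed to the driver's transient errors.

```python
import random
import time


def retry_with_backoff(operation, max_seconds=30.0, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or the time budget runs out.

    The delay doubles after each failure (capped at max_delay) and is scaled
    by a random factor between 0.5 and 1.0 so clients do not retry in lockstep.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:  # assumption: narrow this to transient driver errors
            attempt += 1
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay *= random.uniform(0.5, 1.0)
            # Give up if the next sleep would take us past the deadline.
            if clock() - start + delay > max_seconds:
                raise
            sleep(delay)
```

With the names from the example above, a call might read `retry_with_backoff(lambda: session.write_transaction(do_write))`.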

aanastasiou commented 5 years ago

@mvanderkroon Thank you very much, sounds like a modification is required at this point (?).

mvanderkroon commented 5 years ago

@aanastasiou I believe so. I have forked the repo, made the necessary changes and would be quite happy to issue a pull request. Should I point it to your master branch?

aanastasiou commented 5 years ago

@mvanderkroon Thank you very much and I do not see why not. It should be sent as a pull request to the main neomodel repo. All the best.

robertlagrant commented 5 years ago

@aanastasiou sure - it's a 3 core server cluster. There are also 2 read replicas, but they don't really feature in this situation as far as I'm aware.

aanastasiou commented 5 years ago

@robertlagrant Thank you for your response, I think that the discussion with @mvanderkroon on the pull request was very informative about the specifics.

kant111 commented 5 years ago

Why can't a follower accept writes?

robertlagrant commented 4 years ago

@kant111 because that's not how Neo4j works.

ayoubelmimouni commented 3 years ago

A connection URL of bolt+routing:// makes the session cluster-aware, whereas bolt:// knows nothing about the other members of the cluster. However, the bolt+routing:// connection URL is only half the story. The other half is the use of session.readTransaction() and session.writeTransaction(), each of which lets you pass the Cypher to be executed.

If you connect with bolt+routing:// and send a Cypher statement through session.writeTransaction(), then regardless of which member you are connected to, the statement is routed to the LEADER, because the transaction is declared as a write.

It is important to note that Neo4j does not parse the Cypher statement to detect whether it is a read or a write. So you could issue session.readTransaction("create (n:Person {id:1})"), and because it is declared as a read transaction it would be routed to a FOLLOWER, where it would fail, since only the LEADER can perform writes.
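
The routing rule described above can be sketched in Python. This is a minimal illustration, assuming a session object from the Neo4j Python driver that exposes `write_transaction` (the 1.7 and 4.x name; 5.x calls it `execute_write`). The helper names `create_person` and `_create_person` are hypothetical.

```python
CREATE_PERSON = "CREATE (n:Person {id: $id}) RETURN n.id AS id"


def _create_person(tx, person_id):
    # Runs inside the transaction that the driver opened for this unit of work.
    result = tx.run(CREATE_PERSON, id=person_id)
    return result.single()["id"]


def create_person(session, person_id):
    # write_transaction declares the unit of work as a write, so with a
    # routing URI the driver sends it to the LEADER. Passing the same function
    # to read_transaction would route it to a member that cannot write, and
    # the CREATE would fail with the error in the issue title.
    return session.write_transaction(_create_person, person_id)
```

The access mode comes from which session method is called, not from the Cypher text, so every code path that writes has to go through the write variant.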

gwvandesteeg commented 1 year ago

Fun fact (tested on Neo4j 4.0.7):

Adding a trigger can only be done on the node in the cluster that is the LEADER of both the DB you are adding the trigger to AND the system database (might need the neo4j DB as well, wasn't sure, but we don't use it).

The example below shows me trying to add a trigger whilst connected to the node neo4j-core-2 via the bolt connector:

neo4j@nextvoice> call dbms.cluster.overview();
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                     | addresses                                                                                                                | databases                                                      | groups |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "53f95bdf-0c86-4826-8244-4ad4f7963592" | ["bolt://neo4j-core-2.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-2.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "LEADER", neo4j: "FOLLOWER", system: "FOLLOWER"}   | []     |
| "6b74a7fa-626d-4994-af32-1432b9e8b0c4" | ["bolt://neo4j-core-0.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-0.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "FOLLOWER", neo4j: "LEADER", system: "LEADER"}     | []     |
| "775b45fe-3ae3-466d-9ad2-7b8e5ae82e0b" | ["bolt://neo4j-core-1.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-1.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "FOLLOWER", neo4j: "FOLLOWER", system: "FOLLOWER"} | []     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

3 rows available after 6 ms, consumed after another 1 ms
neo4j@nextvoice> CALL apoc.trigger.add(
                 "assertExtensionNumberValidNumericalString",
                 "WITH '^([0-9]{2,5})$' AS extNumStrRegex
                 MATCH (e:Extension)
                 CALL apoc.util.validate((NOT e.number =~ extNumStrRegex), '%s not a valid extension number', [e.number])
                 RETURN NULL",
                 { phase: 'before' }
                 );
No write operations are allowed directly on this database. Writes must pass through the leader. The role of this server is: FOLLOWER

After killing a bunch of nodes and waiting for them to come back in the desired state, I connected to neo4j-core-0 via the bolt connector and tried again:

neo4j@nextvoice> call dbms.cluster.overview();
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                     | addresses                                                                                                                | databases                                                      | groups |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "53f95bdf-0c86-4826-8244-4ad4f7963592" | ["bolt://neo4j-core-2.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-2.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "FOLLOWER", neo4j: "FOLLOWER", system: "FOLLOWER"} | []     |
| "6b74a7fa-626d-4994-af32-1432b9e8b0c4" | ["bolt://neo4j-core-0.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-0.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "LEADER", neo4j: "LEADER", system: "LEADER"}       | []     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

2 rows available after 0 ms, consumed after another 1 ms
neo4j@nextvoice> CALL apoc.trigger.add(
                 "assertExtensionNumberValidNumericalString",
                 "WITH '^([0-9]{2,5})$' AS extNumStrRegex
                 MATCH (e:Extension)
                 CALL apoc.util.validate((NOT e.number =~ extNumStrRegex), '%s not a valid extension number', [e.number])
                 RETURN NULL",
                 { phase: 'before' }
                 );
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| name                                        | query                                                                                                                                                                              | selector          | params | installed | paused |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "assertExtensionNumberValidNumericalString" | "WITH '^([0-9]{2,5})$' AS extNumStrRegex
MATCH (e:Extension)
CALL apoc.util.validate((NOT e.number =~ extNumStrRegex), '%s not a valid extension number', [e.number])
RETURN NULL" | {phase: "before"} | {}     | TRUE      | FALSE  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

1 row available after 10 ms, consumed after another 30 ms
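
The check done by eye in the two overviews above could be automated along the following lines. This is a sketch only: `members_leading_all` is a hypothetical helper, and it assumes each row of `dbms.cluster.overview()` has already been converted to a plain dict with `addresses` and `databases` keys. The sample rows copy the roles from the two outputs above, trimmed to the bolt address.

```python
def members_leading_all(overview_rows, database_names):
    """Return the first address of every member that leads all named databases."""
    leaders = []
    for row in overview_rows:
        roles = row["databases"]
        if all(roles.get(name) == "LEADER" for name in database_names):
            leaders.append(row["addresses"][0])
    return leaders


CORE_0 = "bolt://neo4j-core-0.neo4j.default.svc.cluster.local:7687"
CORE_1 = "bolt://neo4j-core-1.neo4j.default.svc.cluster.local:7687"
CORE_2 = "bolt://neo4j-core-2.neo4j.default.svc.cluster.local:7687"

# First overview: leadership of nextvoice and system is split across members.
before = [
    {"addresses": [CORE_2],
     "databases": {"nextvoice": "LEADER", "neo4j": "FOLLOWER", "system": "FOLLOWER"}},
    {"addresses": [CORE_0],
     "databases": {"nextvoice": "FOLLOWER", "neo4j": "LEADER", "system": "LEADER"}},
    {"addresses": [CORE_1],
     "databases": {"nextvoice": "FOLLOWER", "neo4j": "FOLLOWER", "system": "FOLLOWER"}},
]

# Second overview: neo4j-core-0 leads every database.
after = [
    {"addresses": [CORE_2],
     "databases": {"nextvoice": "FOLLOWER", "neo4j": "FOLLOWER", "system": "FOLLOWER"}},
    {"addresses": [CORE_0],
     "databases": {"nextvoice": "LEADER", "neo4j": "LEADER", "system": "LEADER"}},
]

print(members_leading_all(before, ["nextvoice", "system"]))  # []
print(members_leading_all(after, ["nextvoice", "system"]))   # only CORE_0
```

An empty result corresponds to the first attempt, where no single member could accept the `apoc.trigger.add` call.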