neo4j / neo4j-java-driver

Neo4j Bolt driver for Java
Apache License 2.0

LoadBalancer failover is broken #1046

Closed wdroste closed 2 years ago

wdroste commented 3 years ago

Our customer is experiencing a problem with our App and Neo4j 4.2.x Enterprise, using a 3-node Neo4j Causal Cluster, all on RHEL 7.

When the Neo4j leader node dies and a new leader is elected, the app UI hangs and the following message appears in the log over and over for a while, then the exceptions start:

2021-10-20 13:38:59.196 WARN 865 --- [ Neo4jDriverIO-2-9] [-] org.neo4j.driver.LoadBalancer : Failed to obtain a connection towards address 10.x.x.x:7687, will try other addresses if available. Complete failure is reported separately from this entry.

Java Driver: 4.3.2-4

10.x.x.x is the failed node in this case. I have reproduced this problem today with Neo4j 4.2.11 and Neo4j 4.3.6. I am unable to reproduce it with Java Driver 4.1.1 and Neo4j 4.2.11.

This leads me to believe that driver failover broke somewhere between 4.1.1 and 4.3.2. In the problem case, the UI never recovers. However, if you restart the app, even with the failed Neo4j node still down, the app comes up and works. In the working case, the app UI recovers after a minute or so.

injectives commented 2 years ago

Thanks for reporting this.

We have tests that cover similar cases and it would be interesting to see if those can be improved.

However, I have not been able to reproduce it so far. Here is what I did:

  1. Configured a 3-node cluster running as containers (using the neo4j:4.2-enterprise image).
  2. Created a small app that periodically creates a node (using the latest 4.4.0-beta01 driver):
    // Open a session, run one write transaction, and consume its result.
    try ( Session session = driver.session( sessionConfig ) )
    {
        session.writeTransaction( tx ->
        {
            var result = tx.run( "CREATE (n:Testing) RETURN n" );
            result.consume();
            return null;
        } );
    }
  3. Stopped the leader container between node creation attempts.
  4. The subsequent attempt printed the warning mentioned above, but the driver performed rediscovery and used the new leader afterwards.
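
To make the expected behavior in step 4 concrete, here is a toy model of routing-table failover: try each known writer, forget an address that fails, and refresh the table once if nothing is left. This is only a sketch under stated assumptions. `FailoverSketch`, `acquireWriter`, and the `discovery` supplier are hypothetical names invented for illustration, and the code is not the driver's actual `LoadBalancer` implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Toy model only: all names here are hypothetical, not driver API.
final class FailoverSketch
{
    // Stands in for rediscovery: returns the current list of writer addresses.
    private final Supplier<List<String>> discovery;
    private final List<String> writers = new ArrayList<>();

    FailoverSketch( Supplier<List<String>> discovery )
    {
        this.discovery = discovery;
    }

    // Returns the first reachable writer, or empty if none is reachable
    // even after one refresh of the table.
    Optional<String> acquireWriter( Predicate<String> isReachable )
    {
        for ( int attempt = 0; attempt < 2; attempt++ )
        {
            if ( writers.isEmpty() )
            {
                writers.addAll( discovery.get() ); // refresh the table
            }
            // Iterate over a copy so failed addresses can be removed safely.
            for ( String address : new ArrayList<>( writers ) )
            {
                if ( isReachable.test( address ) )
                {
                    return Optional.of( address );
                }
                writers.remove( address ); // forget the failed address
            }
        }
        return Optional.empty();
    }
}
```

In this model a dead leader costs one failed attempt, after which the refreshed table supplies the new leader. That corresponds to a single warning followed by recovery, rather than the same warning repeating indefinitely.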

Would you be able to do the following please?

  1. Check for differences in setup.
  2. Try the latest 4.4.0-beta01 driver.
  3. Let us know what URL is provided to the driver.
  4. Provide an API usage sample from your app where the issue occurs.

wdroste commented 2 years ago
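
The URL matters because, to my knowledge, 4.x drivers only use a routing table for the `neo4j://` family of schemes, while `bolt://` opens a direct connection to a single server and so cannot follow a leader change. A minimal sketch for checking this before the URL reaches the driver follows. The class and method names are hypothetical, and the scheme list is an assumption based on the 4.x scheme names.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the driver.
final class DriverUriCheck
{
    // Assumed routing schemes for 4.x drivers: plain, TLS, and TLS with self-signed certificates.
    private static final Set<String> ROUTING_SCHEMES = Set.of( "neo4j", "neo4j+s", "neo4j+ssc" );

    // True if the URL's scheme is one of the assumed routing schemes.
    static boolean usesRouting( String url )
    {
        String scheme = URI.create( url ).getScheme();
        return scheme != null && ROUTING_SCHEMES.contains( scheme.toLowerCase( Locale.ROOT ) );
    }
}
```

For example, `DriverUriCheck.usesRouting( "bolt://10.0.0.1:7687" )` returns `false`, which would be worth ruling out in a clustered setup.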

Let me see what I can do. We don't usually take beta drivers, so what's the ETA on the release?

injectives commented 2 years ago

There is a GA version already. Just use 4.4.1.