DNS resolution fails when cluster members are offline

ekampp commented 3 years ago

Hi team.

First off, thanks for a super nice tool, and sorry if we're just not understanding it correctly. We recently migrated to JRuby to leverage the Java driver for a variety of reasons.

We found that neo4j-java-driver does not handle DNS resolution correctly when the hostname resolves to multiple addresses.

Neo4j version: Enterprise 4.2.2
Neo4j mode: Causal cluster with 5 core members and 0 read replicas
Driver version: neo4j-java-driver-4.2.0.beta.0-java
Operating system: Ubuntu 20.04.2 (official neo4j AMI)

Steps to reproduce

The following connection URI was used:

NEO4J_URL=neo4j://<user>:<password>@neo4j.infra.prod.internal:7687

The DNS query for neo4j.infra.prod.internal returns 3 addresses:

neo4j.infra.prod.internal has address 10.26.219.30
neo4j.infra.prod.internal has address 10.26.221.188
neo4j.infra.prod.internal has address 10.26.204.91

When all aforementioned members are available, everything is fine. The driver connects to one of them successfully and gets the neo4j routing table:

INFO: Closing connection pool towards neo4j.infra.prod.internal(10.26.204.91):7687, it has no active connections and is not in the routing table registry.

However, if one of those members is down and the gethostbyname() library call returns its address at the top of the resulting list, then the neo4j-java-driver will fail to bootstrap:

Caused by: org.neo4j.driver.exceptions.ServiceUnavailableException: Unable to connect to neo4j.infra.prod.internal(10.26.221.188):7687, ensure the database is running and that there is a working network connection to it.

Expected behavior

When a cluster member goes offline, the driver should roll over to the next online member among the addresses returned by DNS.

Actual behavior

The driver does not try to connect to the other addresses returned in the list.
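
To illustrate the rollover we expected, here is a minimal sketch in plain Java. It is not the driver's code and makes no claim about how the driver resolves or connects internally. The class name AddressFailover and the helpers firstReachable, tcpProbe and resolveAll are hypothetical names made up for this example; only the standard library is used.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical illustration only: not part of neo4j-java-driver.
final class AddressFailover {

    /** Returns the first candidate the probe accepts, or empty if none is reachable. */
    public static Optional<InetSocketAddress> firstReachable(
            List<InetSocketAddress> candidates, Predicate<InetSocketAddress> probe) {
        for (InetSocketAddress candidate : candidates) {
            if (probe.test(candidate)) {
                return Optional.of(candidate);
            }
            // Unreachable: move on to the next address instead of failing outright.
        }
        return Optional.empty();
    }

    /** Probe that attempts a plain TCP connect, giving up after timeoutMillis. */
    public static boolean tcpProbe(InetSocketAddress address, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(address, timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Expands a hostname into one socket address per address returned by the resolver. */
    public static List<InetSocketAddress> resolveAll(String host, int port)
            throws UnknownHostException {
        List<InetSocketAddress> result = new ArrayList<>();
        for (InetAddress ip : InetAddress.getAllByName(host)) {
            result.add(new InetSocketAddress(ip, port));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // The host comes from the command line; 7687 is the Bolt port from the URI above.
        String host = args.length > 0 ? args[0] : "localhost";
        List<InetSocketAddress> candidates = resolveAll(host, 7687);
        Optional<InetSocketAddress> chosen =
                firstReachable(candidates, a -> tcpProbe(a, 2000));
        System.out.println(chosen.map(Object::toString).orElse("no reachable member"));
    }
}
```

With this approach, bootstrapping would fail only when every resolved address is unreachable, rather than when the first one in the list happens to be down.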

Logs

logs.txt

injectives commented 3 years ago

Hi @ekampp,

Thank you for reporting this issue. It has actually been discovered internally too and we are going to fix it. An update will be posted here later on.

For the time being, you might want to try the latest 4.2.1 version. It fixes a different issue, but may behave more reliably in this scenario, especially in combination with the retry mechanism in the transaction functions.

ekampp commented 3 years ago

@injectives, thank you for the quick and positive feedback.

To be crystal clear, when you said "For the time being, you might want to try the latest 4.2.1 version," did you mean the Java driver or Neo4j itself?

injectives commented 3 years ago

I meant the Java Driver :)

injectives commented 3 years ago

@ekampp, we have just released a new Neo4j Java Driver version 4.2.4 that fixes this issue.

Please let us know if you experience this problem again with the new version.

ekampp commented 3 years ago

@injectives, I will let you know if anything pops up again. Thanks for your help!
