neo4j / neo4j-java-driver

Neo4j Bolt driver for Java
Apache License 2.0
329 stars 155 forks source link

Application hang after Neo4j connection did not respond #1504

Open pattanmzy opened 11 months ago

pattanmzy commented 11 months ago

Hi,

I have a Spring Boot application with Neo4j AuraDB. The application was working fine initially. However, occasionally we observed that queries to the Neo4j AuraDB hang and did not respond after a long period of time, e.g. 10 - 20 minutes. When this happened, we saw the following message in the logs.

[Neo4jDriverIO-2-1] - Closing connection pool towards p-4dc8adc8-7ed2-7.production-orch-0361.neo4j.io:7687, it has no active connections and is not in the routing table registry.

We generated a few thread dumps every 30s. We observed that those threads still hang and the number of threads waiting for response from neo4j connection kept piling up. There are no other related errors in the log.

Connection uri: neo4j+s://.databases.neo4j.io Spring Boot starter version: 3.0.8 Spring Data Neo4j version: 7.0.7 Neo4j java driver version: 5.9.0 Neo4j version: 5 Neo4j Aura Professional on AWS Operating System: Alpine Linux v3.18 on Kubernetes EKS Java version: openjdk version "17.0.9"

Expected behavior

Database query should execute successfully

Actual behavior

Connection to AuraDB did not respond causing thread to hang.

Thread dumps ThreadDumps.zip

Could you please advise what could be the root cause of this issue.

Do let me know if you need more info.

Thank you.

injectives commented 10 months ago

@pattanmzy, thanks for your message. Could you let us know if you are using reactive API by any chance please?

pattanmzy commented 10 months ago

Thanks @injectives for your reply.

The application does not use reactive API but it has dependency on Spring Boot Actuator for monitoring purpose. From the thread dumps, I can see that the Actuator HealthEndpoint uses the ReactiveHealthContributors to perform health check. There were many similar threads hanging waiting for response from Actuator health check. Is this the cause of the problem?

Extracted from thread dump _"http-nio-8080-exec-1" #52 daemon prio=5 os_prio=0 cpu=2079.61ms elapsed=84690.92s tid=0x00007f8bc5150fd0 nid=0x3a waiting on condition [0x00007f8bbe453000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@17.0.9/Native Method)

injectives commented 10 months ago

Is this the cause of the problem?

If I remember correctly, the healthcheck uses the driver session run method. If subscription is cancelled (for whatever reason), it might be a cause of such issue, but this may also be something else.

Neo4j java driver version: 5.9.0

I suggest trying the latest driver version 5.14.0 as 5.11.0 introduced the cancellation support, see the release notes.

pattanmzy commented 10 months ago

Thanks for the feedback. I have updated the application to use the latest driver version. I will continue to monitor the application and will provide an update here if the problem still persist.

pattanmzy commented 10 months ago

Hi @injectives ,

The connection issue happened again a few days after I upgraded the Neo4j Java driver to the latest version, i.e. 5.14.0. From the attached thread dumps that were generated at 30s to 1 min intervals, we observed that there were application initiated queries to Neo4j AuraDB queries as well as Actuator health check threads hang and did not respond.

There were errors such as connection to the database terminated, connection read timed out, server is no longer available, failed to obtain connection towards WRITE server, etc. found in the application logs. Please find the details in the application and trace logs attached.

Spring Boot starter version: 3.0.12 Spring Data Neo4j version: 7.0.11 Neo4j java driver version: 5.14.0

Thread dumps ThreadDumps.zip

Logs application-log.txt trace-log.txt

Could you please look into this? Thank you.

injectives commented 10 months ago

Hello.

Sure. The thread dumps show the http threads are in WAITING state, while the driver threads are actually in RUNNABLE state. It also seems like the main issue is around begin transaction area.

The application-log.txt and trace-log.txt show mostly connectivity issues, which could be because of some server-side maintenance or some other factor. Although, I am a little concerned about this one: org.neo4j.driver.exceptions.ConnectionReadTimeoutException: Connection read timed out due to it taking longer than the server-supplied timeout value via configuration hint. - it basically means that the server did not respond within the timeout it suggested to the driver upon connection.

The driver debug logs might reveal more context on this issue. Would you be happy to enable and supply those please? The driver logging can be configured though the following config option.

As a side question, could you let me know if you set the target database explicitly in the driver session please? See the following config option for more details. If possible, it would be better to set this to narrow down the issue search area a little bit.

And one more general question please. Has your application been using Aura for a while before the issue started happening on a certain date or is it a new deployment? I am just trying to understand if this issue just started happening on a certain date, which could mean some server change.

Thank you.

pattanmzy commented 10 months ago

@injectives

The driver debug logs might reveal more context on this issue. Would you be happy to enable and supply those please?

I have set the logging level to debug. I will provide the logs when the issue happens again.

As a side question, could you let me know if you set the target database explicitly in the driver session please?

The target database was not set previously. It's now set.

Has your application been using Aura for a while before the issue started happening on a certain date or is it a new deployment?

It is a new deployment of less than 2 months. This issue started happening since the beginning.

pattanmzy commented 9 months ago

Hi @injectives ,

The issue happened again on December 26th. The thread dumps show similar issue as before. There's no error found in the application logs. However, there were warning messages on Actuator Health contributor taking long time to respond in the console log as below. There were also no application activity related log messages after this time. It appears that the issue started to happen around this time.

2023-12-26 11:20:34 WARN  [http-nio-8080-exec-2] - Health contributor org.springframework.boot.actuate.autoconfigure.health.HealthEndpointConfiguration$AdaptedReactiveHealthContributors$1 (neo4j) took 30988ms to respond
2023-12-26 11:20:34 WARN  [http-nio-8080-exec-3] - Health contributor org.springframework.boot.actuate.autoconfigure.health.HealthEndpointConfiguration$AdaptedReactiveHealthContributors$1 (neo4j) took 61001ms to respond

The driver debug logs are attached.

Thread dumps ThreadDumps_20231226.zip

Driver Debug Log neo4j.log.zip

Console log console.log

Appreciate if you could take a look at it. Thank you.

injectives commented 9 months ago

@pattanmzy, thanks a lot for providing the vital details. We will review soon and will get back to you.

shenhj2016 commented 6 months ago

After view the neo4j.log.zip,i have found the same issue in my application.The application is running in AWS. Log detail:{ "log": "Suppressed: org.neo4j.driver.exceptions.DiscoveryException: Failed to update routing table with server '58534a99.databases.neo4j.io(54.216.115.14):7687'.", "stream": "stdout", "time": "2024-02-21T08:51:21.223046235Z" } neo4j-driver version is :5.15.0, and it have been update to 5.18.0 which is the latest version.