neo4j / neo4j-go-driver

Neo4j Bolt Driver for Go
Apache License 2.0

Could not acquire server lock in time when using driver #419

Closed rlch closed 1 year ago

rlch commented 1 year ago

Hey guys,

We're about to deploy Neo4j to production in a clustered k8s environment over the next few weeks. We're using Neo4j 5.2 with the latest Go driver.

After doing some basic stress-testing (~150 requests/min on a pod with 1.5 vCPUs, 6GB memory) we've found that the server completely buckles (it won't take any more requests) when handling concurrent requests. To recover, I have to restart the pod. I've verified that sessions are being closed, access modes are set properly, and that no transactions are active when the server is in the buckled state. Weirdly, the browser works fine, which makes me think this is an issue with the driver.

The resulting error we get from the go-driver is:

could not acquire server lock in time when cleaning up pool after TransactionExecutionLimit: Timeout after 5 attempts, last error: could not acquire server lock in time when computing server penalties

Any help is greatly appreciated, and I'm very happy to jump on a call or whatever works best. We're excited to make Neo4j a big part of our infrastructure going forward if we can resolve these issues!

For more information, a colleague created a forum post here: https://community.neo4j.com/t5/neo4j-graph-platform/could-not-acquire-server-lock-in-time-when-cleaning-up-pool/td-p/63323

Cheers!

Neo4j Version: 5.2 Enterprise
Neo4j Mode: Single pod (clustered) via Helm charts
Driver version: Go driver 5.3.0

rlch commented 1 year ago

@fbiville, btw - apologies for my delayed response on the generics issues. All my bandwidth has been dedicated to solving this issue, unfortunately.

fbiville commented 1 year ago

Thanks for the detailed report, I'll follow up in the community post first.

fbiville commented 1 year ago

@rlch I have not heard back from the community post. Do you still need help with this?

rlch commented 1 year ago

@fbiville Apologies - we've been on leave. I've responded to the community post, thanks for following up :)

fbiville commented 1 year ago

We communicated via email and it seems the issue was due to over-allocation of connections, which resulted in the server being overloaded. I think it's worth opening an issue against github.com/neo4j/neo4j, and I'll make sure to forward it to my Bolt server teammates. Feel free to reopen this issue if a similar problem happens again.
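
For anyone landing here with the same symptom, one way to avoid over-allocating connections is to cap how many requests the application runs against the database at once. The thread does not say which fix was actually applied, so the following is only a minimal sketch under assumptions: it uses the Go standard library alone, `runBounded` and `demo` are hypothetical names, and the sleeping task is a stand-in for a real query made through the driver.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runBounded starts every task in its own goroutine but lets at most
// limit of them run at the same time (limit must be at least 1).
// It returns the peak number of tasks that were in flight together.
func runBounded(limit int, tasks []func()) int {
	sem := make(chan struct{}, limit) // one slot per allowed concurrent task
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		inFlight int
		peak     int
	)
	for _, task := range tasks {
		sem <- struct{}{} // blocks while limit tasks are already running
		wg.Add(1)
		go func(task func()) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot once the task has ended

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			task()

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(task)
	}
	wg.Wait()
	return peak
}

// demo runs n simulated requests under the given cap and reports the peak
// concurrency and how many requests completed. The sleep is an assumed
// placeholder for a database round trip, not a real driver call.
func demo(limit, n int) (peak, done int) {
	var mu sync.Mutex
	tasks := make([]func(), n)
	for i := range tasks {
		tasks[i] = func() {
			time.Sleep(2 * time.Millisecond)
			mu.Lock()
			done++
			mu.Unlock()
		}
	}
	peak = runBounded(limit, tasks)
	return peak, done
}

func main() {
	peak, done := demo(3, 20)
	fmt.Printf("completed=%d peak=%d\n", done, peak)
}
```

In a real service each task would open a session, run its query, and close the session, so the cap also bounds how many pooled connections are borrowed at a time. The driver's own `Config.MaxConnectionPoolSize` setting is likely the more direct knob, and an application-side cap like this one would sit on top of it.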