neo4j / neo4j-go-driver

Neo4j Bolt Driver for Go
Apache License 2.0

"Unable to retrieve routing table" error in single instance mode #532

Closed skaurus closed 8 months ago

skaurus commented 11 months ago

I have a modest database with just 226 nodes and 322 relationships. Most of the time my queries work fine, but sometimes a query hangs, with the only difference being the parameter values, and I see the following error in my log:

TransactionExecutionLimit: timeout (exceeded max retry time: 10s) after 5 attempts, last error: ConnectivityError: Unable to retrieve routing table from localhost:7687: context deadline exceeded

On driver version 5.10.0 it looked like this (I'm including it so that this version of the message is searchable as well):

error error could not acquire router lock in time when invalidating reader occurred after previous error could not acquire server lock in time when cleaning up pool occurred after previous error could not acquire router lock in time when getting routing table

Usually this happens rarely and sporadically, but recently I hit a query (or rather its parameters) that triggers it 100% of the time.

As far as I could find by Googling, the term "routing table" refers only to cluster configurations, so I'm confused.

Relevant settings in my app config:

connect_timeout = "2000ms"
max_connection_lifetime = "5m"
max_connection_pool_size = 100
connection_acquisition_timeout = "2000ms"
max_transaction_retry_time = "10s"
query_timeout = "3s"
transaction_timeout = "5s"

And settings I've changed in the neo4j config:

server.memory.heap.initial_size=16g
server.memory.heap.max_size=16g
server.memory.pagecache.size=16g

Other than Neo4j, that server has PostgreSQL and Redis, both similarly memory constrained, and as you might imagine from my DB size, the app that uses all of this has nearly 0 load. It's an early development prototype.

I don't think there is a point in publishing my query without my full database, but if it is required, let me know, I'll try to do both.

domgreen commented 11 months ago

Also seeing similar behavior using Neo4j Aura.

fbiville commented 11 months ago

The team will look into the issue ASAP. For clarification, the driver will always fetch a routing table when the neo4j:// scheme is used, even against single instances (since Neo4j 4.0).
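To illustrate the point about URI schemes, here is a minimal stdlib-only sketch. It assumes the usual driver convention that the `neo4j`, `neo4j+s` and `neo4j+ssc` schemes request routing, while `bolt`, `bolt+s` and `bolt+ssc` connect directly to one server. The helper name `usesRouting` is hypothetical and is not part of the driver.

```go
package main

import (
	"fmt"
	"net/url"
)

// usesRouting is a hypothetical helper (not a driver API). It reports whether
// a connection URI uses a routing scheme, in which case the driver is expected
// to fetch a routing table, even against a single instance.
func usesRouting(uri string) (bool, error) {
	u, err := url.Parse(uri)
	if err != nil {
		return false, err
	}
	switch u.Scheme {
	case "neo4j", "neo4j+s", "neo4j+ssc":
		return true, nil // routing scheme: a routing table is requested
	case "bolt", "bolt+s", "bolt+ssc":
		return false, nil // direct scheme: no routing table involved
	default:
		return false, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
}

func main() {
	for _, uri := range []string{"neo4j://localhost:7687", "bolt://localhost:7687"} {
		routed, err := usesRouting(uri)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s routing=%v\n", uri, routed)
	}
}
```

Against a single instance, connecting with a `bolt://` URI may be one way to take routing table fetches out of the picture while debugging, though it would not address whatever makes the underlying call slow.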

domgreen commented 11 months ago

Thanks @fbiville ... to give a bit of an update.

It turns out that, in my case at least, the request deadline (supplied by the caller over HTTP) had already been exceeded by the time neo4j.ExecuteQuery was called and the session was being created. Changing my code so that no timeout was present removed this issue for me... I then went and fixed the root cause of the high latency and I'm back to working okay.

So my gut feeling is that the error message might be confusing for the end user?

Thought I would add some context if it helps.

robsdedude commented 8 months ago

@skaurus did you manage to resolve your issue in the meantime?

If not, it'd be really helpful if you could provide driver debug and bolt logs so that we can see what the driver is doing before the failure. To achieve this, simply configure a ConsoleLogger when creating the driver and a ConsoleBoltLogger for every session you create (or with ExecuteQuery, if you're using that API).
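A rough sketch of that logging setup, assuming the v5 driver module (`github.com/neo4j/neo4j-go-driver/v5`) and a local instance. The URI, credentials and query are placeholders, and exact signatures may differ between driver versions, so treat it as a starting point and check it against the version you use.

```go
package main

import (
	"context"
	"log"

	"github.com/neo4j/neo4j-go-driver/v5/neo4j"
)

func main() {
	ctx := context.Background()

	// Driver-level debug logging goes to the console.
	driver, err := neo4j.NewDriverWithContext(
		"neo4j://localhost:7687",                 // placeholder URI
		neo4j.BasicAuth("neo4j", "password", ""), // placeholder credentials
		func(c *neo4j.Config) {
			c.Log = neo4j.ConsoleLogger(neo4j.DEBUG)
		},
	)
	if err != nil {
		log.Fatal(err)
	}
	defer driver.Close(ctx)

	// Bolt message logging is configured per session.
	session := driver.NewSession(ctx, neo4j.SessionConfig{
		BoltLogger: neo4j.ConsoleBoltLogger(),
	})
	defer session.Close(ctx)

	result, err := session.Run(ctx, "RETURN 1 AS n", nil) // placeholder query
	if err != nil {
		log.Fatal(err)
	}
	if _, err := result.Consume(ctx); err != nil {
		log.Fatal(err)
	}
}
```

With the ExecuteQuery API, the Bolt logger would instead be passed as a per-call option; in recent v5 releases that option should be `neo4j.ExecuteQueryWithBoltLogger`.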

skaurus commented 8 months ago

Hey @robsdedude! I stopped working on this project, so I don't know... I think I remedied it back then by increasing timeouts and rewriting my queries a bit, don't remember if that fixed the problem completely or just made it much less likely. Probably the issue can be closed then?

robsdedude commented 8 months ago

Yep, I'll close it. Thanks for reaching out @skaurus