neo4j / neo4j-python-driver

Neo4j Bolt driver for Python
https://neo4j.com/docs/api/python-driver/current/

ConnectionReset by peer #1073

Closed: ritiksoni00 closed this issue 2 months ago

ritiksoni00 commented 2 months ago

Bug Report

I keep getting a <CONNECTION> error: ConnectionResetError(104, 'Connection reset by peer').

My logs:

[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,463  [#0000]  _: <POOL> acquire routing connection, access_mode='READ', database='neo4j'
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,463  [#0000]  _: <POOL> routing aged?, database=neo4j
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,464  [#0000]  _: <ROUTING> purge check: last_updated_time=16422.846361411, ttl=10, perf_time=16809.259327619 => True
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,464  [#0000]  _: <POOL> dropping routing table for database=neo4j
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,464  [#0000]  _: <ROUTING> checking table freshness (readonly=True): table expired=True, has_server_for_mode=False, table routers={IPv4Address(('opop', 7687))} => False
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,464  [#0000]  _: <POOL> attempting to update routing table from IPv4Address(('opop', 7687))
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,464  [#0000]  _: <RESOLVE> in: opop:7687
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,468  [#0000]  _: <RESOLVE> dns resolver out: i[:7687
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,468  [#0000]  _: <POOL> _acquire router connection, database='neo4j', address=ResolvedIPv4Address(('ip', 7687))
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,469  [#C44A]  _: <POOL> picked existing connection bolt-209164
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,469  [#C44A]  _: <POOL> checked re_auth auth=None updated=False force=False
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,469  [#C44A]  _: <POOL> handing out existing connection
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,469  [#C44A]  C: ROUTE {'address': 'opop:7687'} () {'db': 'neo4j'}
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,469  [#C44A]  _: <CONNECTION> error: ConnectionResetError(104, 'Connection reset by peer')
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,469  Failed to read from defunct connection ResolvedIPv4Address(('ip', 7687)) (ResolvedIPv4Address(('ip', 7687)))
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,469  [#C44A]  C: <CLOSE>
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,469  [#0000]  _: <POOL> deactivating address ResolvedIPv4Address(('34.96.218.41', 7687))
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,469  [#0000]  _: <POOL> table={'neo4j': RoutingTable(database='neo4j' routers={IPv4Address(('opop', 7687))}, readers={}, writers={}, last_updated_time=16809.259412321, ttl=0)}
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,470  [#C44A]  _: <POOL> released bolt-209164
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,470  [#0000]  _: <POOL> failed to fetch routing info from ResolvedIPv4Address(('34.96.218.41', 7687))
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,470  [#0000]  _: <POOL> deactivating address IPv4Address(('opop', 7687))
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,470  [#0000]  _: <POOL> table={'neo4j': RoutingTable(database='neo4j' routers={}, readers={}, writers={}, last_updated_time=16809.259412321, ttl=0)}
[Thread 140365114973056] [Task 140365034482304] 2024-07-29 08:23:26,470  Unable to retrieve routing information

At first I was using normal sessions, but they kept throwing this error, so as a workaround I switched to ManagedTransaction because of its retry mechanism.

Now I have added max_connection_lifetime=5 to my driver config and the problem is mostly solved, but in the logs I see

found unhealthy connection removing it and acquiring new connection

and I'm not sure what's causing this.
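For reference, a minimal sketch of the setup described above (URI, credentials, and the Cypher query are placeholders; execute_read is the managed-transaction API that retries on transient failures):

from neo4j import GraphDatabase

# Placeholder URI/credentials; max_connection_lifetime=5 mirrors the
# workaround above (connections older than 5 seconds are discarded).
driver = GraphDatabase.driver(
    "neo4j+s://example.databases.neo4j.io",
    auth=("neo4j", "password"),
    max_connection_lifetime=5,
)

def get_names(tx):
    # Placeholder read query; replace with the real workload.
    result = tx.run("MATCH (n:Person) RETURN n.name AS name")
    return [record["name"] for record in result]

with driver.session(database="neo4j") as session:
    # Managed transaction: retried automatically on transient errors
    # such as connections being reset by the server.
    names = session.execute_read(get_names)

driver.close()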

My Environment

Python Version: 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]
Driver Version: 5.18.0
Server Version and Edition: Enterprise
Operating System: WSL

robsdedude commented 2 months ago

I see ResolvedIPv4Address(('ip', 7687)) in the logs. I hope you manually replaced this; if your DNS resolution really returns the string ip instead of an actual IP address, things seem very broken. The line above it (<RESOLVE> dns resolver out: i[:7687) also looks very weird.

Other than that, it's really hard to tell what kills the connection. Connection reset by peer can have many causes, all of which are outside the driver's control. The best thing you can do here is (as you've done) use an API with retries or write your own retry logic.
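If you do want to roll your own retry logic, a rough sketch could look like the following (the exception types, backoff policy, and database name are just one possible choice; the built-in execute_read/execute_write API already does this for you):

import time

from neo4j.exceptions import ServiceUnavailable, SessionExpired

def run_with_retries(driver, work, max_attempts=5, backoff=1.0):
    # Naive retry loop: re-run the whole unit of work whenever the driver
    # reports that the connection or session became unusable mid-flight.
    for attempt in range(1, max_attempts + 1):
        try:
            with driver.session(database="neo4j") as session:
                return work(session)
        except (ServiceUnavailable, SessionExpired):
            if attempt == max_attempts:
                raise
            time.sleep(backoff * attempt)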

max_connection_lifetime is another, harsher way to deal with it. It tells the driver to never reuse a connection that has been open for more than the specified number of seconds. Setting it to 5 is a bit much, though: it will make connections (almost) single-use, and the added overhead of TCP, potentially TLS, and the driver's protocol handshake for basically every query will be significant.

"found unhealthy connection removing it and acquiring new connection not sure whats causing this."

That's exactly the max_connection_lifetime: any connection older than this is considered unhealthy and closed.

Places you could look for what's causing the reset in the first place: the server's debug logs, and a network capture (e.g. with Wireshark) of the traffic between the driver and the server.

I'll close the issue as there doesn't seem to be anything to fix here on the driver side. If you have further questions, feel free to keep commenting or open a discussion, thread, or support ticket through the usual Neo4j channels.

ritiksoni00 commented 2 months ago

I see ResolvedIPv4Address(('ip', 7687)) in the logs. I hope you manually replaced this; if your DNS resolution really returns the string ip instead of an actual IP address, things seem very broken. The line above it (<RESOLVE> dns resolver out: i[:7687) also looks very weird.

Yes, I manually updated it. In dns resolver out: i[:7687, the [ should be a p!

Now, as you suggested: server logs and a Wireshark network capture. By server logs, do you mean the server where Neo4j is running? If so, we are using the Enterprise version; how can I get them?

Wireshark network capture:

34.96.218.41    192.168.1.3 TCP 7687 → 59330 [RST] Seq=1 Win=0 Len=0
192.168.1.3 34.96.218.41    TLSv1.2 Application Data
192.168.1.3 34.96.218.41    TCP 59332 → 7687 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=4214316845 TSecr=0 WS=128
34.96.218.41    192.168.1.3 TCP 7687 → 59332 [SYN, ACK] Seq=0 Ack=1 Win=21120 Len=0 MSS=1412 SACK_PERM TSval=366673523 TSecr=4214316845 WS=128
192.168.1.3 34.96.218.41    TCP 59332 → 7687 [ACK] Seq=1 Ack=1 Win=64256 Len=0 TSval=4214316967 TSecr=366673523
192.168.1.3 34.96.218.41    TLSv1.3 Client Hello (SNI=<changed>.databases.neo4j.io)
34.96.218.41    192.168.1.3 TCP 7687 → 59332 [ACK] Seq=1 Ack=518 Win=20864 Len=0 TSval=366673647 TSecr=4214316968
34.96.218.41    192.168.1.3 TLSv1.3 Server Hello, Change Cipher Spec
34.96.218.41    192.168.1.3 TLSv1.3 Application Data
192.168.1.3 34.96.218.41    TCP 59332 → 7687 [ACK] Seq=518 Ack=5601 Win=61312 Len=0 TSval=4214317092 TSecr=366673648
192.168.1.3 34.96.218.41    TCP 59332 → 7687 [ACK] Seq=518 Ack=6439 Win=61312 Len=0 TSval=4214317092 TSecr=366673648
192.168.1.3 34.96.218.41    TLSv1.3 Change Cipher Spec, Application Data
34.96.218.41    192.168.1.3 TCP 7687 → 59332 [ACK] Seq=6439 Ack=598 Win=20864 Len=0 TSval=366673816 TSecr=4214317097
192.168.1.3 34.96.218.41    TLSv1.3 Application Data
34.96.218.41    192.168.1.3 TCP 7687 → 59332 [ACK] Seq=6439 Ack=640 Win=20864 Len=0 TSval=366673938 TSecr=4214317260
34.96.218.41    192.168.1.3 TLSv1.3 Application Data, Application Data
192.168.1.3 34.96.218.41    TCP 59332 → 7687 [ACK] Seq=640 Ack=6947 Win=64128 Len=0 TSval=4214317383 TSecr=366673940
192.168.1.3 34.96.218.41    TLSv1.3 Application Data
34.96.218.41    192.168.1.3 TLSv1.3 Application Data
192.168.1.3 34.96.218.41    TCP 59332 → 7687 [ACK] Seq=1091 Ack=7086 Win=64128 Len=0 TSval=4214317509 TSecr=366674065
34.96.218.41    192.168.1.3 TLSv1.3 Application Data
192.168.1.3 34.96.218.41    TCP 59332 → 7687 [ACK] Seq=1091 Ack=7115 Win=64128 Len=0 TSval=4214317510 TSecr=366674066
192.168.1.3 34.96.218.41    TLSv1.3 Application Data
34.96.218.41    192.168.1.3 TLSv1.3 Application Data
192.168.1.3 34.96.218.41    TLSv1.2 Application Data
34.96.218.41    192.168.1.3 TCP 7687 → 59320 [RST] Seq=1 Win=0 Len=0 <----------- this one
192.168.1.3 34.96.218.41    TCP 59332 → 7687 [ACK] Seq=1173 Ack=7581 Win=64128 Len=0 TSval=4214317686 TSecr=366674192

After that RST packet I get the error: Failed to read from defunct connection IPv4Address(('p-e4f47fee-b09b-0056.production-orch-0005.neo4j.io', 7687)) (ResolvedIPv4Address(('34.96.218.41', 7687)))

robsdedude commented 2 months ago

By server logs, do you mean the server where Neo4j is running? If so, we are using the Enterprise version; how can I get them?

Here are some docs about it: https://neo4j.com/docs/operations-manual/current/monitoring/logging/

If you're running on Aura, I've checked but couldn't find a way to access the server's debug logs from there.

It's impossible to tell from what you've posted what's going on. For one, I can't inspect the application data (because of TLS, but also because "Application Data" is not very verbose). Further, the capture you've posted seems to contain multiple TCP connections, and I have no driver or server logs to correlate it with.

I'm afraid I'm not able to help you much further. As said, it's not a driver issue. Something in the network (the DBMS server, a load balancer, ...) is closing the connection at an unexpected point in its lifetime.

You could also try playing around with the liveness_check_timeout driver config, which makes the driver check whether connections that have been idle for longer than the timeout are still alive by sending a ping before handing them out from the pool. This means more overhead when borrowing connections from the pool, but also less frequent errors and retries.
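For example, the config could look roughly like this (URI and credentials are placeholders, and 30 seconds is an arbitrary choice; liveness_check_timeout is given in seconds):

from neo4j import GraphDatabase

# Connections that have sat idle in the pool for more than 30 seconds are
# pinged before being handed out; dead ones are replaced instead of
# surfacing as errors in the query.
driver = GraphDatabase.driver(
    "neo4j+s://example.databases.neo4j.io",
    auth=("neo4j", "password"),
    liveness_check_timeout=30,
)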

ritiksoni00 commented 2 months ago

I've checked but couldn't find a way to access the server's debug logs from there.

IIRC there is a button in the admin console of the Aura website to get the last 3 days of logs (I don't have access to the Aura website, so I can't check right now). But shouldn't there be a way to get the server logs via the API itself?

Anyway, I'm thinking it might be because of using sessions in a bad manner, or some internal issue of Neo4j.

Currently I'm using managed transactions, which retry on this error.

@robsdedude thanks for the heads-up, though.