hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

Hung (unclosed) old connections in idle status on the monitor from datanodes #1028

Open xinferum opened 5 months ago

xinferum commented 5 months ago

Good afternoon.

pg_auto_failover version: 2.0

We have found, on several pg_auto_failover clusters, that the monitor server accumulates stale (apparently never closed by the data nodes) connections in the idle state:

datid   pid usename     application_name        client_addr backend_start           state_change            "now() - backend_start" state   wait_event_type     wait_event  query
16521   272257  autoctl_node    pgautofailover_standby_1    10.9.10.11  2024-01-10 18:36:34.894 +0300   2024-01-10 18:36:36.296 +0300   8 days 14:36:37.430901  idle    Client          ClientRead  
16521   1195688 autoctl_node    pgautofailover_standby_3    10.9.10.13  2024-01-11 17:31:07.396 +0300   2024-01-11 17:31:09.309 +0300   7 days 15:42:04.928758  idle    Client          ClientRead  
16521   1714307 autoctl_node    pgautofailover_standby_1    10.9.10.11  2024-01-12 06:23:28.683 +0300   2024-01-12 06:23:30.095 +0300   7 days 02:49:43.641432  idle    Client          ClientRead  
16521   2335845 autoctl_node    pgautofailover_standby_1    10.9.10.11  2024-01-12 21:51:32.808 +0300   2024-01-12 21:51:34.295 +0300   6 days 11:21:39.51676   idle    Client          ClientRead  
16521   2882026 autoctl_node    pgautofailover_standby_1    10.9.10.11  2024-01-13 11:24:12.232 +0300   2024-01-13 11:24:14.003 +0300   5 days 21:49:00.093055  idle    Client          ClientRead  
16521   3127812 autoctl_node    pgautofailover_standby_1    10.9.10.11  2024-01-13 17:29:09.876 +0300   2024-01-13 17:29:11.251 +0300   5 days 15:44:02.448422  idle    Client          ClientRead  
16521   1580029 autoctl_node    pgautofailover_standby_3    10.9.10.13  2024-01-16 11:13:23.745 +0300   2024-01-16 11:13:25.066 +0300   2 days 21:59:48.579498  idle    Client          ClientRead  
16521   1938664 autoctl_node    pgautofailover_standby_1    10.9.10.11  2024-01-16 20:10:02.488 +0300   2024-01-16 20:10:04.085 +0300   2 days 13:03:09.836981  idle    Client          ClientRead  
16521   2140541 autoctl_node    pgautofailover_standby_3    10.9.10.13  2024-01-17 01:11:00.742 +0300   2024-01-17 01:11:02.403 +0300   2 days 08:02:11.58282   idle    Client          ClientRead  
16521   2275548 autoctl_node    pgautofailover_standby_1    10.9.10.11  2024-01-17 04:31:38.530 +0300   2024-01-17 04:31:40.203 +0300   2 days 04:41:33.794781  idle    Client          ClientRead  
16521   3986569 autoctl_node    pgautofailover_standby_1    10.9.10.11  2024-01-18 22:55:32.673 +0300   2024-01-18 22:55:34.221 +0300   10:17:39.651776     idle    Client          ClientRead  
16521   4040617 autoctl_node    pgautofailover_standby_3    10.9.10.13  2024-01-19 00:15:48.534 +0300   2024-01-19 00:15:50.478 +0300   08:57:23.790725     idle    Client          ClientRead  
16521   209392  autoctl_node    pgautofailover_standby_3    10.9.10.13  2024-01-19 09:13:11.821 +0300   2024-01-19 09:13:11.827 +0300   00:00:00.50327      idle    Client          ClientRead  LISTEN "state"
16521   209398  autoctl_node    pgautofailover_standby_1    10.9.10.11  2024-01-19 09:13:12.223 +0300   2024-01-19 09:13:12.228 +0300   00:00:00.101521     idle    Client          ClientRead  LISTEN "state"

In this example from one of the servers, the two most recent connections are live and will close normally, but the rest hang around (some for several days) and never close. On one of the monitor servers, more than 100 such connections have accumulated over its uptime.
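
For reference, a pg_stat_activity query along these lines yields the listing above (a sketch; the filter on the autoctl_node role is an assumption based on the usename column in the output):

select datid, pid, usename, application_name, client_addr,
       backend_start, state_change, now() - backend_start,
       state, wait_event_type, wait_event, query
  from pg_stat_activity
 where usename = 'autoctl_node'   -- monitor connections opened by the data nodes
 order by backend_start;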

I understand that these are connections from the data nodes, made as part of the monitoring protocol https://pg-auto-failover.readthedocs.io/en/main/architecture.html#monitoring-protocol :

Data nodes periodically connect and run SELECT pgautofailover.node_active(…) to communicate their current state and obtain their goal state.

But for some reason not all of these connections get closed; some remain stuck in the idle state. It is possible that the data nodes do not always close the connection properly during operation.

xinferum commented 5 months ago

An update: we occasionally see short network lags on these clusters, and judging by the pg_auto_failover logs, the data nodes cannot always connect to the monitor. This is what the data node logged at the moment it created a connection that ended up hanging:

Jan 22 00:50:17 server02.domain pg_autoctl[3583]: 00:50:17 3583 WARN  Failed to connect to "postgres://autoctl_node@10.9.10.10:5432/pg_auto_failover?password=****", retrying until the server is ready
Jan 22 00:50:19 server02.domain pg_autoctl[3583]: 00:50:19 3583 INFO  Successfully connected to "postgres://autoctl_node@10.9.10.10:5432/pg_auto_failover?password=****" after 2 attempts in 3463 ms.

There are no other related entries in the logs on these servers, from either pg_auto_failover or PostgreSQL.

Judging by state_change in pg_stat_activity, the hung connection appeared at 00:50:17, which matches the failed attempt by the data node to connect to the monitor (according to its log). The data node probably did connect in the end, but the connection was broken at that moment because of the network.

We tried setting the client_connection_check_interval and idle_session_timeout parameters, but that had no effect. Since changing both of these parameters made no difference, we assume the connection is idle on the server side but still held open by the client. We therefore restarted pg_auto_failover on the data node from which one of these connections originated, and the connection on the monitor was closed immediately. So pg_auto_failover was holding this connection to the monitor for some reason and not closing it, even though (judging by the documentation) it opens a new connection once a second to exchange information with the monitor (the cluster itself kept working correctly).
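
For context, the two settings we tried were applied on the monitor roughly as follows (the values here are illustrative, not the exact ones we used; both GUCs exist since PostgreSQL 14):

alter system set client_connection_check_interval = '30s';  -- detect dead client sockets
alter system set idle_session_timeout = '1h';                -- drop sessions idle longer than this
select pg_reload_conf();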

It seems there is some problem that causes the data node to keep holding such connections open on the monitor even though no activity occurs on them.

xinferum commented 5 months ago

As a temporary workaround, we scheduled a pg_cron job on the monitor that periodically terminates these hung connections from the data nodes:

select pg_terminate_backend(pid) from pg_stat_activity where usename = 'autoctl_node' and datname = 'pg_auto_failover' and state = 'idle' and (clock_timestamp() - state_change) > '03:00:00';
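
Scheduled with pg_cron, that statement looks roughly like the following (a sketch assuming pg_cron 1.3+ with the named-job form of cron.schedule; the job name, hourly schedule, and three-hour threshold are illustrative):

-- assumes the pg_cron extension is installed in the monitor's database
select cron.schedule(
  'kill-idle-autoctl-connections',   -- hypothetical job name
  '0 * * * *',                       -- every hour
  $$select pg_terminate_backend(pid)
      from pg_stat_activity
     where usename = 'autoctl_node'
       and datname = 'pg_auto_failover'
       and state = 'idle'
       and (clock_timestamp() - state_change) > interval '03:00:00'$$
);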

This solved the problem for us.

sgrinko commented 5 months ago

Of course, you can simply kill such connections from any scheduled job, but I would very much like to hear the developers' opinion. It looks like, in case of network problems, connections accumulate in an internal pool...

Akkowicz commented 3 months ago

Maybe instead of pg_cron, configure idle_session_timeout to some value like a few hours or days.

xinferum commented 3 months ago

Maybe instead of pg_cron, configure idle_session_timeout to some value like a few hours or days.

Hello.

As I wrote above:

We tried setting the client_connection_check_interval and idle_session_timeout parameters, but that had no effect. Since changing both of these parameters made no difference, we assume the connection is idle on the server side but still held open by the client. We therefore restarted pg_auto_failover on the data node from which one of these connections originated, and the connection on the monitor was closed immediately. So pg_auto_failover was holding this connection to the monitor for some reason and not closing it, even though (judging by the documentation) it opens a new connection once a second to exchange information with the monitor (the cluster itself kept working correctly).

We tried that; it did not help. Apparently pg_auto_failover itself holds the connection open.

Akkowicz commented 3 months ago

@xinferum what is your PostgreSQL version?

xinferum commented 3 months ago

PostgreSQL version 15.4.

I also want to clarify that this behavior is observed on only one of our projects; that project, however, runs several pg_auto_failover clusters, and such connections appear on all of them.

Other projects with a similar cluster configuration show no such problems, but they run a different operating system.