xinferum opened this issue 5 months ago
An update: we sometimes have small network lags on these clusters - judging by the pg_auto_failover logs, the data nodes cannot always connect to the monitor. This is what we see in the log of the data node at the moment it created the connection that later hung:
Jan 22 00:50:17 server02.domain pg_autoctl[3583]: 00:50:17 3583 WARN Failed to connect to "postgres://autoctl_node@10.9.10.10:5432/pg_auto_failover?password=****", retrying until the server is ready
Jan 22 00:50:19 server02.domain pg_autoctl[3583]: 00:50:19 3583 INFO Successfully connected to "postgres://autoctl_node@10.9.10.10:5432/pg_auto_failover?password=****" after 2 attempts in 3463 ms.
There are no other logs on the servers from either pg_auto_failover or postgresql.
Judging by state_change in pg_stat_activity, the hung connection appeared at 00:50:17, which corresponds to the failed attempt by the data node to connect to the monitor (according to its log). The data node probably did connect after all, but the connection was severed at that moment by the network.
We tried setting the client_connection_check_interval and idle_session_timeout parameters, with no effect. Since neither parameter helped, we assume the connection is in the idle state but is still held open by the client. We therefore restarted pg_auto_failover on the data node that had created the connection, and the connection on the monitor was closed immediately. So pg_auto_failover held this connection from the data node for some reason and did not close it, even though (judging by the documentation) it opens a new connection roughly once a second to exchange information with the monitor (the cluster worked correctly).
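For reference, these two timeouts can be set server-wide along these lines (the values here are illustrative, not the ones we used):

```sql
-- Illustrative values; both settings can be changed without a restart.
-- client_connection_check_interval probes the client socket while a query
-- is running; idle_session_timeout terminates sessions that stay idle
-- (outside a transaction) longer than the given time.
ALTER SYSTEM SET client_connection_check_interval = '10s';
ALTER SYSTEM SET idle_session_timeout = '1h';
SELECT pg_reload_conf();
```

As described above, neither setting closed the hung connections in our case.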
There seems to be some problem that causes such connections from the data nodes to remain open on the monitor even though no activity occurs in them.
As a temporary workaround, we set up a pg_cron job on the monitor that periodically kills these hung connections from the data nodes:
select pg_terminate_backend(pid) from pg_stat_activity where usename = 'autoctl_node' and datname = 'pg_auto_failover' and state = 'idle' and (clock_timestamp() - state_change) > '03:00:00';
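For completeness, a sketch of how this cleanup can be scheduled with pg_cron (the job name and the 30-minute schedule are our choice, not prescribed):

```sql
-- Assumes the pg_cron extension is installed in the database that
-- cron.database_name points at. Runs the cleanup every 30 minutes.
SELECT cron.schedule(
  'terminate-hung-autoctl-connections',
  '*/30 * * * *',
  $$select pg_terminate_backend(pid)
    from pg_stat_activity
    where usename = 'autoctl_node'
      and datname = 'pg_auto_failover'
      and state = 'idle'
      and (clock_timestamp() - state_change) > interval '03:00:00'$$
);
```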
This solved the problem for us.
Of course, one can simply terminate such connections with any scheduled job, but we would very much like to hear the developers' opinion. It looks like, in case of network problems, connections accumulate in an internal pool...
Maybe instead of pg_cron, configure idle_session_timeout to some value like a few hours or days.
Hello.
As I wrote above:
We tried setting the client_connection_check_interval and idle_session_timeout parameters, with no effect. Since neither parameter helped, we assume the connection is in the idle state but is still held open by the client. We therefore restarted pg_auto_failover on the data node that had created the connection, and the connection on the monitor was closed immediately. So pg_auto_failover held this connection from the data node for some reason and did not close it, even though (judging by the documentation) it opens a new connection roughly once a second to exchange information with the monitor (the cluster worked correctly).
We tried that; it did not help. Apparently pg_auto_failover itself holds the connection open.
@xinferum what is your PostgreSQL version?
PostgreSQL version 15.4.
I also want to clarify that this behavior is observed on only one of our projects; it runs several pg_auto_failover clusters, and such connections appear on all of them.
There are no problems on other projects with a similar cluster configuration, but they run a different operating system.
Good afternoon.
pg_auto_failover version: 2.0
On several pg_auto_failover clusters we found hung connections in the idle state on the monitor server (probably not closed by the data nodes):
In the example on one of the servers, we see two fresh connections that will do their work and close, but the rest hang (some for several days) and never close. On one of the monitor servers, 100+ such connections accumulated over its uptime.
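The hung connections can be listed on the monitor with a query along these lines (a sketch; the column selection is ours):

```sql
-- Show monitor-side connections from the data nodes, oldest state change first.
select pid, client_addr, state, backend_start, state_change,
       clock_timestamp() - state_change as idle_for
from pg_stat_activity
where usename = 'autoctl_node'
  and datname = 'pg_auto_failover'
order by state_change;
```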
I understand that these are connections from the data nodes, made as part of the monitoring protocol https://pg-auto-failover.readthedocs.io/en/main/architecture.html#monitoring-protocol :
But for some reason not all of these connections get closed; some remain stuck in the idle state. It is possible that the data nodes do not always close the connection during operation.