Hello,
We're evaluating pg_auto_failover on a small two-node cluster without any real workload. From time to time the OOM killer terminates the pg_autoctl process because it uses up all available memory. It doesn't happen often, only once every month or two, but we'd rather have it sorted out before completing our evaluation.
I couldn't find anything relevant in the existing issues. Any help in narrowing it down is appreciated.
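To help narrow it down on our side, a minimal cron-driven sampler along these lines could record the resident set size of every pg_autoctl process over time and show whether the memory grows gradually or jumps (the script name and log path below are arbitrary choices, not anything pg_auto_failover provides):

#!/bin/sh
# sample-pg-autoctl-rss.sh — append a timestamped RSS sample (KiB)
# for each pg_autoctl process; run from cron, e.g. every 5 minutes:
#   */5 * * * * /usr/local/bin/sample-pg-autoctl-rss.sh
LOG=/var/log/pg_autoctl_rss.log   # arbitrary path, adjust to taste
{
  date '+%Y-%m-%d %H:%M:%S'
  ps -C pg_autoctl -o pid=,rss=,args=
} >> "$LOG"

Over a month of samples, plotting RSS per PID should make it obvious which pg_autoctl subprocess is growing.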
Version information
pg_auto_failover
$ /usr/pgsql-14/bin/pg_autoctl --version
pg_autoctl version 1.6.4
pg_autoctl extension version 1.6
compiled with PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
compatible with Postgres 10, 11, 12, 13, and 14
PostgreSQL
postgres=# select version();
version
---------------------------------------------------------------------------------------------------------
PostgreSQL 14.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
(1 row)
OS
$ cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
$ uname -a
Linux evaldb01 3.10.0-1160.53.1.el7.x86_64 #1 SMP Fri Jan 14 13:59:45 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Logs
pg_auto_failover service (from /var/log/messages)
Jul 13 07:31:39 evaldb01 pg_autoctl[58364]: 07:31:39 16845 WARN Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?", retrying until the server is ready
Jul 13 07:31:39 evaldb01 pg_autoctl[58364]: 07:31:39 16845 INFO Successfully connected to "postgres://postgres@/var/run/postgresql:5432/postgres?" after 2 attempts in 1891 ms.
Jul 13 07:32:03 evaldb01 pg_autoctl[58364]: 07:32:02 16845 WARN Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?", retrying until the server is ready
Jul 13 07:32:04 evaldb01 pg_autoctl[58364]: 07:32:02 16845 ERROR Connection to database failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: timeout expired
Jul 13 07:32:04 evaldb01 pg_autoctl[58364]: 07:32:03 16845 ERROR Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?" after 1 attempts in 2339 ms, pg_autoctl stops retrying now
Jul 13 07:32:04 evaldb01 pg_autoctl[58364]: 07:32:03 16845 ERROR Failed to maintain replication slots on the local Postgres instance, see above for details
Jul 13 07:32:04 evaldb01 pg_autoctl[58364]: 07:32:03 16845 WARN pg_autoctl failed to ensure current state "primary": PostgreSQL is running
Jul 13 07:32:08 evaldb01 pg_autoctl[58364]: 07:32:08 16845 WARN Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?", retrying until the server is ready
Jul 13 07:32:10 evaldb01 pg_autoctl[58364]: 07:32:09 16845 ERROR Connection to database failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: could not fork new process for co
Jul 13 07:32:11 evaldb01 pg_autoctl[58364]: 07:32:09 16845 ERROR Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?" after 2 attempts in 3147 ms, pg_autoctl stops retrying now
Jul 13 07:32:12 evaldb01 pg_autoctl[58364]: 07:32:10 16845 WARN Failed to update the local Postgres metadata
Jul 13 07:32:12 evaldb01 pg_autoctl[58364]: 07:32:10 16845 WARN Failed to update the keeper's state from the local PostgreSQL instance.
Jul 13 07:32:12 evaldb01 pg_autoctl[58364]: 07:32:12 16845 WARN Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?", retrying until the server is ready
Jul 13 07:32:15 evaldb01 pg_autoctl[58364]: 07:32:14 16845 ERROR Connection to database failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: could not fork new process for co
Jul 13 07:32:16 evaldb01 pg_autoctl[58364]: 07:32:14 16845 ERROR Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?" after 4 attempts in 2225 ms, pg_autoctl stops retrying now
Jul 13 07:32:16 evaldb01 pg_autoctl[58364]: 07:32:15 16845 ERROR Failed to maintain replication slots on the local Postgres instance, see above for details
Jul 13 07:32:16 evaldb01 pg_autoctl[58364]: 07:32:15 16845 WARN pg_autoctl failed to ensure current state "primary": PostgreSQL is running
Jul 13 07:32:19 evaldb01 pg_autoctl[58364]: 07:32:19 16845 WARN Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?", retrying until the server is ready
Jul 13 07:32:22 evaldb01 pg_autoctl[58364]: 07:32:21 16845 ERROR Connection to database failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: could not fork new process for co
Jul 13 07:32:23 evaldb01 pg_autoctl[58364]: 07:32:21 16845 ERROR Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?" after 4 attempts in 2683 ms, pg_autoctl stops retrying now
Jul 13 07:32:23 evaldb01 pg_autoctl[58364]: 07:32:21 16845 WARN Failed to update the local Postgres metadata
Jul 13 07:32:23 evaldb01 pg_autoctl[58364]: 07:32:22 16845 WARN Failed to update the keeper's state from the local PostgreSQL instance.
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 58364 ERROR pg_autoctl service node-active exited after receiving signal Killed
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 58364 INFO Restarting service node-active
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO /usr/pgsql-14/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/14/data -v
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO Reloaded the new configuration from "/var/lib/pgsql/.config/pg_autoctl/var/lib/pgsql/14/data/pg_autoctl.cfg"
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO Reloading Postgres configuration and HBA rules
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO pg_autoctl service is running, current state is "primary"
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes.
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO Ensuring HBA rules for node 2 "node_2" (evaldb02.imagemaster.local:5432)
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO Reloading Postgres configuration and HBA rules
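The "exited after receiving signal Killed" line is consistent with an OOM kill; the kernel log should confirm the victim and include the per-process memory dump the OOM killer prints at that moment. Assuming default CentOS 7 logging, something like:

$ dmesg -T | grep -iB1 'killed process'
$ grep -i 'oom-killer\|Out of memory' /var/log/messages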
PostgreSQL
2022-07-13 01:00:04.649 CEST [58401]: [201-1] LOG: checkpoint starting: immediate force wait
2022-07-13 01:00:05.324 CEST [58401]: [202-1] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.554 s, sync=0.002 s, total=0.685 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=16384 kB
2022-07-13 01:30:04.415 CEST [58401]: [203-1] LOG: checkpoint starting: time
2022-07-13 01:30:04.423 CEST [58401]: [204-1] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.008 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16384 kB, estimate=16384 kB
2022-07-13 07:31:34.924 CEST [55023]: [1-1] FATAL: too many connections for role "pgautofailover_monitor"
2022-07-13 07:32:04.922 CEST [58398]: [49-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:07.259 CEST [58398]: [50-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:07.364 CEST [55063]: [1-1] LOG: could not send data to client: Broken pipe
2022-07-13 07:32:07.537 CEST [55063]: [2-1] FATAL: connection to client lost
2022-07-13 07:32:07.631 CEST [58398]: [51-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:09.320 CEST [58398]: [52-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:09.965 CEST [58398]: [53-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:12.309 CEST [58398]: [54-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:12.374 CEST [58398]: [55-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:12.383 CEST [58398]: [56-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:13.500 CEST [58398]: [57-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:14.348 CEST [58398]: [58-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:14.886 CEST [58398]: [59-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:17.325 CEST [58398]: [60-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:18.977 CEST [58398]: [61-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:19.717 CEST [58398]: [62-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:19.725 CEST [58398]: [63-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:20.179 CEST [58398]: [64-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:21.107 CEST [58398]: [65-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:22.036 CEST [58398]: [66-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:24.931 CEST [55028]: [1-1] LOG: could not receive data from client: Connection reset by peer
2022-07-13 07:32:25.232 CEST [58398]: [67-1] LOG: received SIGHUP, reloading configuration files
2022-07-13 07:32:25.232 CEST [58398]: [68-1] LOG: parameter "recovery_target_timeline" cannot be changed without restarting the server
2022-07-13 07:32:25.241 CEST [58398]: [69-1] LOG: configuration file "/var/lib/pgsql/14/data/postgresql.conf" contains errors; unaffected changes were applied
2022-07-13 07:32:25.310 CEST [58398]: [70-1] LOG: received SIGHUP, reloading configuration files
2022-07-13 07:32:25.310 CEST [58398]: [71-1] LOG: parameter "recovery_target_timeline" cannot be changed without restarting the server
2022-07-13 07:32:25.310 CEST [58398]: [72-1] LOG: configuration file "/var/lib/pgsql/14/data/postgresql.conf" contains errors; unaffected changes were applied
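The FATAL: too many connections for role "pgautofailover_monitor" at 07:31:34 precedes the first fork failure by about half a minute, so one thing worth watching is whether the monitor's health-check connections pile up faster than they are closed. A plain pg_stat_activity query run periodically (nothing specific to pg_auto_failover here) would show that:

$ psql -c "SELECT usename, state, count(*) FROM pg_stat_activity GROUP BY usename, state ORDER BY 3 DESC;"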
pg_auto_failover events
Event Time | Node | Current State | Assigned State | Comment
-------------------------------+--------+---------------------+---------------------+-----------
2022-06-14 21:18:26.69914+02 | 0/1 | primary | primary | New state is reported by node 1 "node_1" (evaldb01.imagemaster.local:5432): "primary"
2022-07-13 07:31:24.049914+02 | 0/1 | primary | primary | Node node 1 "node_1" (evaldb01.imagemaster.local:5432) is marked as unhealthy by the monitor
2022-07-13 07:31:35.898699+02 | 0/1 | primary | primary | Node node 1 "node_1" (evaldb01.imagemaster.local:5432) is marked as healthy by the monitor
2022-07-13 07:32:12.287994+02 | 0/1 | primary | primary | Node node 1 "node_1" (evaldb01.imagemaster.local:5432) is marked as unhealthy by the monitor
2022-07-13 07:32:24.364321+02 | 0/1 | primary | primary | Node node 1 "node_1" (evaldb01.imagemaster.local:5432) is marked as healthy by the monitor
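For completeness, the kernel overcommit settings on both nodes may matter here, since they decide when fork() starts failing with "Cannot allocate memory" versus when the OOM killer steps in; the PostgreSQL documentation suggests vm.overcommit_memory = 2 on dedicated database hosts so that allocations fail cleanly rather than triggering the killer. A quick check:

$ sysctl vm.overcommit_memory vm.overcommit_ratio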