Hello,
We're evaluating pg_auto_failover on a small two-node cluster without any real workload. From time to time the OOM killer terminates the pg_autoctl process because it uses up all available memory. It doesn't happen often, only once every month or two, but we'd rather have it sorted out before completing our evaluation.
I couldn't find anything relevant in the existing issues. Any help in narrowing it down is appreciated.
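To help narrow it down on our side, a minimal cron-driven sampler along these lines could record the resident set size of every pg_autoctl process over time and show whether the memory grows gradually or jumps (the script name and log path below are arbitrary choices, not anything pg_auto_failover provides):

#!/bin/sh
# sample-pg-autoctl-rss.sh — append a timestamped RSS sample (KiB)
# for each pg_autoctl process; run from cron, e.g. every 5 minutes:
#   */5 * * * * /usr/local/bin/sample-pg-autoctl-rss.sh
LOG=/var/log/pg_autoctl_rss.log   # arbitrary path, adjust to taste
{
  date '+%Y-%m-%d %H:%M:%S'
  ps -C pg_autoctl -o pid=,rss=,args=
} >> "$LOG"

Over a month of samples, plotting RSS per PID should make it obvious which pg_autoctl subprocess is growing.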
Version information
pg_auto_failover
$ /usr/pgsql-14/bin/pg_autoctl --version
pg_autoctl version 1.6.4
pg_autoctl extension version 1.6
compiled with PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
compatible with Postgres 10, 11, 12, 13, and 14
PostgreSQL
postgres=# select version();
version
---------------------------------------------------------------------------------------------------------
PostgreSQL 14.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
(1 row)
OS
$ cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
$ uname -a
Linux evaldb01 3.10.0-1160.53.1.el7.x86_64 #1 SMP Fri Jan 14 13:59:45 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Logs
pg_auto_failover service (from /var/log/messages)
Jul 13 07:31:39 evaldb01 pg_autoctl[58364]: 07:31:39 16845 WARN Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?", retrying until the server is ready
Jul 13 07:31:39 evaldb01 pg_autoctl[58364]: 07:31:39 16845 INFO Successfully connected to "postgres://postgres@/var/run/postgresql:5432/postgres?" after 2 attempts in 1891 ms.
Jul 13 07:32:03 evaldb01 pg_autoctl[58364]: 07:32:02 16845 WARN Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?", retrying until the server is ready
Jul 13 07:32:04 evaldb01 pg_autoctl[58364]: 07:32:02 16845 ERROR Connection to database failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: timeout expired
Jul 13 07:32:04 evaldb01 pg_autoctl[58364]: 07:32:03 16845 ERROR Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?" after 1 attempts in 2339 ms, pg_autoctl stops retrying now
Jul 13 07:32:04 evaldb01 pg_autoctl[58364]: 07:32:03 16845 ERROR Failed to maintain replication slots on the local Postgres instance, see above for details
Jul 13 07:32:04 evaldb01 pg_autoctl[58364]: 07:32:03 16845 WARN pg_autoctl failed to ensure current state "primary": PostgreSQL is running
Jul 13 07:32:08 evaldb01 pg_autoctl[58364]: 07:32:08 16845 WARN Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?", retrying until the server is ready
Jul 13 07:32:10 evaldb01 pg_autoctl[58364]: 07:32:09 16845 ERROR Connection to database failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: could not fork new process for co
Jul 13 07:32:11 evaldb01 pg_autoctl[58364]: 07:32:09 16845 ERROR Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?" after 2 attempts in 3147 ms, pg_autoctl stops retrying now
Jul 13 07:32:12 evaldb01 pg_autoctl[58364]: 07:32:10 16845 WARN Failed to update the local Postgres metadata
Jul 13 07:32:12 evaldb01 pg_autoctl[58364]: 07:32:10 16845 WARN Failed to update the keeper's state from the local PostgreSQL instance.
Jul 13 07:32:12 evaldb01 pg_autoctl[58364]: 07:32:12 16845 WARN Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?", retrying until the server is ready
Jul 13 07:32:15 evaldb01 pg_autoctl[58364]: 07:32:14 16845 ERROR Connection to database failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: could not fork new process for co
Jul 13 07:32:16 evaldb01 pg_autoctl[58364]: 07:32:14 16845 ERROR Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?" after 4 attempts in 2225 ms, pg_autoctl stops retrying now
Jul 13 07:32:16 evaldb01 pg_autoctl[58364]: 07:32:15 16845 ERROR Failed to maintain replication slots on the local Postgres instance, see above for details
Jul 13 07:32:16 evaldb01 pg_autoctl[58364]: 07:32:15 16845 WARN pg_autoctl failed to ensure current state "primary": PostgreSQL is running
Jul 13 07:32:19 evaldb01 pg_autoctl[58364]: 07:32:19 16845 WARN Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?", retrying until the server is ready
Jul 13 07:32:22 evaldb01 pg_autoctl[58364]: 07:32:21 16845 ERROR Connection to database failed: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: could not fork new process for co
Jul 13 07:32:23 evaldb01 pg_autoctl[58364]: 07:32:21 16845 ERROR Failed to connect to "postgres://postgres@/var/run/postgresql:5432/postgres?" after 4 attempts in 2683 ms, pg_autoctl stops retrying now
Jul 13 07:32:23 evaldb01 pg_autoctl[58364]: 07:32:21 16845 WARN Failed to update the local Postgres metadata
Jul 13 07:32:23 evaldb01 pg_autoctl[58364]: 07:32:22 16845 WARN Failed to update the keeper's state from the local PostgreSQL instance.
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 58364 ERROR pg_autoctl service node-active exited after receiving signal Killed
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 58364 INFO Restarting service node-active
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO /usr/pgsql-14/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/14/data -v
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO Reloaded the new configuration from "/var/lib/pgsql/.config/pg_autoctl/var/lib/pgsql/14/data/pg_autoctl.cfg"
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO Reloading Postgres configuration and HBA rules
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO pg_autoctl service is running, current state is "primary"
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes.
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO Ensuring HBA rules for node 2 "node_2" (evaldb02.imagemaster.local:5432)
Jul 13 07:32:25 evaldb01 pg_autoctl[58364]: 07:32:25 55074 INFO Reloading Postgres configuration and HBA rules
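The "exited after receiving signal Killed" line is consistent with an OOM kill; the kernel log should confirm the victim and include the per-process memory dump the OOM killer prints at that moment. Assuming default CentOS 7 logging, something like:

$ dmesg -T | grep -iB1 'killed process'
$ grep -i 'oom-killer\|Out of memory' /var/log/messages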
PostgreSQL
2022-07-13 01:00:04.649 CEST [58401]: [201-1] LOG: checkpoint starting: immediate force wait
2022-07-13 01:00:05.324 CEST [58401]: [202-1] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.554 s, sync=0.002 s, total=0.685 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=16384 kB
2022-07-13 01:30:04.415 CEST [58401]: [203-1] LOG: checkpoint starting: time
2022-07-13 01:30:04.423 CEST [58401]: [204-1] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.008 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16384 kB, estimate=16384 kB
2022-07-13 07:31:34.924 CEST [55023]: [1-1] FATAL: too many connections for role "pgautofailover_monitor"
2022-07-13 07:32:04.922 CEST [58398]: [49-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:07.259 CEST [58398]: [50-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:07.364 CEST [55063]: [1-1] LOG: could not send data to client: Broken pipe
2022-07-13 07:32:07.537 CEST [55063]: [2-1] FATAL: connection to client lost
2022-07-13 07:32:07.631 CEST [58398]: [51-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:09.320 CEST [58398]: [52-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:09.965 CEST [58398]: [53-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:12.309 CEST [58398]: [54-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:12.374 CEST [58398]: [55-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:12.383 CEST [58398]: [56-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:13.500 CEST [58398]: [57-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:14.348 CEST [58398]: [58-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:14.886 CEST [58398]: [59-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:17.325 CEST [58398]: [60-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:18.977 CEST [58398]: [61-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:19.717 CEST [58398]: [62-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:19.725 CEST [58398]: [63-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:20.179 CEST [58398]: [64-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:21.107 CEST [58398]: [65-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:22.036 CEST [58398]: [66-1] LOG: could not fork new process for connection: Cannot allocate memory
2022-07-13 07:32:24.931 CEST [55028]: [1-1] LOG: could not receive data from client: Connection reset by peer
2022-07-13 07:32:25.232 CEST [58398]: [67-1] LOG: received SIGHUP, reloading configuration files
2022-07-13 07:32:25.232 CEST [58398]: [68-1] LOG: parameter "recovery_target_timeline" cannot be changed without restarting the server
2022-07-13 07:32:25.241 CEST [58398]: [69-1] LOG: configuration file "/var/lib/pgsql/14/data/postgresql.conf" contains errors; unaffected changes were applied
2022-07-13 07:32:25.310 CEST [58398]: [70-1] LOG: received SIGHUP, reloading configuration files
2022-07-13 07:32:25.310 CEST [58398]: [71-1] LOG: parameter "recovery_target_timeline" cannot be changed without restarting the server
2022-07-13 07:32:25.310 CEST [58398]: [72-1] LOG: configuration file "/var/lib/pgsql/14/data/postgresql.conf" contains errors; unaffected changes were applied
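The FATAL: too many connections for role "pgautofailover_monitor" at 07:31:34 precedes the first fork failure by about half a minute, so one thing worth watching is whether the monitor's health-check connections pile up faster than they are closed. A plain pg_stat_activity query run periodically (nothing specific to pg_auto_failover here) would show that:

$ psql -c "SELECT usename, state, count(*) FROM pg_stat_activity GROUP BY usename, state ORDER BY 3 DESC;"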
pg_auto_failover events
Event Time | Node | Current State | Assigned State | Comment
-------------------------------+--------+---------------------+---------------------+-----------
2022-06-14 21:18:26.69914+02 | 0/1 | primary | primary | New state is reported by node 1 "node_1" (evaldb01.imagemaster.local:5432): "primary"
2022-07-13 07:31:24.049914+02 | 0/1 | primary | primary | Node node 1 "node_1" (evaldb01.imagemaster.local:5432) is marked as unhealthy by the monitor
2022-07-13 07:31:35.898699+02 | 0/1 | primary | primary | Node node 1 "node_1" (evaldb01.imagemaster.local:5432) is marked as healthy by the monitor
2022-07-13 07:32:12.287994+02 | 0/1 | primary | primary | Node node 1 "node_1" (evaldb01.imagemaster.local:5432) is marked as unhealthy by the monitor
2022-07-13 07:32:24.364321+02 | 0/1 | primary | primary | Node node 1 "node_1" (evaldb01.imagemaster.local:5432) is marked as healthy by the monitor
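For completeness, the kernel overcommit settings on both nodes may matter here, since they decide when fork() starts failing with "Cannot allocate memory" versus when the OOM killer steps in; the PostgreSQL documentation suggests vm.overcommit_memory = 2 on dedicated database hosts so that allocations fail cleanly rather than triggering the killer. A quick check:

$ sysctl vm.overcommit_memory vm.overcommit_ratio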