hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability
Other
1.13k stars 115 forks source link

Primary fails post upgrade (1.4.0->1.6.2) #810

Open gordonjb opened 3 years ago

gordonjb commented 3 years ago

Hi, I mainly want to check that I've a) followed the update procedure correctly and b) figure out how to proceed to get the cluster up again.

On following the instructions for upgrading from the docs, my primary node will not start. After restarting the monitor, I checked the status as suggested:

-bash-4.2$ /usr/pgsql-12/bin/pg_autoctl show state
  Name |  Node |                        Host:Port |        TLI: LSN |   Connection |       Current State |      Assigned State
-------+-------+----------------------------------+-----------------+--------------+---------------------+--------------------
node_1 |     1 | hostname-1:5432 |  61: 0/5CE70FE0 |       none ! |      demote_timeout |        wait_primary
node_2 |     2 | hostname-2:5432 |  61: 0/5CE70FA8 |  read-only ! |           secondary |          catchingup

The steps I ran were:

yum remove pg-auto-failover14_12.x86_64 on hostname-1 and hostname-2 yum install pg-auto-failover16_12-1.6.2-1.el7.x86_64.rpm on hostname-1 and hostname-2 systemctl restart pgautofailover-monitor.service on hostname-1 (on this development db, the monitor is also hosted on hostname-1)

As I understand the instructions, this should be sufficient to upgrade pg_auto_failover. However, as seen in the above state, node_1 seems to be stuck. I have pasted logs from node_1 starting at the time the upgrade steps began below:

Sep 17 16:49:22 hostname-1 pg_autoctl[994]: 16:49:22 1086 ERROR Query returned 0 rows, expected 1
Sep 17 16:49:22 hostname-1 pg_autoctl[994]: 16:49:22 1086 FATAL Failed to check version compatibility with the monitor extension "pgautofailover", see above for details
Sep 17 16:49:22 hostname-1 pg_autoctl[994]: 16:49:22 1086 ERROR Failed to remove pid file "/tmp/pg_autoctl/var/lib/pgsql/12/data/pg_autoctl_node-active.pid": No such file or directory
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 ERROR pg_autoctl service node-active exited with exit status 6
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 INFO  Restarting service node-active
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 18155 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: No such file or directory
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: No such file or directory
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 ERROR pg_autoctl service node-active exited with exit status 12
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 INFO  Restarting service node-active
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 18159 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: No such file or directory
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: No such file or directory
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 ERROR pg_autoctl service node-active exited with exit status 12
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 INFO  Restarting service node-active
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 18161 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: No such file or directory
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: No such file or directory
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 ERROR pg_autoctl service node-active exited with exit status 12
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 INFO  Restarting service node-active
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 18169 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: No such file or directory
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: No such file or directory
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 ERROR pg_autoctl service node-active exited with exit status 12
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 INFO  Restarting service node-active
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 18174 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: No such file or directory
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: No such file or directory
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 ERROR pg_autoctl service node-active exited with exit status 12
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 994 FATAL pg_autoctl service node-active has already been restarted 5 times in the last 0 seconds, stopping now
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 1083 INFO  Postgres controller service received signal SIGTERM, terminating
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 1083 INFO  Stopping pg_autoctl postgres service
Sep 17 16:49:24 hostname-1 pg_autoctl[994]: 16:49:24 1083 INFO  /usr/pgsql-12/bin/pg_ctl --pgdata /var/lib/pgsql/12/data --wait stop --mode fast
Sep 17 16:49:25 hostname-1 pg_autoctl[994]: 16:49:25 994 INFO  Waiting for subprocesses to terminate.
Sep 17 16:49:25 hostname-1 pg_autoctl[994]: 16:49:25 1083 ERROR Failed to remove pid file "/tmp/pg_autoctl/var/lib/pgsql/12/data/pg_autoctl_postgres.pid": No such file or directory
Sep 17 16:49:26 hostname-1 systemd[1]: pgautofailover.service: main process exited, code=exited, status=12/n/a
Sep 17 16:49:26 hostname-1 systemd[1]: Unit pgautofailover.service entered failed state.
Sep 17 16:49:26 hostname-1 systemd[1]: pgautofailover.service failed.
Sep 17 16:49:26 hostname-1 systemd[1]: pgautofailover.service holdoff time over, scheduling restart.
Sep 17 16:49:26 hostname-1 systemd[1]: Stopped pg_auto_failover.
Sep 17 16:49:26 hostname-1 systemd[1]: Started pg_auto_failover.
Sep 17 16:49:26 hostname-1 systemd[1]: pgautofailover.service: main process exited, code=exited, status=203/EXEC
Sep 17 16:49:26 hostname-1 systemd[1]: Unit pgautofailover.service entered failed state.
Sep 17 16:49:26 hostname-1 systemd[1]: pgautofailover.service failed.
Sep 17 16:49:26 hostname-1 systemd[1]: pgautofailover.service holdoff time over, scheduling restart.
Sep 17 16:49:26 hostname-1 systemd[1]: Stopped pg_auto_failover.
Sep 17 16:49:26 hostname-1 systemd[1]: Started pg_auto_failover.
Sep 17 16:49:26 hostname-1 systemd[1]: pgautofailover.service: main process exited, code=exited, status=203/EXEC
Sep 17 16:49:26 hostname-1 systemd[1]: Unit pgautofailover.service entered failed state.
Sep 17 16:49:26 hostname-1 systemd[1]: pgautofailover.service failed.
Sep 17 16:49:27 hostname-1 systemd[1]: pgautofailover.service holdoff time over, scheduling restart.
Sep 17 16:49:27 hostname-1 systemd[1]: Stopped pg_auto_failover.
Sep 17 16:49:27 hostname-1 systemd[1]: Started pg_auto_failover.
Sep 17 16:49:27 hostname-1 systemd[1]: pgautofailover.service: main process exited, code=exited, status=203/EXEC

This repeats until (I assume) the 1.6 rpm finished installling:

Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21035 INFO  Started pg_autoctl postgres service with pid 21047
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21047 INFO   /usr/pgsql-12/bin/pg_autoctl do service postgres --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21035 INFO  Started pg_autoctl node-active service with pid 21048
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21048 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21048 ERROR Monitor ERROR:  loaded "pgautofailover" library version differs from latest available extension version
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21048 ERROR Monitor DETAIL:  Loaded library requires 1.4, but the latest control file specifies 1.6.
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21048 ERROR Monitor HINT:  Restart the database to load the latest version of the "pgautofailover" library.
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21048 ERROR SQL query: SELECT * FROM pgautofailover.get_nodes($1, $2) WHERE node_id = $3
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21048 ERROR SQL params: 'default', '0', '1'
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21048 ERROR Failed to get nodes for group 0 in formation "default" from the monitor
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21048 ERROR Failed to query monitor to see if node id 1 has been dropped already
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21035 WARN  pg_autoctl service node-active exited with exit status 0
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21035 INFO  Restarting service node-active
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21054 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21054 ERROR Monitor ERROR:  loaded "pgautofailover" library version differs from latest available extension version
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21054 ERROR Monitor DETAIL:  Loaded library requires 1.4, but the latest control file specifies 1.6.
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21054 ERROR Monitor HINT:  Restart the database to load the latest version of the "pgautofailover" library.
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21054 ERROR SQL query: SELECT * FROM pgautofailover.get_nodes($1, $2) WHERE node_id = $3
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21054 ERROR SQL params: 'default', '0', '1'
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21054 ERROR Failed to get nodes for group 0 in formation "default" from the monitor
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21054 ERROR Failed to query monitor to see if node id 1 has been dropped already
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21035 WARN  pg_autoctl service node-active exited with exit status 0
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21035 INFO  Restarting service node-active
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21062 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21062 ERROR Monitor ERROR:  loaded "pgautofailover" library version differs from latest available extension version
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21062 ERROR Monitor DETAIL:  Loaded library requires 1.4, but the latest control file specifies 1.6.
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21062 ERROR Monitor HINT:  Restart the database to load the latest version of the "pgautofailover" library.
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21062 ERROR SQL query: SELECT * FROM pgautofailover.get_nodes($1, $2) WHERE node_id = $3
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21062 ERROR SQL params: 'default', '0', '1'
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21062 ERROR Failed to get nodes for group 0 in formation "default" from the monitor
Sep 17 16:50:21 hostname-1 pg_autoctl[21035]: 16:50:21 21062 ERROR Failed to query monitor to see if node id 1 has been dropped already
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21035 WARN  pg_autoctl service node-active exited with exit status 0
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21035 INFO  Restarting service node-active
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21067 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21067 ERROR Monitor ERROR:  loaded "pgautofailover" library version differs from latest available extension version
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21067 ERROR Monitor DETAIL:  Loaded library requires 1.4, but the latest control file specifies 1.6.
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21067 ERROR Monitor HINT:  Restart the database to load the latest version of the "pgautofailover" library.
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21067 ERROR SQL query: SELECT * FROM pgautofailover.get_nodes($1, $2) WHERE node_id = $3
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21067 ERROR SQL params: 'default', '0', '1'
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21067 ERROR Failed to get nodes for group 0 in formation "default" from the monitor
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21067 ERROR Failed to query monitor to see if node id 1 has been dropped already
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21035 WARN  pg_autoctl service node-active exited with exit status 0
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21035 INFO  Restarting service node-active
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21074 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21074 ERROR Monitor ERROR:  loaded "pgautofailover" library version differs from latest available extension version
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21074 ERROR Monitor DETAIL:  Loaded library requires 1.4, but the latest control file specifies 1.6.
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21074 ERROR Monitor HINT:  Restart the database to load the latest version of the "pgautofailover" library.
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21074 ERROR SQL query: SELECT * FROM pgautofailover.get_nodes($1, $2) WHERE node_id = $3
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21074 ERROR SQL params: 'default', '0', '1'
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21074 ERROR Failed to get nodes for group 0 in formation "default" from the monitor
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21074 ERROR Failed to query monitor to see if node id 1 has been dropped already
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21035 WARN  pg_autoctl service node-active exited with exit status 0
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21035 INFO  Restarting service node-active
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21083 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21083 ERROR Monitor ERROR:  loaded "pgautofailover" library version differs from latest available extension version
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21083 ERROR Monitor DETAIL:  Loaded library requires 1.4, but the latest control file specifies 1.6.
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21083 ERROR Monitor HINT:  Restart the database to load the latest version of the "pgautofailover" library.
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21083 ERROR SQL query: SELECT * FROM pgautofailover.get_nodes($1, $2) WHERE node_id = $3
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21083 ERROR SQL params: 'default', '0', '1'
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21083 ERROR Failed to get nodes for group 0 in formation "default" from the monitor
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21083 ERROR Failed to query monitor to see if node id 1 has been dropped already
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21035 WARN  pg_autoctl service node-active exited with exit status 0
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21035 FATAL pg_autoctl service node-active has already been restarted 5 times in the last 1 seconds, stopping now
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21047 INFO  Postgres controller service received signal SIGTERM, terminating
Sep 17 16:50:22 hostname-1 pg_autoctl[21035]: 16:50:22 21035 FATAL Something went wrong in sub-process supervision, stopping now. See above for details.
Sep 17 16:50:22 hostname-1 systemd[1]: pgautofailover.service: main process exited, code=exited, status=12/n/a
Sep 17 16:50:22 hostname-1 systemd[1]: Unit pgautofailover.service entered failed state.
Sep 17 16:50:22 hostname-1 systemd[1]: pgautofailover.service failed.
Sep 17 16:50:22 hostname-1 systemd[1]: pgautofailover.service holdoff time over, scheduling restart.
Sep 17 16:50:22 hostname-1 systemd[1]: Stopped pg_auto_failover.
Sep 17 16:50:22 hostname-1 systemd[1]: Started pg_auto_failover.

This also repeats a few times, until the point at which the monitor service is restarted:

Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4808 WARN  pg_autoctl service node-active exited with exit status 0
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4808 INFO  Restarting service node-active
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 INFO  Reloaded the new configuration from "/var/lib/pgsql/.config/pg_autoctl/var/lib/pgsql/12/data/pg_autoctl.cfg"
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 INFO  pg_autoctl service is running, current state is "primary"
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 WARN  Failed to fetch current replication properties from standby node: no standby connected in pg_stat_replication.
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 WARN  HINT: check pg_autoctl and Postgres logs on standby nodes
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 WARN  Failed to update the keeper's state from the local PostgreSQL instance.
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 ERROR Monitor FATAL:  terminating connection due to administrator command
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 ERROR Monitor server closed the connection unexpectedly
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 ERROR Monitor         This probably means the server terminated abnormally
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 ERROR Monitor         before or while processing the request.
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 ERROR SQL query: SELECT default_version, installed_version  FROM pg_available_extensions WHERE name = $1
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 ERROR SQL params: 'pgautofailover'
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 ERROR Failed to get the current version for extension "pgautofailover", see previous lines for details.
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 FATAL Failed to check version compatibility with the monitor extension "pgautofailover", see above for details
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 WARN  pg_autoctl version "1.6.2" requires monitor extension version "1.6" and current version on the monitor is ""
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 ERROR Failed to parse Postgres version number ""
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 ERROR Failed to get the goal state from the monitor
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 WARN  Checking for network partitions...
Sep 17 16:53:36 hostname-1 pg_autoctl[4808]: 16:53:36 4865 WARN  Failed to connect to "postgres://@:5432/postgres?", retrying until the server is ready
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 ERROR Connection to database failed: could not connect to server: No such file or directory
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 ERROR         Is the server running locally and accepting
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 ERROR         connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 ERROR Failed to connect to "postgres://@:5432/postgres?" after 24 attempts in 2763 ms, pg_autoctl stops retrying now
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 INFO  Failed to contact the monitor or standby in 255 seconds, at 20 seconds we shut down PostgreSQL to prevent split brain issues
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 INFO  Network in not healthy, switching to state demote_timeout
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 INFO  Reaching new state "demote_timeout"
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 INFO  FSM transition from "primary" to "demote_timeout": A failover occurred, no longer primary
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 INFO  Transition complete: current state is now "demote_timeout"
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 INFO  Updated the keeper's state from the local PostgreSQL instance, which is not running
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes.
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 INFO  Ensuring HBA rules for node 2 "node_2" (hostname-2:5432)
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 INFO  Successfully got the goal state from the monitor
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 INFO  Monitor assigned new state "wait_primary"
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 FATAL pg_autoctl does not know how to reach state "wait_primary" from "demote_timeout"
Sep 17 16:53:39 hostname-1 pg_autoctl[4808]: 16:53:39 4865 ERROR Failed to transition to state "wait_primary", retrying...
Sep 17 16:53:40 hostname-1 pg_autoctl[4808]: 16:53:40 4865 INFO  Monitor assigned new state "wait_primary"
Sep 17 16:53:40 hostname-1 pg_autoctl[4808]: 16:53:40 4865 FATAL pg_autoctl does not know how to reach state "wait_primary" from "demote_timeout"

And then node_1 has been stuck here ever since. Restarting pgautofailover service results in the same loop:

Sep 17 18:10:49 hostname-1 pg_autoctl[4808]: 18:10:49 4808 INFO  pg_autoctl received signal SIGTERM, terminating
Sep 17 18:10:49 hostname-1 systemd[1]: Stopping pg_auto_failover...
Sep 17 18:10:49 hostname-1 pg_autoctl[4808]: 18:10:49 4814 INFO  Postgres controller service received signal SIGTERM, terminating
Sep 17 18:10:50 hostname-1 pg_autoctl[4808]: 18:10:50 4808 INFO  Stop pg_autoctl
Sep 17 18:10:50 hostname-1 systemd[1]: Stopped pg_auto_failover.
Sep 17 18:10:50 hostname-1 systemd[1]: Started pg_auto_failover.
Sep 17 18:10:50 hostname-1 pg_autoctl[13066]: 18:10:50 13066 INFO  Started pg_autoctl postgres service with pid 13069
Sep 17 18:10:50 hostname-1 pg_autoctl[13066]: 18:10:50 13069 INFO   /usr/pgsql-12/bin/pg_autoctl do service postgres --pgdata /var/lib/pgsql/12/data -v
Sep 17 18:10:50 hostname-1 pg_autoctl[13066]: 18:10:50 13066 INFO  Started pg_autoctl node-active service with pid 13070
Sep 17 18:10:50 hostname-1 pg_autoctl[13066]: 18:10:50 13070 INFO   /usr/pgsql-12/bin/pg_autoctl do service node-active --pgdata /var/lib/pgsql/12/data -v
Sep 17 18:10:50 hostname-1 pg_autoctl[13066]: 18:10:50 13070 INFO  Reloaded the new configuration from "/var/lib/pgsql/.config/pg_autoctl/var/lib/pgsql/12/data/pg_autoctl.cfg"
Sep 17 18:10:50 hostname-1 pg_autoctl[13066]: 18:10:50 13070 INFO  pg_autoctl service is running, current state is "demote_timeout"
Sep 17 18:10:50 hostname-1 pg_autoctl[13066]: 18:10:50 13070 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes.
Sep 17 18:10:50 hostname-1 pg_autoctl[13066]: 18:10:50 13070 INFO  Ensuring HBA rules for node 2 "node_2" (hostname-2:5432)
Sep 17 18:10:50 hostname-1 pg_autoctl[13066]: 18:10:50 13070 INFO  Monitor assigned new state "wait_primary"
Sep 17 18:10:50 hostname-1 pg_autoctl[13066]: 18:10:50 13070 FATAL pg_autoctl does not know how to reach state "wait_primary" from "demote_timeout"
Sep 17 18:10:50 hostname-1 pg_autoctl[13066]: 18:10:50 13070 ERROR Failed to transition to state "wait_primary", retrying...

I have full logs from the monitor, node_1 and node_2 if required.

gordonjb commented 3 years ago

The monitor crashed over the weekend.

I can run it manually by running /usr/pgsql-12/bin/pg_autoctl run --pgdata ~/monitor from the postgres user, but trying to run the monitor service fails every time, leaving an orphaned semaphore I had to clean up. I edited the service to run pg_autoctl with -vvv, and got the following:

Sep 20 17:21:40 hostname-1 systemd[1]: Started pg_auto_failover monitor process.
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:166 SetConfigFilePath: "/var/lib/pgsql/.config/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.cfg"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:194 SetStateFilePath: "/var/lib/pgsql/.local/share/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.state"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:209 SetKeeperStateFilePath: "/var/lib/pgsql/.local/share/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.init"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:237 SetNodesFilePath: "/var/lib/pgsql/.local/share/pg_autoctl/var/lib/pgsql/monitor/nodes.json"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:263 SetPidFilePath: "/tmp/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.pid"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:166 SetConfigFilePath: "/var/lib/pgsql/.config/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.cfg"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:194 SetStateFilePath: "/var/lib/pgsql/.local/share/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.state"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:209 SetKeeperStateFilePath: "/var/lib/pgsql/.local/share/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.init"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:237 SetNodesFilePath: "/var/lib/pgsql/.local/share/pg_autoctl/var/lib/pgsql/monitor/nodes.json"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:263 SetPidFilePath: "/tmp/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.pid"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 DEBUG config.c:287 Probing configuration file "/var/lib/pgsql/.config/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.cfg"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 pg_autoctl.role = monitor
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 DEBUG config.c:320 ProbeConfigurationFileRole: monitor
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:166 SetConfigFilePath: "/var/lib/pgsql/.config/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.cfg"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:194 SetStateFilePath: "/var/lib/pgsql/.local/share/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.state"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:209 SetKeeperStateFilePath: "/var/lib/pgsql/.local/share/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.init"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE config.c:263 SetPidFilePath: "/tmp/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.pid"
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 DEBUG monitor_config.c:256 Reading configuration from /var/lib/pgsql/.config/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.cfg
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 pg_autoctl.role = monitor
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 pg_autoctl.hostname = hostname-1.fqdn
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 postgresql.pgdata = /var/lib/pgsql/monitor
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 postgresql.pg_ctl = /usr/pgsql-12/bin/pg_ctl
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 postgresql.username = autoctl_node
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 postgresql.dbname = pg_auto_failover
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 postgresql.port = 5433
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 postgresql.listen_addresses = *
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 postgresql.auth_method = trust
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 ssl.sslmode = prefer
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE ini_file.c:131 ssl.active = 0
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 DEBUG pgsetup.c:122 pg_setup_init: /usr/pgsql-12/bin/pg_ctl version 12.4
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE pgsetup.c:454 Failed to open file "/var/lib/pgsql/monitor/postmaster.pid": No such file or directory
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE pgsetup.c:454 Failed to open file "/var/lib/pgsql/monitor/postmaster.pid": No such file or directory
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 DEBUG pgctl.c:212 /usr/pgsql-12/bin/pg_controldata /var/lib/pgsql/monitor
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 DEBUG pgsetup.c:392 Found PostgreSQL system 6882784943118605993 at "/var/lib/pgsql/monitor", version 1201, catalog version 201909212
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE primary_standby.c:161 local_postgres_set_status_path: /var/lib/pgsql/monitor
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE primary_standby.c:175 local_postgres_set_status_path: /tmp/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.pg
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE primary_standby.c:198 local_postgres_unlink_status_file: /tmp/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.pg
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE supervisor.c:452 supervisor_init
Sep 20 17:21:40 hostname-1 pg_autoctl[5302]: 17:21:40 5302 TRACE signals.c:37 set_signal_handlers
Sep 20 17:21:40 hostname-1 systemd[1]: pgautofailover-monitor.service: main process exited, code=dumped, status=11/SEGV
Sep 20 17:21:40 hostname-1 systemd[1]: Unit pgautofailover-monitor.service entered failed state.
Sep 20 17:21:40 hostname-1 systemd[1]: pgautofailover-monitor.service failed.

EDIT: The monitor was running fine when run manually. After comparing logs for these two, I removed a pid file from /tmp/pg_autoctl/var/lib/pgsql/monitor/pg_autoctl.pid, and then the service started as expected. My original issue still stands however.