hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability
Other
1.09k stars 114 forks source link

Missing pg_autoctl.state file #840

Closed schinns closed 2 years ago

schinns commented 2 years ago

I have successfully setup monitor. I am creating a node and running into this error:

postgres@primary:~$ pg_autoctl create postgres --hostname 192.168.56.4 --pgport 5433 --dbname autodb --auth trust --no-ssl --monitor postgres://autoctl_node@10.0.2.2:5003/pg_auto_failover  --run
23:17:56 2159 WARN  No encryption is used for network traffic! This allows an attacker on the network to read all replication data.
23:17:56 2159 WARN  Using --ssl-self-signed instead of --no-ssl is recommend to achieve more security with the same ease of deployment.
23:17:56 2159 WARN  See https://www.postgresql.org/docs/current/libpq-ssl.html for details on how to improve
23:17:56 2159 INFO  Using default --ssl-mode "prefer"
23:17:56 2159 INFO  Started pg_autoctl postgres service with pid 2162
23:17:56 2159 INFO  Started pg_autoctl node-active service with pid 2163
23:17:56 2163 ERROR Failed to open file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state": No such file or directory
23:17:56 2163 ERROR Failed to read Keeper state from file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state"
23:17:56 2162 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/postgresql/13/primary -v
23:17:56 2159 ERROR pg_autoctl service node-active exited with exit status 12
23:17:56 2159 INFO  Restarting service node-active
23:17:56 2166 ERROR Failed to open file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state": No such file or directory
23:17:56 2166 ERROR Failed to read Keeper state from file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state"
23:17:56 2159 ERROR pg_autoctl service node-active exited with exit status 12
23:17:56 2159 INFO  Restarting service node-active
23:17:56 2167 ERROR Failed to open file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state": No such file or directory
23:17:56 2167 ERROR Failed to read Keeper state from file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state"
23:17:56 2159 ERROR pg_autoctl service node-active exited with exit status 12
23:17:56 2159 INFO  Restarting service node-active
23:17:56 2168 ERROR Failed to open file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state": No such file or directory
23:17:56 2168 ERROR Failed to read Keeper state from file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state"
23:17:56 2159 ERROR pg_autoctl service node-active exited with exit status 12
23:17:56 2159 INFO  Restarting service node-active
23:17:56 2169 ERROR Failed to open file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state": No such file or directory
23:17:56 2169 ERROR Failed to read Keeper state from file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state"
23:17:56 2159 ERROR pg_autoctl service node-active exited with exit status 12
23:17:56 2159 INFO  Restarting service node-active
23:17:56 2170 ERROR Failed to open file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state": No such file or directory
23:17:56 2170 ERROR Failed to read Keeper state from file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state"
23:17:56 2159 ERROR pg_autoctl service node-active exited with exit status 12
23:17:56 2159 FATAL pg_autoctl service node-active has already been restarted 5 times in the last 0 seconds, stopping now
23:17:56 2162 INFO  Postgres controller service received signal SIGTERM, terminating
23:17:56 2159 FATAL Something went wrong in sub-process supervision, stopping now. See above for details.
23:17:56 2159 INFO  Stop pg_autoctl

I understand pg_autoctl.state is a binary file, and is supposed to be created by pg_autoctl. What are you supposed to do when pg_autoctl.state file does not exist? I also noticed that the state file does not exist for the monitor node. Please advise. TIA.

rheaton commented 2 years ago

Is that location (/var/lib/postgresql/) editable by the postgres user? Perhaps pg_autoctl does not have permission to create the file.

schinns commented 2 years ago

I have played around with permissions and see a Permission denied specific error:

vagrant@primary:~$ sudo chmod 644 -R /var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/
vagrant@primary:~$ sudo su - postgres
postgres@primary:~$ export PGDATA=/var/lib/postgresql/13/primary
postgres@primary:~$ pg_autoctl create postgres --hostname 192.168.56.4 --pgport 5433 --dbname autodb --auth trust --no-ssl --monitor postgres://autoctl_node@10.0.2.2:5003/pg_auto_failover  --run
23:28:11 2198 WARN  No encryption is used for network traffic! This allows an attacker on the network to read all replication data.
23:28:11 2198 WARN  Using --ssl-self-signed instead of --no-ssl is recommend to achieve more security with the same ease of deployment.
23:28:11 2198 WARN  See https://www.postgresql.org/docs/current/libpq-ssl.html for details on how to improve
23:28:11 2198 INFO  Using default --ssl-mode "prefer"
23:28:11 2198 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:11 2198 INFO  Started pg_autoctl postgres service with pid 2201
23:28:11 2198 INFO  Started pg_autoctl node-active service with pid 2202
23:28:11 2202 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.init" exists: Permission denied
23:28:11 2202 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:11 2202 INFO  Registering Postgres system 7036120085397108718 found at "/var/lib/postgresql/13/primary"
23:28:11 2202 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:11 2202 ERROR Failed to remove file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state.new": Permission denied
23:28:11 2202 FATAL Failed to create a state file prior to registering the node with the monitor, see above for details
23:28:11 2202 ERROR Failed to register the existing local Postgres node "192.168.56.4:5433" running at "/var/lib/postgresql/13/primary"to the pg_auto_failover monitor at postgres://autoctl_node@10.0.2.2:5003/pg_auto_failover?, see above for details
23:28:11 2201 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/postgresql/13/primary -v
23:28:12 2198 ERROR pg_autoctl service node-active exited with exit status 12
23:28:12 2198 INFO  Restarting service node-active
23:28:12 2205 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.init" exists: Permission denied
23:28:12 2205 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:12 2205 INFO  Registering Postgres system 7036120085397108718 found at "/var/lib/postgresql/13/primary"
23:28:12 2205 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:12 2205 ERROR Failed to remove file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state.new": Permission denied
23:28:12 2205 FATAL Failed to create a state file prior to registering the node with the monitor, see above for details
23:28:12 2205 ERROR Failed to register the existing local Postgres node "192.168.56.4:5433" running at "/var/lib/postgresql/13/primary"to the pg_auto_failover monitor at postgres://autoctl_node@10.0.2.2:5003/pg_auto_failover?, see above for details
23:28:12 2198 ERROR pg_autoctl service node-active exited with exit status 12
23:28:12 2198 INFO  Restarting service node-active
23:28:12 2206 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.init" exists: Permission denied
23:28:12 2206 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:12 2206 INFO  Registering Postgres system 7036120085397108718 found at "/var/lib/postgresql/13/primary"
23:28:12 2206 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:12 2206 ERROR Failed to remove file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state.new": Permission denied
23:28:12 2206 FATAL Failed to create a state file prior to registering the node with the monitor, see above for details
23:28:12 2206 ERROR Failed to register the existing local Postgres node "192.168.56.4:5433" running at "/var/lib/postgresql/13/primary"to the pg_auto_failover monitor at postgres://autoctl_node@10.0.2.2:5003/pg_auto_failover?, see above for details
23:28:12 2198 ERROR pg_autoctl service node-active exited with exit status 12
23:28:12 2198 INFO  Restarting service node-active
23:28:12 2207 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.init" exists: Permission denied
23:28:12 2207 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:12 2207 INFO  Registering Postgres system 7036120085397108718 found at "/var/lib/postgresql/13/primary"
23:28:12 2207 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:12 2207 ERROR Failed to remove file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state.new": Permission denied
23:28:12 2207 FATAL Failed to create a state file prior to registering the node with the monitor, see above for details
23:28:12 2207 ERROR Failed to register the existing local Postgres node "192.168.56.4:5433" running at "/var/lib/postgresql/13/primary"to the pg_auto_failover monitor at postgres://autoctl_node@10.0.2.2:5003/pg_auto_failover?, see above for details
23:28:12 2198 ERROR pg_autoctl service node-active exited with exit status 12
23:28:12 2198 INFO  Restarting service node-active
23:28:12 2208 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.init" exists: Permission denied
23:28:12 2208 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:12 2208 INFO  Registering Postgres system 7036120085397108718 found at "/var/lib/postgresql/13/primary"
23:28:12 2208 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:12 2208 ERROR Failed to remove file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state.new": Permission denied
23:28:12 2208 FATAL Failed to create a state file prior to registering the node with the monitor, see above for details
23:28:12 2208 ERROR Failed to register the existing local Postgres node "192.168.56.4:5433" running at "/var/lib/postgresql/13/primary"to the pg_auto_failover monitor at postgres://autoctl_node@10.0.2.2:5003/pg_auto_failover?, see above for details
23:28:12 2198 ERROR pg_autoctl service node-active exited with exit status 12
23:28:12 2198 INFO  Restarting service node-active
23:28:12 2209 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.init" exists: Permission denied
23:28:12 2209 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:12 2209 INFO  Registering Postgres system 7036120085397108718 found at "/var/lib/postgresql/13/primary"
23:28:12 2209 ERROR Failed to check if file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state" exists: Permission denied
23:28:12 2209 ERROR Failed to remove file "/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/13/primary/pg_autoctl.state.new": Permission denied
23:28:12 2209 FATAL Failed to create a state file prior to registering the node with the monitor, see above for details
23:28:12 2209 ERROR Failed to register the existing local Postgres node "192.168.56.4:5433" running at "/var/lib/postgresql/13/primary"to the pg_auto_failover monitor at postgres://autoctl_node@10.0.2.2:5003/pg_auto_failover?, see above for details
23:28:12 2198 ERROR pg_autoctl service node-active exited with exit status 12
23:28:12 2198 FATAL pg_autoctl service node-active has already been restarted 5 times in the last 0 seconds, stopping now
23:28:12 2201 INFO  Postgres controller service received signal SIGTERM, terminating
23:28:12 2198 FATAL Something went wrong in sub-process supervision, stopping now. See above for details.
23:28:12 2198 INFO  Stop pg_autoctl
rheaton commented 2 years ago

AFAIK, the monitor does not store one of these. It should be initialized here for any data nodes.

When playing around with pg_autoctl in development, I make sure to clean up everything in my data-directories $HOME/.local/share/pgautoctl and $HOME/.config/pg_autoctl or things can get in a very confusing state.

I don't think I've seen the error you're seeing.

DimCitus commented 2 years ago

Hi @benschinn ; it seems to me that you have an “init state file” that exists and doesn't match the current situation of the local node. It could be explained by a previous run of pg_autoctl create postgres in the same PGDATA that failed for some reasons, followed by a manual or at least partial clean-up, before trying again. Does that match your situation?

schinns commented 2 years ago

thank you @rheaton and @DimCitus cleaning up helped!