vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.
MIT License
1.27k stars · 340 forks

postgrespro and postgresql_exists issue #594

Closed · iabdukhoshimov closed this issue 1 month ago

iabdukhoshimov commented 1 month ago

Hello! I have an issue running the Ansible setup. The problem is that I am installing Postgres Pro, not plain PostgreSQL, so I set postgresql_exists=true so that it won't be installed again or cause any conflicts. But after all that I got an error like this:

FAILED! changed: false elapsed: 120 msg: Timeout when waiting for host:8008

I have checked the server with sudo lsof -i -P | grep 8008 and the port is open; Postgres Pro is also accepting connections, and the logs from cat /var/log/messages look OK.

vitabaks commented 1 month ago

Hello!

please check the Patroni log:

sudo journalctl -u patroni -n 100
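
Worth adding here: port 8008 is Patroni's REST API, not Postgres itself, so an open socket in lsof only shows that something is listening, not that Patroni is healthy. A quick check (assuming Patroni's default REST API address, as implied by the timeout message) is to query the health endpoint directly:

```shell
# Ask Patroni's REST API for its health status (sketch; assumes the
# restapi listens on 127.0.0.1:8008, per the timeout message above).
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8008/health
# 200 means Patroni is up and Postgres is running on this node;
# 503 or a connection error would explain the Ansible wait_for timeout.
```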

vitabaks commented 1 month ago

FYI: an example of installing Postgres Pro: https://github.com/vitabaks/postgresql_cluster/issues/38

iabdukhoshimov commented 1 month ago

I have fixed that, but I got a new error:

FAILED! => {"changed": true, "cmd": ["/usr/pgsql-std-16/bin/pg_ctl", "status", "-D", "/var/lib/pgpro/std-16/data"], "delta": "0:00:00.004682", "end": "2024-03-06 19:30:49.738622", "msg": "non-zero return code", "rc": 3, "start": "2024-03-06 19:30:49.733940", "stderr": "", "stderr_lines": [], "stdout": "pg_ctl: no server running", "stdout_lines": ["pg_ctl: no server running"]}

vitabaks commented 1 month ago

In the context of which task did you receive the error? Please attach the Ansible log.

iabdukhoshimov commented 1 month ago

So I have two errors alternating and I don't know what to do. If I run

systemctl start postgrespro-16.service

I get an error like this:

timeout when waiting for the host:8008

but there are no negative log entries of any kind, just the success and fail messages.

And if I don't start Postgres, I get "no server running" instead.

iabdukhoshimov commented 1 month ago

These are the logs I got from journalctl -u patroni -n 100:

-- The unit user-runtime-dir@0.service has successfully entered the 'dead' state.
Mar 06 19:48:22 pgnode01 systemd[1]: Stopped User runtime directory /run/user/0.
-- Subject: Unit user-runtime-dir@0.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit user-runtime-dir@0.service has finished shutting down.
Mar 06 19:48:22 pgnode01 systemd[1]: Removed slice User Slice of UID 0.
-- Subject: Unit user-0.slice has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit user-0.slice has finished shutting down.
Mar 06 19:50:07 pgnode01 sudo[7477]: pam_unix(sudo:session): session closed for user root
Mar 06 19:50:25 pgnode01 sudo[7659]:     root : TTY=pts/1 ; PWD=/home/user ; USER=root ; COMMAND=/bin/journalctl -xe
Mar 06 19:50:25 pgnode01 sudo[7659]: pam_unix(sudo:session): session opened for user root by user(uid=0)

iabdukhoshimov commented 1 month ago

I guess I found something. In

cat /var/log/messages

one thing I noticed is this:

2024-03-06 19:48:12,633 CRITICAL: system ID mismatch, node pgnode01 belongs to a different cluster: 7342431893358751004 != 7343108653944876472
Mar  6 19:48:13 localhost systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
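
The identifier Patroni is comparing here is the "Database system identifier" stored in the data directory's pg_control file, so the mismatch can be confirmed on disk. A sketch of that check, using the Postgres Pro std-16 paths from the pg_ctl error earlier in this thread:

```shell
# Print the on-disk system identifier of the local data directory
# (paths assume the Postgres Pro std-16 layout seen in this thread).
/usr/pgsql-std-16/bin/pg_controldata /var/lib/pgpro/std-16/data | grep -i 'system identifier'
# Compare this value with the two IDs in the CRITICAL log line above;
# the local one must match what is registered in DCS for the node to join.
```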

vitabaks commented 1 month ago

system ID mismatch means that you are deploying a cluster with the same name as one that already exists in DCS.

Make sure that there are no conflicts in the cluster name if you have several Postgres clusters using the same DCS cluster (etcd/consul). If it is a local DCS cluster on the same node as the database, you can delete the stale cluster entry using patronictl remove,

or just change the Patroni cluster name before deployment.
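
The patronictl remove step mentioned above might look like this (a sketch: the config path follows this repository's default, and the cluster name postgres-cluster is a placeholder for your own patroni_cluster_name):

```shell
# Show the cluster(s) this Patroni config knows about, as a sanity check.
sudo patronictl -c /etc/patroni/patroni.yml list

# Delete the stale cluster entry from DCS (etcd/consul). patronictl
# prompts you to confirm the cluster name before deleting anything.
sudo patronictl -c /etc/patroni/patroni.yml remove postgres-cluster
```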

iabdukhoshimov commented 1 month ago

Hi there again!

I found the cause of this error: if you had Postgres installed before and then install another Postgres version (or the same version again) on the same machine, you will get this error. In my case I even ran

sudo rm -rf /var/lib/pgsql/

but still got the cluster ID error after that. I then ran this Ansible playbook against a fresh server and got my cluster.

The easy way to fix this is just to erase everything and install from scratch.
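
For what it's worth, a full server wipe may not be required. Under the assumption that the stale state lives only in the old data directory and the DCS entry, a more targeted reset might look like this (paths follow the Postgres Pro std-16 layout from the pg_ctl error earlier in the thread):

```shell
# Sketch of a targeted reset instead of rebuilding the whole server.
sudo systemctl stop patroni

# Postgres Pro keeps its data under /var/lib/pgpro, not /var/lib/pgsql,
# which may be why removing /var/lib/pgsql alone did not help here.
sudo rm -rf /var/lib/pgpro/std-16/data

# Also drop the stale cluster entry from DCS before re-running the
# playbook, e.g.: patronictl -c /etc/patroni/patroni.yml remove <name>
```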