vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on Patroni). Automating with Ansible.
https://postgresql-cluster.org
MIT License

Stuck at Wait for port 8008 to become open on the host #738

Open snoby opened 3 weeks ago

snoby commented 3 weeks ago

This is an amazing Ansible project you've put together here, very impressive. I've set the project up to enable haproxy, cluster_vip, and etcd, and changed it to run on premises (it will run in AWS eventually, but I don't want to use their load balancers or anything like that). I'm running it in a lab right now.

I worked through a couple of issues that I had with the install; specifically, I had to disable a few items that my minimal kernel does not support (details further below).

I've checked the systemctl status of patroni.service:

root@vm-pg-1:/etc/systemd/system# systemctl status patroni.service
● patroni.service - Runners to orchestrate a high-availability PostgreSQL - Patroni
     Loaded: loaded (/etc/systemd/system/patroni.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-08-22 12:11:38 UTC; 2min 4s ago
   Main PID: 90017 (patroni)
      Tasks: 6 (limit: 38523)
     Memory: 29.8M
        CPU: 935ms
     CGroup: /system.slice/patroni.service
             └─90017 /usr/bin/python3 /usr/bin/patroni /etc/patroni/patroni.yml

Aug 22 12:12:59 vm-pg-1 patroni[90017]: 2024-08-22 12:12:59,376 INFO: Lock owner: None; I am vm-pg-1
Aug 22 12:12:59 vm-pg-1 patroni[90017]: 2024-08-22 12:12:59,377 INFO: waiting for leader to bootstrap
Aug 22 12:13:09 vm-pg-1 patroni[90017]: 2024-08-22 12:13:09,376 INFO: Lock owner: None; I am vm-pg-1

This seems like an odd line, since this SHOULD be the master. I've checked with lsof, and the port is open:

root@vm-pg-1:/etc/systemd/system# lsof -i:8008
COMMAND   PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
patroni 90017 postgres    6u  IPv4 261802      0t0  TCP *:8008 (LISTEN)
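
For what it's worth, the Patroni REST API on that same port can confirm what the agent itself reports. A quick check (the /patroni endpoint is part of Patroni's standard REST API):

curl -s http://localhost:8008/patroni
# Returns a JSON document including "role" and "state"; on a node stuck
# "waiting for leader to bootstrap", state will not be "running".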

There is nothing in any of the files in the /var/log/postgresql logs directory. I've checked with ps aux: postgres is not running, but patroni is. Any suggestions on what else to look at? (The last part of the Ansible run is below.)

TASK [patroni : Start patroni service on the Master server] ****************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.191] => {"changed": true, "enabled": true, "name": "patroni", "state": "started", "status": {"ActiveEnterTimestamp": "n/a", "ActiveEnterTimestampMonotonic": "0", "ActiveExitTimestamp": "n/a", "ActiveExitTimestampMonotonic": "0", "ActiveState": "inactive", "After": "system.slice sysinit.target syslog.target basic.target systemd-journald.socket network.target", "AllowIsolate": "no", "AssertResult": "no", "AssertTimestamp": "n/a", "AssertTimestampMonotonic": "0", "Before": "shutdown.target", "BlockIOAccounting": "no", "BlockIOWeight": "[not set]", "CPUAccounting": "yes", "CPUAffinityFromNUMA": "no", "CPUQuotaPerSecUSec": "infinity", "CPUQuotaPeriodUSec": "infinity", "CPUSchedulingPolicy": "0", "CPUSchedulingPriority": "0", "CPUSchedulingResetOnFork": "no", "CPUShares": "[not set]", "CPUUsageNSec": "[not set]", "CPUWeight": "[not set]", "CacheDirectoryMode": "0755", "CanFreeze": "yes", "CanIsolate": "no", "CanReload": "yes", "CanStart": "yes", "CanStop": "yes", "CapabilityBoundingSet": "cap_chown cap_dac_override cap_dac_read_search cap_fowner cap_fsetid cap_kill cap_setgid cap_setuid cap_setpcap cap_linux_immutable cap_net_bind_service cap_net_broadcast cap_net_admin cap_net_raw cap_ipc_lock cap_ipc_owner cap_sys_module cap_sys_rawio cap_sys_chroot cap_sys_ptrace cap_sys_pacct cap_sys_admin cap_sys_boot cap_sys_nice cap_sys_resource cap_sys_time cap_sys_tty_config cap_mknod cap_lease cap_audit_write cap_audit_control cap_setfcap cap_mac_override cap_mac_admin cap_syslog cap_wake_alarm cap_block_suspend cap_audit_read cap_perfmon cap_bpf cap_checkpoint_restore", "CleanResult": "success", "CollectMode": "inactive", "ConditionResult": "no", "ConditionTimestamp": "n/a", "ConditionTimestampMonotonic": "0", "ConfigurationDirectoryMode": "0755", "Conflicts": "shutdown.target", "ControlPID": "0", "CoredumpFilter": "0x33", "DefaultDependencies": "yes", "DefaultMemoryLow": "0", "DefaultMemoryMin": "0", "Delegate": "no", "Description": "Runners to orchestrate a high-availability PostgreSQL - Patroni", "DevicePolicy": "auto", "DynamicUser": "no", "EnvironmentFiles": "/etc/patroni_env.conf (ignore_errors=yes)", "ExecMainCode": "0", "ExecMainExitTimestamp": "n/a", "ExecMainExitTimestampMonotonic": "0", "ExecMainPID": "0", "ExecMainStartTimestamp": "n/a", "ExecMainStartTimestampMonotonic": "0", "ExecMainStatus": "0", "ExecReload": "{ path=/bin/kill ; argv[]=/bin/kill -s HUP $MAINPID ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }", "ExecReloadEx": "{ path=/bin/kill ; argv[]=/bin/kill -s HUP $MAINPID ; flags= ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }", "ExecStart": "{ path=/usr/bin/patroni ; argv[]=/usr/bin/patroni /etc/patroni/patroni.yml ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }", "ExecStartEx": "{ path=/usr/bin/patroni ; argv[]=/usr/bin/patroni /etc/patroni/patroni.yml ; flags= ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }", "FailureAction": "none", "FileDescriptorStoreMax": "0", "FinalKillSignal": "9", "FragmentPath": "/etc/systemd/system/patroni.service", "FreezerState": "running", "GID": "[not set]", "Group": "postgres", "GuessMainPID": "yes", "IOAccounting": "no", "IOReadBytes": "18446744073709551615", "IOReadOperations": "18446744073709551615", "IOSchedulingClass": "2", "IOSchedulingPriority": "4", "IOWeight": "[not set]", "IOWriteBytes": "18446744073709551615", "IOWriteOperations": "18446744073709551615", "IPAccounting": "no", 
"IPEgressBytes": "[no data]", "IPEgressPackets": "[no data]", "IPIngressBytes": "[no data]", "IPIngressPackets": "[no data]", "Id": "patroni.service", "IgnoreOnIsolate": "no", "IgnoreSIGPIPE": "yes", "InactiveEnterTimestamp": "n/a", "InactiveEnterTimestampMonotonic": "0", "InactiveExitTimestamp": "n/a", "InactiveExitTimestampMonotonic": "0", "JobRunningTimeoutUSec": "infinity", "JobTimeoutAction": "none", "JobTimeoutUSec": "infinity", "KeyringMode": "private", "KillMode": "process", "KillSignal": "15", "LimitAS": "infinity", "LimitASSoft": "infinity", "LimitCORE": "infinity", "LimitCORESoft": "0", "LimitCPU": "infinity", "LimitCPUSoft": "infinity", "LimitDATA": "infinity", "LimitDATASoft": "infinity", "LimitFSIZE": "infinity", "LimitFSIZESoft": "infinity", "LimitLOCKS": "infinity", "LimitLOCKSSoft": "infinity", "LimitMEMLOCK": "65536", "LimitMEMLOCKSoft": "65536", "LimitMSGQUEUE": "819200", "LimitMSGQUEUESoft": "819200", "LimitNICE": "0", "LimitNICESoft": "0", "LimitNOFILE": "524288", "LimitNOFILESoft": "1024", "LimitNPROC": "128412", "LimitNPROCSoft": "128412", "LimitRSS": "infinity", "LimitRSSSoft": "infinity", "LimitRTPRIO": "0", "LimitRTPRIOSoft": "0", "LimitRTTIME": "infinity", "LimitRTTIMESoft": "infinity", "LimitSIGPENDING": "128412", "LimitSIGPENDINGSoft": "128412", "LimitSTACK": "infinity", "LimitSTACKSoft": "8388608", "LoadState": "loaded", "LockPersonality": "no", "LogLevelMax": "-1", "LogRateLimitBurst": "0", "LogRateLimitIntervalUSec": "0", "LogsDirectoryMode": "0755", "MainPID": "0", "ManagedOOMMemoryPressure": "auto", "ManagedOOMMemoryPressureLimit": "0", "ManagedOOMPreference": "none", "ManagedOOMSwap": "auto", "MemoryAccounting": "yes", "MemoryAvailable": "infinity", "MemoryCurrent": "[not set]", "MemoryDenyWriteExecute": "no", "MemoryHigh": "infinity", "MemoryLimit": "infinity", "MemoryLow": "0", "MemoryMax": "infinity", "MemoryMin": "0", "MemorySwapMax": "infinity", "MountAPIVFS": "no", "NFileDescriptorStore": "0", "NRestarts": "0", "NUMAPolicy": "n/a", "Names": "patroni.service", "NeedDaemonReload": "no", "Nice": "0", "NoNewPrivileges": "no", "NonBlocking": "no", "NotifyAccess": "none", "OOMPolicy": "stop", "OOMScoreAdjust": "0", "OnFailureJobMode": "replace", "OnSuccessJobMode": "fail", "Perpetual": "no", "PrivateDevices": "no", "PrivateIPC": "no", "PrivateMounts": "no", "PrivateNetwork": "no", "PrivateTmp": "no", "PrivateUsers": "no", "ProcSubset": "all", "ProtectClock": "no", "ProtectControlGroups": "no", "ProtectHome": "no", "ProtectHostname": "no", "ProtectKernelLogs": "no", "ProtectKernelModules": "no", "ProtectKernelTunables": "no", "ProtectProc": "default", "ProtectSystem": "no", "RefuseManualStart": "no", "RefuseManualStop": "no", "ReloadResult": "success", "RemainAfterExit": "no", "RemoveIPC": "no", "Requires": "sysinit.target system.slice", "Restart": "on-failure", "RestartKillSignal": "15", "RestartUSec": "100ms", "RestrictNamespaces": "no", "RestrictRealtime": "no", "RestrictSUIDSGID": "no", "Result": "success", "RootDirectoryStartOnly": "no", "RuntimeDirectoryMode": "0755", "RuntimeDirectoryPreserve": "no", "RuntimeMaxUSec": "infinity", "SameProcessGroup": "no", "SecureBits": "0", "SendSIGHUP": "no", "SendSIGKILL": "yes", "Slice": "system.slice", "StandardError": "inherit", "StandardInput": "null", "StandardOutput": "journal", "StartLimitAction": "none", "StartLimitBurst": "5", "StartLimitIntervalUSec": "10s", "StartupBlockIOWeight": "[not set]", "StartupCPUShares": "[not set]", "StartupCPUWeight": "[not set]", "StartupIOWeight": "[not set]", 
"StateChangeTimestamp": "n/a", "StateChangeTimestampMonotonic": "0", "StateDirectoryMode": "0755", "StatusErrno": "0", "StopWhenUnneeded": "no", "SubState": "dead", "SuccessAction": "none", "SyslogFacility": "3", "SyslogLevel": "6", "SyslogLevelPrefix": "yes", "SyslogPriority": "30", "SystemCallErrorNumber": "2147483646", "TTYReset": "no", "TTYVHangup": "no", "TTYVTDisallocate": "no", "TasksAccounting": "yes", "TasksCurrent": "[not set]", "TasksMax": "38523", "TimeoutAbortUSec": "1min", "TimeoutCleanUSec": "infinity", "TimeoutStartFailureMode": "terminate", "TimeoutStartUSec": "1min", "TimeoutStopFailureMode": "terminate", "TimeoutStopUSec": "1min", "TimerSlackNSec": "50000", "Transient": "no", "Type": "simple", "UID": "[not set]", "UMask": "0022", "UnitFilePreset": "enabled", "UnitFileState": "disabled", "User": "postgres", "UtmpMode": "init", "WatchdogSignal": "6", "WatchdogTimestamp": "n/a", "WatchdogTimestampMonotonic": "0", "WatchdogUSec": "infinity"}}

TASK [patroni : Wait for port 8008 to become open on the host] *************************************************************************************************************************************************************************************************************************************
ok: [10.0.0.191] => {"changed": false, "elapsed": 10, "match_groupdict": {}, "match_groups": [], "path": null, "port": 8008, "search_regex": null, "state": "started"}
FAILED - RETRYING: [10.0.0.191]: Check PostgreSQL is started and accepting connections on Master (1000 retries left).
FAILED - RETRYING: [10.0.0.191]: Check PostgreSQL is started and accepting connections on Master (999 retries left).
FAILED - RETRYING: [10.0.0.191]: Check PostgreSQL is started and accepting connections on Master (998 retries left).
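
A manual equivalent of that failing check, run on the host itself (a sketch, assuming the default port 5432; the exact command the task runs may differ):

pg_isready -h 127.0.0.1 -p 5432
# Prints "accepting connections" once PostgreSQL is up; while Patroni is
# still waiting to bootstrap, this reports "no response".
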
vitabaks commented 3 weeks ago

Hello @snoby, thanks for the feedback!

> will run in AWS but I don't want to use their load balancers

Would you describe why?

In cloud environments it is usually difficult to organize a VIP (e.g., with keepalived or vip-manager), so I added support for the cloud provider's load balancer.

> I worked through a couple of issues that I had with the install; specifically, I had to disable a few items that my minimal kernel does not support

Which version of the distribution and kernel are you using?

> I've checked the systemctl status of patroni.service

I am confused by the presence of "waiting for leader to bootstrap". Could you attach the Patroni log?

Are you using a dedicated etcd cluster? Wasn't there already a cluster with that name?
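
For reference, you can list the keys Patroni keeps in etcd to see whether state from a previous cluster with the same name is still there (a sketch, assuming etcd v3, Patroni's default /service namespace, and the etcd endpoint from the log below):

ETCDCTL_API=3 etcdctl --endpoints=http://10.0.0.116:2379 get /service/ --prefix --keys-only
# A leftover /service/<cluster-name>/initialize key from an earlier deploy
# can leave a fresh node waiting for a leader that will never appear.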

snoby commented 3 weeks ago

> Would you describe why?

I run the exact same setup in local on-prem test labs, so I don't want to use anything AWS-specific. We have alpha and integration environments, and it is quite important that the setup matches production; since most of what we do needs nothing AWS-specific (i.e., ELB), we want to keep everything agnostic.

> Which version of the distribution and kernel are you using?

My test environment is running in LXD; that being said, I'm running the Postgres machines in a VM on LXD. I'm running Ubuntu 22.04, release 20240808:

uname -a
Linux vm-pg-1 5.15.0-1064-kvm #69-Ubuntu SMP Wed Jul 17 12:14:19 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"

> I am confused by the presence of "waiting for leader to bootstrap". Could you attach the Patroni log?

This is all that is reported after restarting the patroni service (at least, all that is in journalctl):

Aug 22 20:38:01 vm-pg-1 patroni[94112]: 2024-08-22 20:38:01,901 INFO: waiting for leader to bootstrap
Aug 22 20:38:11 vm-pg-1 patroni[94112]: 2024-08-22 20:38:11,855 INFO: Lock owner: None; I am vm-pg-1
Aug 22 20:38:11 vm-pg-1 patroni[94112]: 2024-08-22 20:38:11,855 INFO: waiting for leader to bootstrap
Aug 22 20:38:13 vm-pg-1 systemd[1]: Stopping Runners to orchestrate a high-availability PostgreSQL - Patroni...
Aug 22 20:38:13 vm-pg-1 systemd[1]: patroni.service: Deactivated successfully.
Aug 22 20:38:13 vm-pg-1 systemd[1]: Stopped Runners to orchestrate a high-availability PostgreSQL - Patroni.
Aug 22 20:38:13 vm-pg-1 systemd[1]: patroni.service: Consumed 2min 4.833s CPU time.
Aug 22 20:38:13 vm-pg-1 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL - Patroni.
Aug 22 20:38:14 vm-pg-1 patroni[151865]: 2024-08-22 20:38:14,094 INFO: Selected new etcd server http://10.0.0.116:2379
Aug 22 20:38:14 vm-pg-1 patroni[151865]: 2024-08-22 20:38:14,150 INFO: No PostgreSQL configuration items changed, nothing to reload.
Aug 22 20:38:14 vm-pg-1 patroni[151865]: 2024-08-22 20:38:14,155 INFO: Lock owner: None; I am vm-pg-1
Aug 22 20:38:14 vm-pg-1 patroni[151865]: 2024-08-22 20:38:14,202 INFO: waiting for leader to bootstrap
Aug 22 20:38:24 vm-pg-1 patroni[151865]: 2024-08-22 20:38:24,156 INFO: Lock owner: None; I am vm-pg-1
Aug 22 20:38:24 vm-pg-1 patroni[151865]: 2024-08-22 20:38:24,201 INFO: waiting for leader to bootstrap
Aug 22 20:38:34 vm-pg-1 patroni[151865]: 2024-08-22 20:38:34,156 INFO: Lock owner: None; I am vm-pg-1
Aug 22 20:38:34 vm-pg-1 patroni[151865]: 2024-08-22 20:38:34,156 INFO: waiting for leader to bootstrap

> Are you using a dedicated etcd cluster? Wasn't there already a cluster with that name?

I set up new machines for this. The first install failed, so I had to run the command to delete the cluster and do a new deploy.

The first attempt at an install failed because my kernel does not have kernel.sched_autogroup_enabled, so I commented that out; the Start disable-transparent-huge-pages service handler also failed:

changed: [10.0.0.201] => (item={'name': 'kernel.numa_balancing', 'value': '0'})
failed: [10.0.0.191] (item={'name': 'kernel.sched_autogroup_enabled', 'value': '0'}) => {"ansible_loop_var": "item", "changed": false, "item": {"name": "kernel.sched_autogroup_enabled", "value": "0"}, "msg": "setting kernel.sched_autogroup_enabled failed: sysctl: cannot stat /proc/sys/kernel/sched_autogroup_enabled: No such file or directory\n"}
failed: [10.0.0.201] (item={'name': 'kernel.sched_autogroup_enabled', 'value': '0'}) => {"ansible_loop_var": "item", "changed": false, "item": {"name": "kernel.sched_autogroup_enabled", "value": "0"}, "msg": "setting kernel.sched_autogroup_enabled failed: sysctl: cannot stat /proc/sys/kernel/sched_autogroup_enabled: No such file or directory\n"}
changed: [10.0.0.191] => (item={'name': 'net.ipv4.ip_nonlocal_bind', 'value': '1'})
changed: [10.0.0.201] => (item={'name': 'net.ipv4.ip_nonlocal_bind', 'value': '1'})
changed: [10.0.0.201] => (item={'name': 'net.ipv4.ip_forward', 'value': '1'})
changed: [10.0.0.191] => (item={'name': 'net.ipv4.ip_forward', 'value': '1'})
changed: [10.0.0.201] => (item={'name': 'net.ipv4.ip_local_port_range', 'value': '10000 65535'})
changed: [10.0.0.201] => (item={'name': 'net.core.netdev_max_backlog', 'value': '10000'})
changed: [10.0.0.201] => (item={'name': 'net.ipv4.tcp_max_syn_backlog', 'value': '8192'})
changed: [10.0.0.201] => (item={'name': 'net.core.somaxconn', 'value': '65535'})
changed: [10.0.0.201] => (item={'name': 'net.ipv4.tcp_tw_reuse', 'value': '1'})
...ignoring
changed: [10.0.0.191] => (item={'name': 'net.ipv4.ip_local_port_range', 'value': '10000 65535'})
changed: [10.0.0.191] => (item={'name': 'net.core.netdev_max_backlog', 'value': '10000'})
changed: [10.0.0.191] => (item={'name': 'net.ipv4.tcp_max_syn_backlog', 'value': '8192'})
changed: [10.0.0.191] => (item={'name': 'net.core.somaxconn', 'value': '65535'})
changed: [10.0.0.191] => (item={'name': 'net.ipv4.tcp_tw_reuse', 'value': '1'})
...ignoring
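
A quick way to check whether a given tunable exists before the playbook tries to set it (a sketch; the -kvm kernel flavour omits several scheduler knobs):

# The sysctl task fails because the tunable is absent from /proc/sys:
test -f /proc/sys/kernel/sched_autogroup_enabled && echo supported || echo missing
# If it prints "missing", drop that entry from the kernel-parameter list in
# the playbook's vars (as was done here) rather than trying to set it.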

TASK [transparent_huge_pages : Create systemd service "disable-transparent-huge-pages.service"] ****************************************************************************************************************************************************************************************************
changed: [10.0.0.191]
changed: [10.0.0.201]

RUNNING HANDLER [transparent_huge_pages : Start disable-transparent-huge-pages service] ************************************************************************************************************************************************************************************************************
fatal: [10.0.0.201]: FAILED! => {"changed": false, "msg": "Unable to start service disable-transparent-huge-pages: Job for disable-transparent-huge-pages.service failed because the control process exited with error code.\nSee \"systemctl status disable-transparent-huge-pages.service\" and \"journalctl -xeu disable-transparent-huge-pages.service\" for details.\n"}
fatal: [10.0.0.191]: FAILED! => {"changed": false, "msg": "Unable to start service disable-transparent-huge-pages: Job for disable-transparent-huge-pages.service failed because the control process exited with error code.\nSee \"systemctl status disable-transparent-huge-pages.service\" and \"journalctl -xeu disable-transparent-huge-pages.service\" for details.\n"}
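
The service created above simply writes to the THP control file, so the failure can be reproduced by hand (on kernels built without CONFIG_TRANSPARENT_HUGEPAGE the file does not exist at all):

cat /sys/kernel/mm/transparent_hugepage/enabled
# On a kernel with THP support this prints e.g. "always [madvise] never";
# on the minimal -kvm kernel the path is missing, so the service's write fails.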

On the next run, the pgbouncer/config task failed:

TASK [pgbouncer/config : Create 'user_search' function for pgbouncer 'auth_query' option] **********************************************************************************************************************************************************************************************************
fatal: [10.0.0.191]: FAILED! => {"changed": true, "cmd": ["/usr/lib/postgresql/16/bin/psql", "-p", "5432", "-U", "postgres", "-d", "postgres", "-tAXc", "CREATE FUNCTION user_search(uname TEXT) RETURNS TABLE (usename name, passwd text) AS $$ SELECT usename, passwd FROM pg_shadow WHERE usename=$1; $$ LANGUAGE sql SECURITY DEFINER; REVOKE ALL ON FUNCTION user_search(uname TEXT) FROM public; GRANT EXECUTE ON FUNCTION user_search(uname TEXT) TO pgbouncer"], "delta": "0:00:00.033299", "end": "2024-08-22 04:41:04.426189", "msg": "non-zero return code", "rc": 1, "start": "2024-08-22 04:41:04.392890", "stderr": "ERROR:  role \"pgbouncer\" does not exist", "stderr_lines": ["ERROR:  role \"pgbouncer\" does not exist"], "stdout": "CREATE FUNCTION\nREVOKE", "stdout_lines": ["CREATE FUNCTION", "REVOKE"]}
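
The error means the GRANT at the end of that statement targets a role that does not exist yet. The playbook normally creates the pgbouncer user itself; as a manual stopgap, the role could be created first (a sketch; the role name comes from the error above, the password is a placeholder):

psql -U postgres -d postgres -c "CREATE ROLE pgbouncer WITH LOGIN PASSWORD 'change-me'"
# With the role in place, the GRANT EXECUTE ... TO pgbouncer in the failing
# task would succeed.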

Then, after another delete and re-create of the cluster, here we are.

vitabaks commented 3 weeks ago

role \"pgbouncer\" does not exist"

Please attach the vars directory.

snoby commented 3 weeks ago

vars.zip

vitabaks commented 3 weeks ago

I see that you manually specified vm.nr_hugepages for about 8 GB (although this is automated), while you left the automatic setting for shared_buffers.

Please tell me, what is the size of RAM on the server? Since shared_buffers is set to 25% of available memory here, there is a risk that you have specified an insufficient number of vm.nr_hugepages.
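
To illustrate the sizing (a rough sketch, assuming the default 2 MB huge page size and the numbers from this thread): with 32 GB of RAM, shared_buffers at 25% is 8 GB, which alone needs 8192 MB / 2 MB = 4096 huge pages, plus headroom for the rest of the shared memory segment.

grep Hugepagesize /proc/meminfo   # typically "Hugepagesize: 2048 kB"
echo $(( 8192 / 2 ))              # 4096 pages just for shared_buffers
# vm.nr_hugepages must be at least this, so a value sized for exactly 8 GB
# leaves no room for PostgreSQL's other shared memory.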

vitabaks commented 3 weeks ago

> There is nothing in any of the files in the /var/log/postgresql logs directory.

That's strange; there should be log files in there. Their absence makes diagnosis difficult.

Try to start Postgres manually.
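
One way to do that in the foreground, so errors land on the terminal instead of the (empty) log directory (a sketch, assuming the Debian-default binary and data directory paths seen earlier in this thread; adjust to your layout):

sudo -u postgres /usr/lib/postgresql/16/bin/postgres -D /var/lib/postgresql/16/main
# A foreground start prints the startup error directly; stop it with Ctrl+C
# before handing control back to Patroni.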

snoby commented 3 weeks ago

I commented out the shared-memory value in vars/system.yml (which was 8 GB). The system has 32 GB of RAM in total. I had a huge entry detailing my debugging steps, and then I finally just said "start from scratch" and started over again. What's interesting is that on the reinstall it fails in the transparent-huge-pages disablement:

Aug 23 13:26:49 vm-pg-1 systemd[1]: update-notifier-download.service: Deactivated successfully.
Aug 23 13:26:49 vm-pg-1 systemd[1]: Finished Download data for packages that failed at package install time.
Aug 23 13:26:49 vm-pg-1 systemd[1]: Starting Disable Transparent Huge Pages...
Aug 23 13:26:49 vm-pg-1 bash[7695]: /bin/bash: line 1: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory
Aug 23 13:26:49 vm-pg-1 systemd[1]: disable-transparent-huge-pages.service: Main process exited, code=exited, status=1/FAILURE
Aug 23 13:26:49 vm-pg-1 systemd[1]: disable-transparent-huge-pages.service: Failed with result 'exit-code'.
Aug 23 13:26:49 vm-pg-1 systemd[1]: Failed to start Disable Transparent Huge Pages.

It makes sense that the file would not be there; I just commented that line out of the playbook.

But I'm still stuck at the last step, where the playbook is waiting on port 8008:

root@vm-pg-1:/var/log/postgresql# ll
total 12
drwx------  2 postgres postgres 4096 Aug 23 13:40 ./
drwxrwxr-x 11 root     syslog   4096 Aug 23 13:39 ../
-rw-r--r--  1 postgres postgres  640 Aug 23 13:39 pgbouncer.log
-rw-r-----  1 postgres postgres    0 Aug 23 13:40 postgresql-16-main.log
root@vm-pg-1:/var/log/postgresql# cat pgbouncer.log
2024-08-23 13:39:45.872 UTC [9330] LOG kernel file descriptor limit: 1024 (hard: 524288); max_client_conn: 100, max expected fd use: 112
2024-08-23 13:39:45.873 UTC [9330] LOG listening on 127.0.0.1:6432
2024-08-23 13:39:45.874 UTC [9330] LOG listening on unix:/var/run/postgresql/.s.PGSQL.6432
2024-08-23 13:39:45.874 UTC [9330] LOG process up: PgBouncer 1.23.1, libevent 2.1.12-stable (epoll), adns: c-ares 1.18.1, tls: OpenSSL 3.0.2 15 Mar 2022
2024-08-23 13:39:49.907 UTC [9330] LOG got SIGINT, shutting down, waiting for all servers connections to be released
2024-08-23 13:39:50.206 UTC [9330] LOG server connections dropped, exiting

Hacking around with some commands:

root@vm-pg-1:/etc/postgresql/16/main# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres-cluster (7405819144232212623) ----------+
| Member  | Host       | Role    | State   | TL | Lag in MB |
+---------+------------+---------+---------+----+-----------+
| vm-pg-1 | 10.0.0.102 | Replica | stopped |    |   unknown |
+---------+------------+---------+---------+----+-----------+

Strange that it thinks it's a replica.
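
One possibility, given the earlier failed deploys: stale cluster state in the DCS can leave a node stuck as a stopped replica "waiting for leader to bootstrap". Patroni ships a command to wipe that state (destructive: it deletes the cluster's entry in etcd, so only use it when rebuilding from scratch):

patronictl -c /etc/patroni/patroni.yml remove postgres-cluster
# patronictl asks you to confirm by typing the cluster name before it
# deletes the /service/postgres-cluster keys from the DCS.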

mrcloudbook commented 2 weeks ago
[screenshot attached: Screenshot 2024-08-26 at 2.01.31 PM]

We are facing the same issue while adding a new node.

vitabaks commented 2 weeks ago

Here, without logs, it is impossible to understand why the database does not start.

Try deploying the cluster according to these instructions (without manual changes): https://postgresql-cluster.org/deployment/aws

vitabaks commented 1 week ago

Is the problem still relevant? Was it possible to deploy the cluster?

vitabaks commented 1 week ago

A similar (solved) issue: https://github.com/vitabaks/postgresql_cluster/issues/747