vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.
MIT License

Wait for port 8008 to become open on the host #396

Closed fatmaAliGamal closed 8 months ago

fatmaAliGamal commented 10 months ago
TASK [patroni : Start patroni service on the Master server] **********************************************************************
changed: [10.0.4.61]

TASK [patroni : Wait for port 8008 to become open on the host] *******************************************************************
ok: [10.0.4.61]

TASK [patroni : Check PostgreSQL is started and accepting connections on Master] *************************************************
ok: [10.0.4.61]

TASK [patroni : Wait for the cluster to initialize (master is the leader with the lock)] *****************************************
ok: [10.0.4.61]

TASK [patroni : Prepare PostgreSQL | generate pg_hba.conf] ***********************************************************************
changed: [10.0.5.43]
changed: [10.0.4.61]
changed: [10.0.4.201]

TASK [patroni : Prepare PostgreSQL | reload for apply the pg_hba.conf] ***********************************************************
ok: [10.0.4.201]
ok: [10.0.5.43]
changed: [10.0.4.61]

TASK [patroni : Start patroni service on Replica servers] ************************************************************************
changed: [10.0.4.201]
changed: [10.0.5.43]

TASK [patroni : Wait for port 8008 to become open on the host] *******************************************************************
fatal: [10.0.5.43]: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for 10.0.5.43:8008"}
fatal: [10.0.4.201]: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for 10.0.4.201:8008"}

This happens when I use pgbackrest.

vitabaks commented 10 months ago

Check the Patroni log on the replicas:

sudo journalctl -u patroni

fatmaAliGamal commented 10 months ago
Starting Runners to orchestrate a high-availability PostgreSQL - Patroni...
Jul 06 05:08:22 ip-10-0-5-31 sudo[14508]: postgres : PWD=/ ; USER=root ; COMMAND=/sbin/modprobe softdog
Jul 06 05:08:22 ip-10-0-5-31 sudo[14508]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=115)
Jul 06 05:08:22 ip-10-0-5-31 sudo[14508]: pam_unix(sudo:session): session closed for user root
Jul 06 05:08:22 ip-10-0-5-31 sudo[14510]: postgres : PWD=/ ; USER=root ; COMMAND=/bin/chown postgres /dev/watchdog
Jul 06 05:08:22 ip-10-0-5-31 sudo[14510]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=115)
Jul 06 05:08:22 ip-10-0-5-31 sudo[14510]: pam_unix(sudo:session): session closed for user root
Jul 06 05:08:22 ip-10-0-5-31 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL - Patroni.
Jul 06 05:08:23 ip-10-0-5-31 patroni[14512]: 2023-07-06 05:08:23,253 INFO: Selected new etcd server http://10.0.4.57:2379
Jul 06 05:08:23 ip-10-0-5-31 patroni[14512]: 2023-07-06 05:08:23,262 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jul 06 05:08:23 ip-10-0-5-31 patroni[14512]: 2023-07-06 05:08:23,313 CRITICAL: system ID mismatch, node ip-10-0-5-31 belongs to a different cluster: 7252567923797276858 != 7252567842164459521
Jul 06 05:08:23 ip-10-0-5-31 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 06 05:08:23 ip-10-0-5-31 systemd[1]: patroni.service: Failed with result 'exit-code'.
Jul 06 05:08:24 ip-10-0-5-31 systemd[1]: patroni.service: Scheduled restart job, restart counter is at 1.
Jul 06 05:08:24 ip-10-0-5-31 systemd[1]: Stopped Runners to orchestrate a high-availability PostgreSQL - Patroni.
Jul 06 05:08:24 ip-10-0-5-31 systemd[1]: Starting Runners to orchestrate a high-availability PostgreSQL - Patroni...
Jul 06 05:08:24 ip-10-0-5-31 sudo[14537]: postgres : PWD=/ ; USER=root ; COMMAND=/sbin/modprobe softdog
Jul 06 05:08:24 ip-10-0-5-31 sudo[14537]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=115)
Jul 06 05:08:24 ip-10-0-5-31 sudo[14537]: pam_unix(sudo:session): session closed for user root
Jul 06 05:08:24 ip-10-0-5-31 sudo[14539]: postgres : PWD=/ ; USER=root ; COMMAND=/bin/chown postgres /dev/watchdog
Jul 06 05:08:24 ip-10-0-5-31 sudo[14539]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=115)
Jul 06 05:08:24 ip-10-0-5-31 sudo[14539]: pam_unix(sudo:session): session closed for user root
Jul 06 05:08:24 ip-10-0-5-31 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL - Patroni.
Jul 06 05:08:24 ip-10-0-5-31 patroni[14541]: 2023-07-06 05:08:24,799 INFO: Selected new etcd server http://10.0.4.57:2379
Jul 06 05:08:24 ip-10-0-5-31 patroni[14541]: 2023-07-06 05:08:24,809 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jul 06 05:08:24 ip-10-0-5-31 patroni[14541]: 2023-07-06 05:08:24,813 CRITICAL: system ID mismatch, node ip-10-0-5-31 belongs to a different cluster: 7252567923797276858 != 7252567842164459521
Jul 06 05:08:25 ip-10-0-5-31 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 06 05:08:25 ip-10-0-5-31 systemd[1]: patroni.service: Failed with result 'exit-code'.
Jul 06 05:08:25 ip-10-0-5-31 systemd[1]: patroni.service: Scheduled restart job, restart counter is at 2.
Jul 06 05:08:25 ip-10-0-5-31 systemd[1]: Stopped Runners to orchestrate a high-availability PostgreSQL - Patroni.
Jul 06 05:08:25 ip-10-0-5-31 systemd[1]: Starting Runners to orchestrate a high-availability PostgreSQL - Patroni...
Jul 06 05:08:25 ip-10-0-5-31 sudo[14550]: postgres : PWD=/ ; USER=root ; COMMAND=/sbin/modprobe softdog
Jul 06 05:08:25 ip-10-0-5-31 sudo[14550]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=115)
Jul 06 05:08:25 ip-10-0-5-31 sudo[14550]: pam_unix(sudo:session): session closed for user root
Jul 06 05:08:25 ip-10-0-5-31 sudo[14552]: postgres : PWD=/ ; USER=root ; COMMAND=/bin/chown postgres /dev/watchdog
Jul 06 05:08:25 ip-10-0-5-31 sudo[14552]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=115)
Jul 06 05:08:25 ip-10-0-5-31 sudo[14552]: pam_unix(sudo:session): session closed for user root
Jul 06 05:08:25 ip-10-0-5-31 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL - Patroni.
Jul 06 05:08:26 ip-10-0-5-31 patroni[14554]: 2023-07-06 05:08:26,307 INFO: Selected new etcd server http://10.0.4.57:2379
Jul 06 05:08:26 ip-10-0-5-31 patroni[14554]: 2023-07-06 05:08:26,317 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jul 06 05:08:26 ip-10-0-5-31 patroni[14554]: 2023-07-06 05:08:26,321 CRITICAL: system ID mismatch, node ip-10-0-5-31 belongs to a different cluster: 7252567923797276858 != 7252567842164459521
Jul 06 05:08:26 ip-10-0-5-31 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 06 05:08:26 ip-10-0-5-31 systemd[1]: patroni.service: Failed with result 'exit-code'.
Jul 06 05:08:27 ip-10-0-5-31 systemd[1]: patroni.service: Scheduled restart job, restart counter is at 3.
Jul 06 05:08:27 ip-10-0-5-31 systemd[1]: Stopped Runners to orchestrate a high-availability PostgreSQL - Patroni.
Jul 06 05:08:27 ip-10-0-5-31 systemd[1]: Starting Runners to orchestrate a high-availability PostgreSQL - Patroni...
Jul 06 05:08:27 ip-10-0-5-31 sudo[14571]: postgres : PWD=/ ; USER=root ; COMMAND=/sbin/modprobe softdog
Jul 06 05:08:27 ip-10-0-5-31 sudo[14571]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=115)
Jul 06 05:08:27 ip-10-0-5-31 sudo[14571]: pam_unix(sudo:session): session closed for user root
Jul 06 05:08:27 ip-10-0-5-31 sudo[14573]: postgres : PWD=/ ; USER=root ; COMMAND=/bin/chown postgres /dev/watchdog
Jul 06 05:08:27 ip-10-0-5-31 sudo[14573]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=115)
Jul 06 05:08:27 ip-10-0-5-31 sudo[14573]: pam_unix(sudo:session): session closed for user root
Jul 06 05:08:27 ip-10-0-5-31 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL - Patroni.
Jul 06 05:08:27 ip-10-0-5-31 patroni[14575]: 2023-07-06 05:08:27,799 INFO: Selected new etcd server http://10.0.5.31:2379
Jul 06 05:08:27 ip-10-0-5-31 patroni[14575]: 2023-07-06 05:08:27,808 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jul 06 05:08:27 ip-10-0-5-31 patroni[14575]: 2023-07-06 05:08:27,813 CRITICAL: system ID mismatch, node ip-10-0-5-31 belongs to a different cluster: 7252567923797276858 != 7252567842164459521
Jul 06 05:08:28 ip-10-0-5-31 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 06 05:08:28 ip-10-0-5-31 systemd[1]: patroni.service: Failed with result 'exit-code'.
Jul 06 05:08:28 ip-10-0-5-31 systemd[1]: patroni.service: Scheduled restart job, restart counter is at 4.
Jul 06 05:08:28 ip-10-0-5-31 systemd[1]: Stopped Runners to orchestrate a high-availability PostgreSQL - Patroni.
Jul 06 05:08:28 ip-10-0-5-31 systemd[1]: Starting Runners to orchestrate a high-availability PostgreSQL - Patroni...
Jul 06 05:08:28 ip-10-0-5-31 sudo[14583]: postgres : PWD=/ ; USER=root ; COMMAND=/sbin/modprobe softdog
Jul 06 05:08:28 ip-10-0-5-31 sudo[14583]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=115)
Jul 06 05:08:28 ip-10-0-5-31 sudo[14583]: pam_unix(sudo:session): session closed for user root
Jul 06 05:08:28 ip-10-0-5-31 sudo[14585]: postgres : PWD=/ ; USER=root ; COMMAND=/bin/chown postgres /dev/watchdog
Jul 06 05:08:28 ip-10-0-5-31 sudo[14585]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=115)
Jul 06 05:08:28 ip-10-0-5-31 sudo[14585]: pam_unix(sudo:session): session closed for user root
Jul 06 05:08:28 ip-10-0-5-31 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL - Patroni.
Jul 06 05:08:29 ip-10-0-5-31 patroni[14587]: 2023-07-06 05:08:29,299 INFO: Selected new etcd server http://10.0.5.31:2379
Jul 06 05:08:29 ip-10-0-5-31 patroni[14587]: 2023-07-06 05:08:29,308 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jul 06 05:08:29 ip-10-0-5-31 patroni[14587]: 2023-07-06 05:08:29,313 CRITICAL: system ID mismatch, node ip-10-0-5-31 belongs to a different cluster: 7252567923797276858 != 7252567842164459521
Jul 06 05:08:29 ip-10-0-5-31 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 06 05:08:29 ip-10-0-5-31 systemd[1]: patroni.service: Failed with result 'exit-code'.
Jul 06 05:08:30 ip-10-0-5-31 systemd[1]: patroni.service: Scheduled restart job, restart counter is at 5.
Jul 06 05:08:30 ip-10-0-5-31 systemd[1]: Stopped Runners to orchestrate a high-availability PostgreSQL - Patroni.
Jul 06 05:08:30 ip-10-0-5-31 systemd[1]: patroni.service: Start request repeated too quickly.
Jul 06 05:08:30 ip-10-0-5-31 systemd[1]: patroni.service: Failed with result 'exit-code'.
Jul 06 05:08:30 ip-10-0-5-31 systemd[1]: Failed to start Runners to orchestrate a high-availability PostgreSQL - Patroni.
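The CRITICAL lines above are the root cause: the replica's on-disk data directory carries a different "Database system identifier" than the cluster the leader registered in DCS, so Patroni exits before its REST API ever listens on port 8008 (hence the wait_for timeout). One way to confirm this on each node; the binary path and data directory below assume the Debian/Ubuntu PostgreSQL 15 layout, and the etcd key prefix assumes the default scope name, so adjust both to your setup:

```shell
# On-disk identifier of the replica's data directory
# (path assumes PostgreSQL 15 on Debian/Ubuntu):
sudo -u postgres /usr/lib/postgresql/15/bin/pg_controldata \
  /var/lib/postgresql/15/main | grep 'Database system identifier'

# Identifier the leader registered in etcd (key prefix = Patroni scope):
etcdctl get /service/postgres-cluster/initialize
```

If the two identifiers differ, the replica was initialized or restored from a different initdb; wiping its data directory and re-creating the replica (for example with `patronictl reinit postgres-cluster <member>`) is the usual remedy.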
hosts.ini
[etcd_cluster]
10.0.4.134 ansible_host=44.211.202.238
10.0.4.57 ansible_host=3.90.69.251
10.0.5.31 ansible_host=54.160.162.229
[balancers]
10.0.4.134 ansible_host=44.211.202.238
10.0.4.57 ansible_host=3.90.69.251
10.0.5.31 ansible_host=54.160.162.229
[pgbackrest]
10.0.4.134 ansible_host=44.211.202.238
10.0.4.57 ansible_host=3.90.69.251
10.0.5.31 ansible_host=54.160.162.229
[master]
10.0.4.134 ansible_host=44.211.202.238
[replica]
10.0.4.57 ansible_host=3.90.69.251
10.0.5.31 ansible_host=54.160.162.229
[postgres_cluster:children]
master
replica
[all:vars]
ansible_connection='ssh'
ansible_ssh_port='22'
ansible_ssh_user=ubuntu
ansible_ssh_private_key_file=./postgres-db-key
[pgbackrest:vars]
ansible_connection='ssh'
ansible_ssh_port='22'
ansible_ssh_user=ubuntu
ansible_ssh_private_key_file=./postgres-db-key

main.yml

---
# ---------------------------------------------------------------------
# Proxy variables (optional) for downloading packages via a proxy server
proxy_env: {}  # yamllint disable rule:braces
#  http_proxy: http://10.128.64.9:3128
#  https_proxy: http://10.128.64.9:3128
# ---------------------------------------------------------------------

# Cluster variables
cluster_vip: ""  # ip address for client access to databases in the cluster (optional)
vip_interface: "{{ ansible_default_ipv4.interface }}"  # interface name (ex. "ens32")

patroni_cluster_name: "postgres-cluster"  # the cluster name (must be unique for each cluster)
patroni_install_version: "latest"  # or specific version (example 1.5.6)

patroni_superuser_username: "postgres"
patroni_superuser_password: "postgres-pass"  # please change password
patroni_replication_username: "replicator"
patroni_replication_password: "replicator-pass"  # please change password

synchronous_mode: true  # or 'false' to disable synchronous database replication
synchronous_mode_strict: true  # if 'true', block all client writes to the master when no synchronous replica is available
synchronous_node_count: 1  # number of synchronous standby databases

# Load Balancing
with_haproxy_load_balancing: false  # or 'true' to install and configure load balancing
haproxy_listen_port:
  master: 5000
  replicas: 5001
  replicas_sync: 5002
  replicas_async: 5003
  stats: 7000
haproxy_maxconn:
  global: 100000
  master: 10000
  replica: 10000
haproxy_timeout:
  client: "60m"
  server: "60m"

# vip-manager (if cluster_vip is specified and with_haproxy_load_balancing: false)
vip_manager_version: "2.1.0"  # version to install
vip_manager_conf: "/etc/patroni/vip-manager.yml"
vip_manager_interval: "1000"  # time (in milliseconds) after which vip-manager wakes up and checks if it needs to register or release ip addresses.
vip_manager_iface: "{{ vip_interface }}"  # interface to which the virtual ip will be added
vip_manager_ip: "{{ cluster_vip }}"  # the virtual ip address to manage
vip_manager_mask: "24"  # netmask for the virtual ip

# DCS (Distributed Consensus Store)
dcs_exists: false  # or 'true' if you don't want to deploy a new etcd cluster
dcs_type: "etcd"  # or 'consul'

# if dcs_type: "etcd" and dcs_exists: false
etcd_version: "3.5.9"  # version for deploy etcd cluster
etcd_data_dir: "/var/lib/etcd"
etcd_cluster_name: "etcd-{{ patroni_cluster_name }}"  # ETCD_INITIAL_CLUSTER_TOKEN

# if dcs_type: "etcd" and dcs_exists: true
patroni_etcd_hosts: []  # list of servers of an existing etcd cluster
#  - { host: "10.128.64.140", port: "2379" }
#  - { host: "10.128.64.142", port: "2379" }
#  - { host: "10.128.64.143", port: "2379" }

# more options you can specify in the roles/patroni/templates/patroni.yml.j2
# https://patroni.readthedocs.io/en/latest/SETTINGS.html#etcd
# https://patroni.readthedocs.io/en/latest/SETTINGS.html#consul

# if dcs_type: "consul"
consul_version: "1.14.7"
consul_config_path: "/etc/consul"
consul_configd_path: "{{ consul_config_path }}/conf.d"
consul_data_path: "/var/lib/consul"
consul_domain: "consul"  # Consul domain name
consul_datacenter: "dc1"  # Datacenter label (can be specified for each host in the inventory)
consul_disable_update_check: true  # Disables automatic checking for security bulletins and new version releases
consul_enable_script_checks: true  # This controls whether health checks that execute scripts are enabled on this agent
consul_enable_local_script_checks: true  # Enable them when they are defined in the local configuration files
consul_ui: false  # Enable the consul UI?
consul_syslog_enable: true  # Enable logging to syslog
consul_iface: "{{ ansible_default_ipv4.interface }}"  # specify the interface name with a Private IP (ex. "enp7s0")
# TLS
# You can enable TLS encryption by dropping a CA certificate, server certificate, and server key in roles/consul/files/
consul_tls_enable: false
consul_tls_ca_crt: "ca.crt"
consul_tls_server_crt: "server.crt"
consul_tls_server_key: "server.key"
# DNS
consul_recursors: []  # List of upstream DNS servers
consul_dnsmasq_enable: true  # Enable DNS forwarding with Dnsmasq
consul_dnsmasq_cache: 0  # dnsmasq cache-size (0 - disable caching)
consul_dnsmasq_servers:  # Upstream DNS servers used by dnsmasq
  - "8.8.8.8"
  - "9.9.9.9"
consul_join: []  # List of LAN servers of an existing consul cluster, to join.
# - "10.128.64.140"
# - "10.128.64.142"
# - "10.128.64.143"

# https://developer.hashicorp.com/consul/docs/discovery/services
consul_services:
  - name: "{{ patroni_cluster_name }}"
    id: "{{ patroni_cluster_name }}-master"
    tags: ['master']
    port: "{{ pgbouncer_listen_port }}"  # or "{{ postgresql_port }}" if pgbouncer_install: false
    checks:
      - { http: "http://{{ inventory_hostname }}:{{ patroni_restapi_port }}/master", interval: "2s" }
      - { args: ["systemctl", "status", "pgbouncer"], interval: "5s" }  # comment out this check if pgbouncer_install: false
  - name: "{{ patroni_cluster_name }}"
    id: "{{ patroni_cluster_name }}-replica"
    tags: ['replica']
    port: "{{ pgbouncer_listen_port }}"
    checks:
      - { http: "http://{{ inventory_hostname }}:{{ patroni_restapi_port }}/replica", interval: "2s" }
      - { args: ["systemctl", "status", "pgbouncer"], interval: "5s" }
#  - name: "{{ patroni_cluster_name }}"
#    id: "{{ patroni_cluster_name }}-sync-replica"
#    tags: ['sync-replica']
#    port: "{{ pgbouncer_listen_port }}"
#    checks:
#      - { http: "http://{{ inventory_hostname }}:{{ patroni_restapi_port }}/sync", interval: "2s" }
#      - { args: ["systemctl", "status", "pgbouncer"], interval: "5s" }
#  - name: "{{ patroni_cluster_name }}"
#    id: "{{ patroni_cluster_name }}-async-replica"
#    tags: ['async-replica']
#    port: "{{ pgbouncer_listen_port }}"
#    checks:
#      - { http: "http://{{ inventory_hostname }}:{{ patroni_restapi_port }}/async", interval: "2s" }
#      - { args: ["systemctl", "status", "pgbouncer"], interval: "5s" }

# PostgreSQL variables
postgresql_version: "15"
# postgresql_data_dir: see vars/Debian.yml or vars/RedHat.yml
postgresql_port: "5432"
postgresql_encoding: "UTF8"  # for bootstrap only (initdb)
postgresql_locale: "en_US.UTF-8"  # for bootstrap only (initdb)
postgresql_data_checksums: true  # for bootstrap only (initdb)
postgresql_password_encryption_algorithm: "scram-sha-256"  # or "md5" if your clients do not work with passwords encrypted with SCRAM-SHA-256

# (optional) list of users to be created (if they do not already exist)
postgresql_users: []
#  - { name: "mydb-user", password: "mydb-user-pass", flags: "SUPERUSER" }
#  - { name: "", password: "", flags: "NOSUPERUSER" }
#  - { name: "", password: "", flags: "NOSUPERUSER" }
#  - { name: "", password: "", flags: "NOLOGIN" }

# (optional) list of databases to be created (if they do not already exist)
postgresql_databases: []
#  - { db: "mydatabase", encoding: "UTF8", lc_collate: "ru_RU.UTF-8", lc_ctype: "ru_RU.UTF-8", owner: "mydb-user" }
#  - { db: "mydatabase2", encoding: "UTF8", lc_collate: "ru_RU.UTF-8", lc_ctype: "ru_RU.UTF-8", owner: "mydb-user", conn_limit: "50" }
#  - { db: "", encoding: "UTF8", lc_collate: "en_US.UTF-8", lc_ctype: "en_US.UTF-8", owner: "" }
#  - { db: "", encoding: "UTF8", lc_collate: "en_US.UTF-8", lc_ctype: "en_US.UTF-8", owner: "" }

# (optional) list of schemas to be created (if they do not already exist)
postgresql_schemas: []
#  - { schema: "myschema", db: "mydatabase", owner: "mydb-user" }

# (optional) list of database extensions to be created (if they do not already exist)
postgresql_extensions: []
#  - { ext: "pg_stat_statements", db: "postgres" }
#  - { ext: "pg_stat_statements", db: "mydatabase" }
#  - { ext: "pg_stat_statements", db: "mydatabase", schema: "myschema" }
#  - { ext: "pg_stat_statements", db: "" }
#  - { ext: "pg_stat_statements", db: "" }
#  - { ext: "pg_repack", db: "" }  # postgresql-<version>-repack package is required
#  - { ext: "pg_stat_kcache", db: "" }  # postgresql-<version>-pg-stat-kcache package is required
#  - { ext: "", db: "" }
#  - { ext: "", db: "" }

# PostgreSQL parameters for the DCS bootstrap (example values; adjust for your workload)
postgresql_parameters:
  - { option: "max_connections", value: "500" }
  - { option: "superuser_reserved_connections", value: "5" }
  - { option: "password_encryption", value: "{{ postgresql_password_encryption_algorithm }}" }
  - { option: "max_locks_per_transaction", value: "512" }
  - { option: "max_prepared_transactions", value: "0" }
  - { option: "huge_pages", value: "try" }  # or "on" if you set "vm_nr_hugepages" in kernel parameters
  - { option: "shared_buffers", value: "{{ (ansible_memtotal_mb * 0.25)|int }}MB" }  # by default, 25% of RAM
  - { option: "effective_cache_size", value: "{{ (ansible_memtotal_mb * 0.75)|int }}MB" }  # by default, 75% of RAM
  - { option: "work_mem", value: "128MB" }  # please change this value
  - { option: "maintenance_work_mem", value: "256MB" }  # please change this value
  - { option: "checkpoint_timeout", value: "15min" }
  - { option: "checkpoint_completion_target", value: "0.9" }
  - { option: "min_wal_size", value: "2GB" }
  - { option: "max_wal_size", value: "8GB" }  # or 16GB/32GB
  - { option: "wal_buffers", value: "32MB" }
  - { option: "default_statistics_target", value: "1000" }
  - { option: "seq_page_cost", value: "1" }
  - { option: "random_page_cost", value: "4" }  # "1.1" for SSD storage, or if your database fits in shared_buffers
  - { option: "effective_io_concurrency", value: "2" }  # "200" for SSD storage
  - { option: "synchronous_commit", value: "on" }  # or 'off' if you can afford to lose single transactions in case of a crash
  - { option: "autovacuum", value: "on" }  # never turn off the autovacuum!
  - { option: "autovacuum_max_workers", value: "5" }
  - { option: "autovacuum_vacuum_scale_factor", value: "0.01" }  # or 0.005/0.001
  - { option: "autovacuum_analyze_scale_factor", value: "0.01" }
  - { option: "autovacuum_vacuum_cost_limit", value: "500" }  # or 1000/5000
  - { option: "autovacuum_vacuum_cost_delay", value: "2" }
  - { option: "autovacuum_naptime", value: "1s" }
  - { option: "max_files_per_process", value: "4096" }
  - { option: "archive_mode", value: "on" }
  - { option: "archive_timeout", value: "1800s" }
  # - { option: "archive_command", value: "cd ." }  # does nothing with the WALs yet
  # - { option: "archive_command", value: "{{ wal_g_archive_command }}" }  # archive WAL-s using WAL-G
  - { option: "archive_command", value: "{{ pgbackrest_archive_command }}" }  # archive WAL-s using pgbackrest
  - { option: "wal_level", value: "replica" }
  - { option: "wal_keep_size", value: "2GB" }
  - { option: "max_wal_senders", value: "10" }
  - { option: "max_replication_slots", value: "10" }
  - { option: "hot_standby", value: "on" }
  - { option: "wal_log_hints", value: "on" }
  - { option: "wal_compression", value: "on" }
  - { option: "shared_preload_libraries", value: "pg_stat_statements,auto_explain" }
  - { option: "pg_stat_statements.max", value: "10000" }
  - { option: "pg_stat_statements.track", value: "all" }
  - { option: "pg_stat_statements.track_utility", value: "false" }
  - { option: "pg_stat_statements.save", value: "true" }
  - { option: "auto_explain.log_min_duration", value: "10s" }  # enable auto_explain for 10-second logging threshold. Decrease this value if necessary
  - { option: "auto_explain.log_analyze", value: "true" }
  - { option: "auto_explain.log_buffers", value: "true" }
  - { option: "auto_explain.log_timing", value: "false" }
  - { option: "auto_explain.log_triggers", value: "true" }
  - { option: "auto_explain.log_verbose", value: "true" }
  - { option: "auto_explain.log_nested_statements", value: "true" }
  - { option: "auto_explain.sample_rate", value: "0.01" }  # enable auto_explain for 1% of queries logging threshold
  - { option: "track_io_timing", value: "on" }
  - { option: "log_lock_waits", value: "on" }
  - { option: "log_temp_files", value: "0" }
  - { option: "track_activities", value: "on" }
  - { option: "track_activity_query_size", value: "4096" }
  - { option: "track_counts", value: "on" }
  - { option: "track_functions", value: "all" }
  - { option: "log_checkpoints", value: "on" }
  - { option: "logging_collector", value: "on" }
  - { option: "log_truncate_on_rotation", value: "on" }
  - { option: "log_rotation_age", value: "1d" }
  - { option: "log_rotation_size", value: "0" }
  - { option: "log_line_prefix", value: "'%t [%p-%l] %r %q%u@%d '" }
  - { option: "log_filename", value: "postgresql-%a.log" }
  - { option: "log_directory", value: "{{ postgresql_log_dir }}" }
  - { option: "hot_standby_feedback", value: "on" }  # allows feedback from a hot standby to the primary that will avoid query conflicts
  - { option: "max_standby_streaming_delay", value: "30s" }
  - { option: "wal_receiver_status_interval", value: "10s" }
  - { option: "idle_in_transaction_session_timeout", value: "10min" }  # reduce this timeout if possible
  - { option: "jit", value: "off" }
  - { option: "max_worker_processes", value: "24" }
  - { option: "max_parallel_workers", value: "8" }
  - { option: "max_parallel_workers_per_gather", value: "2" }
  - { option: "max_parallel_maintenance_workers", value: "2" }
  - { option: "tcp_keepalives_count", value: "10" }
  - { option: "tcp_keepalives_idle", value: "300" }
  - { option: "tcp_keepalives_interval", value: "30" }
#  - { option: "old_snapshot_threshold", value: "60min" }
#  - { option: "", value: "" }
#  - { option: "", value: "" }

# Set this variable to 'true' if you want the cluster to be automatically restarted
# after changing the 'postgresql_parameters' variable that requires a restart in the 'config_pgcluster.yml' playbook.
# By default, the cluster will not be automatically restarted.
pending_restart: false

# specify additional hosts that will be added to the pg_hba.conf
postgresql_pg_hba:
  - { type: "local", database: "all", user: "{{ patroni_superuser_username }}", address: "", method: "trust" }
  - { type: "local", database: "replication", user: "{{ patroni_superuser_username }}", address: "", method: "trust" }
  - { type: "local", database: "all", user: "all", address: "", method: "peer" }
  - { type: "host", database: "all", user: "all", address: "127.0.0.1/32", method: "{{ postgresql_password_encryption_algorithm }}" }
  - { type: "host", database: "all", user: "all", address: "::1/128", method: "{{ postgresql_password_encryption_algorithm }}" }
#  - { type: "host", database: "mydatabase", user: "mydb-user", address: "192.168.0.0/24", method: "{{ postgresql_password_encryption_algorithm }}" }
#  - { type: "host", database: "all", user: "all", address: "192.168.0.0/24", method: "ident", options: "map=main" }  # use pg_ident

# list of lines that Patroni will use to generate pg_ident.conf
postgresql_pg_ident: []
#  - { mapname: "main", system_username: "postgres", pg_username: "backup" }
#  - { mapname: "", system_username: "", pg_username: "" }

# the password file (~/.pgpass)
postgresql_pgpass:
  - "localhost:{{ postgresql_port }}:*:{{ patroni_superuser_username }}:{{ patroni_superuser_password }}"
  - "{{ inventory_hostname }}:{{ postgresql_port }}:*:{{ patroni_superuser_username }}:{{ patroni_superuser_password }}"
#  - hostname:port:database:username:password

# PgBouncer parameters
pgbouncer_install: true  # or 'false' if you do not want to install and configure the pgbouncer service
pgbouncer_conf_dir: "/etc/pgbouncer"
pgbouncer_log_dir: "/var/log/pgbouncer"
pgbouncer_listen_addr: "0.0.0.0"
pgbouncer_listen_port: 6432
pgbouncer_max_client_conn: 10000
pgbouncer_max_db_connections: 1000
pgbouncer_default_pool_size: 20
pgbouncer_query_wait_timeout: 120
pgbouncer_default_pool_mode: "session"
pgbouncer_admin_users: "postgres"  # comma-separated list of users, who are allowed to change settings
pgbouncer_stats_users: "postgres"  # comma-separated list of users who are just allowed to use SHOW command
pgbouncer_generate_userlist: true  # generate the authentication file (userlist.txt) from the pg_shadow system table
pgbouncer_auth_type: "{{ postgresql_password_encryption_algorithm }}"
pgbouncer_ignore_startup_parameters: "extra_float_digits,geqo,search_path"

pgbouncer_pools:
  - { name: "postgres", dbname: "postgres", pool_parameters: "" }
#  - { name: "mydatabase", dbname: "mydatabase", pool_parameters: "pool_size=20 pool_mode=transaction" }
#  - { name: "", dbname: "", pool_parameters: "" }
#  - { name: "", dbname: "", pool_parameters: "" }

# Extended variables (optional)
patroni_restapi_port: 8008
patroni_ttl: 30
patroni_loop_wait: 10
patroni_retry_timeout: 10
patroni_maximum_lag_on_failover: 1048576
patroni_master_start_timeout: 300

# https://patroni.readthedocs.io/en/latest/SETTINGS.html?highlight=callbacks#dynamic-configuration-settings
patroni_callbacks: []
#  - {action: "on_role_change", script: ""}
#  - {action: "on_stop", script: ""}
#  - {action: "on_restart", script: ""}
#  - {action: "on_reload", script: ""}
#  - {action: "on_role_change", script: ""}

# https://patroni.readthedocs.io/en/latest/replica_bootstrap.html#standby-cluster
# Requirements:
# 1. the cluster name for Standby Cluster must be unique ('patroni_cluster_name' variable)
# 2. the IP addresses (or network) of the Standby Cluster servers must be added to the pg_hba.conf of the Main Cluster ('postgresql_pg_hba' variable).
patroni_standby_cluster:
  host: ""  # an address of remote master
  port: "5432"  # a port of remote master
#  primary_slot_name: ""  # which slot on the remote master to use for replication (optional)
#  restore_command: ""  # command to restore WAL records from the remote master to standby leader (optional)
#  recovery_min_apply_delay: ""  # how long to wait before actually apply WAL records on a standby leader (optional)

patroni_log_destination: stderr  # or 'logfile'
# if patroni_log_destination: logfile
patroni_log_dir: /var/log/patroni
patroni_log_level: info
patroni_log_traceback_level: error
patroni_log_format: "%(asctime)s %(levelname)s: %(message)s"
patroni_log_dateformat: ""
patroni_log_max_queue_size: 1000
patroni_log_file_num: 4
patroni_log_file_size: 25000000  # bytes
patroni_log_loggers_patroni_postmaster: warning
patroni_log_loggers_urllib3: warning  # or 'debug'

patroni_watchdog_mode: automatic  # or 'off', 'required'
patroni_watchdog_device: /dev/watchdog

patroni_postgresql_use_pg_rewind: false  # or 'true'
# try to use pg_rewind on the former leader when it rejoins the cluster as a replica.

patroni_remove_data_directory_on_rewind_failure: false  # or 'true' (if use_pg_rewind: 'true')
# if 'false', avoid removing the data directory on an unsuccessful rewind;
# if 'true', Patroni will remove the PostgreSQL data directory and recreate the replica.

patroni_remove_data_directory_on_diverged_timelines: false  # or 'true'
# if 'true', Patroni will remove the PostgreSQL data directory and recreate the replica
# if it notices that timelines are diverging and the former master cannot start streaming from the new master.

# https://patroni.readthedocs.io/en/latest/replica_bootstrap.html#bootstrap
patroni_cluster_bootstrap_method: "initdb"  # or "wal-g", "pgbackrest", "pg_probackup"

# https://patroni.readthedocs.io/en/latest/replica_bootstrap.html#building-replicas
patroni_create_replica_methods:
  - pgbackrest
  # - wal_g
  # - pg_probackup
  - basebackup

pgbackrest:
  - { option: "command", value: "/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --delta restore" }
  - { option: "keep_data", value: "True" }
  - { option: "no_params", value: "True" }
wal_g:
  - { option: "command", value: "wal-g backup-fetch {{ postgresql_data_dir }} LATEST" }
  - { option: "no_params", value: "True" }
basebackup:
  - { option: "max-rate", value: "100M" }
  - { option: "checkpoint", value: "fast" }
pg_probackup:
  - { option: "command", value: "pg_probackup-{{ pg_probackup_version }} restore -B {{ pg_probackup_dir }} --instance {{ pg_probackup_instance }} -j {{ pg_probackup_threads }} {{ pg_probackup_add_keys }}" }
  - { option: "no_params", value: "true" }

# "restore_command" written to recovery.conf when configuring follower (create replica)
postgresql_restore_command: ""
# postgresql_restore_command: "wal-g wal-fetch %f %p"  # restore WAL-s using WAL-G
# postgresql_restore_command: "pgbackrest --stanza={{ pgbackrest_stanza }} archive-get %f %p"  # restore WAL-s using pgbackrest
# postgresql_restore_command: "pg_probackup-{{ pg_probackup_version }} archive-get -B {{ pg_probackup_dir }} --instance {{ pg_probackup_instance }} --wal-file-path=%p --wal-file-name=%f"  # restore WAL-s using pg_probackup

# pg_probackup
pg_probackup_install: false  # or 'true'
pg_probackup_install_from_postgrespro_repo: true  # or 'false'
pg_probackup_version: "{{ postgresql_version }}"
pg_probackup_instance: "pg_probackup_instance_name"
pg_probackup_dir: "/mnt/backup_dir"
pg_probackup_threads: "4"
pg_probackup_add_keys: "--recovery-target=latest --skip-external-dirs --no-validate"
pg_probackup_patroni_cluster_bootstrap_command: "pg_probackup-{{ pg_probackup_version }} restore -B {{ pg_probackup_dir }} --instance {{ pg_probackup_instance }} -j {{ pg_probackup_threads }} {{ pg_probackup_add_keys }}"

# WAL-G
wal_g_install: false  # or 'true'
wal_g_version: "2.0.1"
wal_g_json:  # config https://github.com/wal-g/wal-g#configuration
  - { option: "AWS_ACCESS_KEY_ID", value: "{{ AWS_ACCESS_KEY_ID | default('') }}" }  # define values or pass via --extra-vars
  - { option: "AWS_SECRET_ACCESS_KEY", value: "{{ AWS_SECRET_ACCESS_KEY | default('') }}" }  # define values or pass via --extra-vars
  - { option: "WALG_S3_PREFIX", value: "{{ WALG_S3_PREFIX | default('') }}" } # define values or pass via --extra-vars
  - { option: "WALG_COMPRESSION_METHOD", value: "brotli" }  # or "lz4", "lzma", "zstd"
  - { option: "PGDATA", value: "{{ postgresql_data_dir }}" }
  - { option: "PGHOST", value: "localhost" }
  - { option: "PGPORT", value: "{{ postgresql_port }}" }
  - { option: "PGUSER", value: "{{ patroni_superuser_username }}" }
#  - { option: "AWS_S3_FORCE_PATH_STYLE", value: "true" }  # to use Minio.io S3-compatible storage
#  - { option: "AWS_ENDPOINT", value: "http://minio:9000" }  # to use Minio.io S3-compatible storage
#  - { option: "", value: "" }
wal_g_archive_command: "wal-g wal-push %p {{ postgresql_data_dir }}"
wal_g_patroni_cluster_bootstrap_command: "wal-g backup-fetch {{ postgresql_data_dir }} LATEST"

wal_g_cron_jobs:
  - name: "WAL-G: Create daily backup"
    user: "postgres"
    file: /etc/cron.d/walg
    minute: "30"
    hour: "3"
    day: "*"
    month: "*"
    weekday: "*"
    job: "[ $(curl -s -o /dev/null -w '%{http_code}' http://{{ inventory_hostname }}:{{ patroni_restapi_port }}) = '200' ] && wal-g backup-push {{ postgresql_data_dir }}"
  - name: "WAL-G: Delete old backups" # older than 30 days (by default). Change according to your company's backup retention policy.
    user: "postgres"
    file: /etc/cron.d/walg
    minute: "30"
    hour: "6"
    day: "*"
    month: "*"
    weekday: "*"
    job: "[ $(curl -s -o /dev/null -w '%{http_code}' http://{{ inventory_hostname }}:{{ patroni_restapi_port }}) = '200' ] && wal-g delete before FIND_FULL $(date -d '-30 days' '+%FT%TZ') --confirm"
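Each job above is written into `/etc/cron.d/walg` using the system crontab format: five schedule fields, then the user, then the command. A minimal Python sketch of that rendering (the `render_cron` helper and the example data directory are hypothetical, for illustration only):

```python
def render_cron(job: dict) -> str:
    """Render a cron job dict into a /etc/cron.d line (sketch, not the Ansible cron module)."""
    schedule = " ".join(job[k] for k in ("minute", "hour", "day", "month", "weekday"))
    return f"{schedule} {job['user']} {job['job']}"

# Hypothetical example values; the real job also includes the curl leader check.
backup_job = {
    "minute": "30", "hour": "3", "day": "*", "month": "*", "weekday": "*",
    "user": "postgres",
    "job": "wal-g backup-push /var/lib/postgresql/15/main",
}
print(render_cron(backup_job))
# 30 3 * * * postgres wal-g backup-push /var/lib/postgresql/15/main
```

The `curl` guard in the real jobs makes the backup run only on the node whose Patroni REST API returns HTTP 200, i.e. the current leader, so replicas skip it.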

# pgBackRest
pgbackrest_install: true  # or 'false'
pgbackrest_install_from_pgdg_repo: true  # or 'false'
pgbackrest_stanza: "{{ patroni_cluster_name }}"  # specify your --stanza
pgbackrest_repo_type: "s3"  # or "posix", "gcs", "azure"
pgbackrest_repo_host: ""  # dedicated repository host (optional)
pgbackrest_repo_user: "postgres"
pgbackrest_conf_file: "/etc/pgbackrest/pgbackrest.conf"
# see more options https://pgbackrest.org/configuration.html
pgbackrest_conf:
  global:  # [global] section
    - {option: "repo1-type", value: "{{ pgbackrest_repo_type |lower }}"}
    - {option: "repo1-path", value: "{{postgresql_data_dir}}"}
    - {option: "repo1-s3-endpoint", value: "https://s3.amazonaws.com"}
    - {option: "repo1-s3-uri-style", value: "{{ WALG_S3_PREFIX | default('') }}"}
    - {option: "repo1-s3-bucket", value: "soa-backup"}
    - {option: "repo1-s3-verify-tls", value: "n"}
    - {option: "repo1-s3-key", value: "{{ AWS_ACCESS_KEY_ID | default('') }}"}
    - {option: "repo1-s3-key-secret", value: "{{ AWS_SECRET_ACCESS_KEY | default('') }}"}
    - {option: "repo1-s3-region", value: "us-east-1"}

    - {option: "repo1-retention-full", value: "4"}
    - {option: "repo1-retention-archive", value: "2"}
    - {option: "repo1-retention-archive-type", value: "full"}
    - {option: "start-fast", value: "y"}
    - {option: "delta", value: "y"}
    - {option: "archive-check", value: "y"}
    - {option: "archive-copy", value: "y"}
    - {option: "archive-timeout", value: "180"}
    - {option: "stop-auto", value: "y"}
    - {option: "resume", value: "n"}
    - {option: "link-all", value: "y"}
    - {option: "compress-type", value: "lz4"}
    - {option: "log-level-file", value: "detail"}
    - {option: "log-path", value: "/var/log/pgbackrest"}
    - {option: "process-max", value: "2"}

#    - { option: "", value: "" }
  stanza:  # [stanza_name] section
    - { option: "pg1-path", value: "{{ postgresql_data_dir }}" }
    - { option: "recovery-option", value: "recovery_target_action=promote" }
    - { option: "log-level-console", value: "info" }
    - { option: "process-max", value: "6" }
#    - { option: "", value: "" }
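The `pgbackrest_conf` structure above is templated into an INI-style `pgbackrest.conf`, with the `stanza` key becoming the `[<stanza_name>]` section. A simplified Python sketch of that mapping (the `to_pgbackrest_conf` helper is hypothetical, not part of the role, and the sample values are illustrative):

```python
def to_pgbackrest_conf(conf: dict, stanza: str) -> str:
    """Render {'global': [...], 'stanza': [...]} option lists into INI-style text."""
    lines = []
    for section, options in conf.items():
        # the 'stanza' key becomes the [<stanza_name>] section
        lines.append(f"[{stanza if section == 'stanza' else section}]")
        for opt in options:
            lines.append(f"{opt['option']}={opt['value']}")
        lines.append("")
    return "\n".join(lines)

# Illustrative subset of the variables above (paths/values assumed):
conf = {
    "global": [
        {"option": "repo1-type", "value": "s3"},
        {"option": "process-max", "value": "2"},
    ],
    "stanza": [
        {"option": "pg1-path", "value": "/var/lib/postgresql/15/main"},
    ],
}
print(to_pgbackrest_conf(conf, "postgres-cluster"))
```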
# (optional) dedicated backup server config (if "repo_host" is set)
pgbackrest_server_conf:
  global:
    - { option: "log-level-file", value: "detail" }
    - { option: "log-level-console", value: "info" }
    - { option: "log-path", value: "/var/log/pgbackrest" }
    - { option: "repo1-type", value: "{{ pgbackrest_repo_type |lower }}" }
    - { option: "repo1-path", value: "/var/lib/pgbackrest" }
    - { option: "repo1-retention-full", value: "4" }
    - { option: "repo1-retention-archive", value: "4" }
    - { option: "repo1-bundle", value: "y" }
    - { option: "repo1-block", value: "y" }
    - { option: "start-fast", value: "y" }
    - { option: "stop-auto", value: "y" }
    - { option: "resume", value: "n" }
    - { option: "link-all", value: "y" }
    - { option: "archive-check", value: "y" }
    - { option: "archive-copy", value: "n" }
    - { option: "backup-standby", value: "y" }
#    - { option: "", value: "" }
# the stanza section will be generated automatically

pgbackrest_archive_command: "pgbackrest --stanza={{ pgbackrest_stanza }} archive-push %p"

pgbackrest_patroni_cluster_restore_command:
  '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --delta restore'  # restore from latest backup
#  '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --type=time "--target=2020-06-01 11:00:00+03" --delta restore'  # Point-in-Time Recovery (example)

# By default, the cron jobs are created on the database server.
# If 'repo_host' is defined, the cron jobs will be created on the pgbackrest server.
pgbackrest_cron_jobs:
  - name: "pgBackRest: Full Backup"
    file: /etc/cron.d/pgbackrest
    user: "postgres"
    minute: "30"
    hour: "6"
    day: "*"
    month: "*"
    weekday: "0"
    job: "pgbackrest --type=full --stanza={{ pgbackrest_stanza }} backup"
    # job: "if [ $(psql -tAXc 'select pg_is_in_recovery()') = 'f' ]; then pgbackrest --type=full --stanza={{ pgbackrest_stanza }} backup; fi"
  - name: "pgBackRest: Diff Backup"
    file: /etc/cron.d/pgbackrest
    user: "postgres"
    minute: "30"
    hour: "6"
    day: "*"
    month: "*"
    weekday: "1-6"
    job: "pgbackrest --type=diff --stanza={{ pgbackrest_stanza }} backup"
    # job: "if [ $(psql -tAXc 'select pg_is_in_recovery()') = 'f' ]; then pgbackrest --type=diff --stanza={{ pgbackrest_stanza }} backup; fi"

# PITR mode (if patroni_cluster_bootstrap_method: "pgbackrest" or "wal-g"):
# 1) The database cluster directory will be cleaned (for "wal-g") or overwritten (for "pgbackrest" --delta restore).
# 2) The Patroni cluster "{{ patroni_cluster_name }}" will also be removed from the DCS (if it exists) before recovery.

disable_archive_command: true  # or 'false' to keep archive_command enabled after restore
keep_patroni_dynamic_json: true  # or 'false' to remove patroni.dynamic.json after restore (if it exists)

# Netdata - https://github.com/netdata/netdata
netdata_install: false  # or 'true' to install Netdata on the PostgreSQL cluster nodes (with kickstart.sh)
netdata_install_options: "--stable-channel --disable-telemetry --dont-wait"
netdata_conf:
  web_bind_to: "*"
  # https://learn.netdata.cloud/docs/store/change-metrics-storage
  memory_mode: "dbengine"  # The long-term metrics storage with efficient RAM and disk usage.
  page_cache_size: 64  # Determines the amount of RAM in MiB that is dedicated to caching Netdata metric values.
  dbengine_disk_space: 1024  # Determines the amount of disk space in MiB that is dedicated to storing Netdata metric values.

...

I use Ubuntu 22.04.

vitabaks commented 10 months ago

CRITICAL: system ID mismatch, node ip-10-0-5-31 belongs to a different cluster: 7252567923797276858 != 7252567842164459521

It looks like you already have a cluster named "postgres-cluster" with a different system ID, which means the servers you are trying to add contain data from another Postgres instance.

How did this happen? Have you pre-installed Postgres on these servers manually?

To solve this problem, delete the data directory on the replicas, ONLY IF YOU ARE SURE THAT THIS DATA IS NOT BEING USED (for example, on a test server). Then run the playbook again, but first set the variable `postgresql_exists=true` for the master host, since judging by the logs the database is already deployed and running on the master.
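For reference, the system identifier Patroni compares comes from `pg_controldata` (it is recorded in the DCS initialize key when the cluster is bootstrapped). A small sketch of extracting and comparing it in Python (the sample lines are illustrative excerpts of `pg_controldata` output, using the IDs from the log above):

```python
def system_identifier(controldata_output: str) -> str:
    """Extract 'Database system identifier' from pg_controldata output (sketch)."""
    for line in controldata_output.splitlines():
        if line.startswith("Database system identifier:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("system identifier not found")

# Illustrative pg_controldata excerpts from two nodes (IDs from the error above):
master = "Database system identifier:           7252567842164459521"
replica = "Database system identifier:           7252567923797276858"

if system_identifier(master) != system_identifier(replica):
    print("system ID mismatch: node belongs to a different cluster")
```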

fatmaAliGamal commented 10 months ago

These are new servers. I just ran `ansible-playbook deploy_pgcluster.yml -e "enable_timescale=true"` to deploy to the servers; nothing was installed manually.

I also repeated this test on brand-new EC2 servers, but the same error appears every time.

vitabaks commented 10 months ago

Daily tests are successful, and I can't reproduce the problem with different system IDs.

Please attach an archive of your playbook and I will check it.

fatmaAliGamal commented 10 months ago

@vitabaks postgresql_cluster_2.tar.xz https://drive.google.com/file/d/1Ek7klt1ZNsvkYESZOaClRM4TJprl1gdP/view?usp=drive_web I use Ubuntu 22.04 with instance_type = "t2.xlarge".

I ran `ansible-playbook deploy_pgcluster.yml -e "enable_timescale=true" -e "AWS_ACCESS_KEY_ID=*" -e "AWS_SECRET_ACCESS_KEY=*" -e "WALG_S3_PREFIX=s3://soa-backup/soabackup/"` on new servers, but the same error appears.


vitabaks commented 9 months ago

Is this problem still relevant to you?

fatmaAliGamal commented 9 months ago

I will check it today and reply.

vitabaks commented 9 months ago

@fatmaAliGamal please update the status of this issue.