vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.
MIT License
1.29k stars 352 forks source link

Method : 2 Dc 2 Patroni Cluster among Standby Cluster #253

Closed prensgold closed 1 year ago

prensgold commented 1 year ago

Hi Vitaliy,

is the following architecture possible? if possible , How can I do that with your playbook?

DC1 has DC1node1 (Patroni + etcd leader + PostgreSQL master), DC1node2 (Patroni + etcd follower + PostgreSQL replica) and DC1node3 (Patroni + etcd follower + PostgreSQL replica)

standby cluster among them.

DC2 has DC2node1 (Patroni + etcd follower + PostgreSQL replica) DC2node2 (Patroni + etcd follower + PostgreSQL replica) and DC2node2 (Patroni + etcd follower + PostgreSQL replica) (6h delay replica server)

Thanks

vitabaks commented 1 year ago

Yes

vitabaks commented 1 year ago

Hi @prensgold

See patroni_standby_cluster variable https://github.com/vitabaks/postgresql_cluster/blob/master/vars/main.yml#L299

vitabaks commented 1 year ago

(6h delay replica server)

after cluster deployment, add recovery_min_apply_delay parameter for this server.

alter system set recovery_min_apply_delay='6h';
select pg_reload_conf();

https://postgresqlco.nf/doc/en/param/recovery_min_apply_delay/

prensgold commented 1 year ago

I can only write one ip in the following parameter, right? For DC2node1,

patroni_standby_cluster: host: "" # an address of remote master port: "5432" # a port of remote master

so how do I create other standby cluster replicas in DC2? What kind of definition should I configure for DC2node2 and DC2node3 in the your playbook?

vitabaks commented 1 year ago

patroni_standby_cluster.host - any server from the main cluster

so how do I create other standby cluster replicas in DC2?

They will be created automatically replica servers are defined in the group [replica] in the inventory file.

vitabaks commented 1 year ago

How to deploy the Standby cluster:

  1. Deploy a Main Patroni Cluster (if it doesn't already exist)
# 0.Install Ansible on one control node (which could easily be a laptop)
sudo apt update && sudo apt install -y ansible git

# 1. clone postgresql_cluster repository
git clone https://github.com/vitabaks/postgresql_cluster.git

# 2. Go to the playbook directory
cd postgresql_cluster/

# 3. Edit the inventory file
nano inventory
## Specify (non-public) IP addresses and connection settings (`ansible_user`, `ansible_ssh_pass` or `ansible_ssh_private_key_file` for your environment

# 4. Edit the variable file vars/main.yml
nano vars/main.yml
## Specify the rules to postgresql_pg_hba for Standby Cluster

# example for all networks "0.0.0.0/0":
postgresql_pg_hba:
  - { type: "local", database: "all", user: "{{ patroni_superuser_username }}", address: "", method: "trust" }
  - { type: "local", database: "all", user: "all", address: "", method: "peer" }
  - { type: "host", database: "all", user: "all", address: "127.0.0.1/32", method: "{{ postgresql_password_encryption_algorithm }}" }
  - { type: "host", database: "all", user: "all", address: "::1/128", method: "{{ postgresql_password_encryption_algorithm }}" }
  - { type: "host", database: "all", user: "all", address: "0.0.0.0/0", method: "{{ postgresql_password_encryption_algorithm }}" }
  -  { type: "host", database: "replication", user: "{{ patroni_replication_username }}", address: "0.0.0.0/0", method: "{{ postgresql_password_encryption_algorithm }}" }

# 5. deploy Cluster
ansible-playbook deploy_pgcluster.yml
  1. Deploy a Standby Patroni Cluster
# 1. Specify (non-public) IP Addresses and connection settings for Standby cluster
# (ansible_user, ansible_ssh_private_key_file)
nano inventory

# 2. Change patroni_cluster_name and set patroni_standby_cluster.host
nano vars/main.yml

Example:

patroni_cluster_name: "standby-postgres-cluster"
patroni_standby_cluster:
  host: "10.128.64.140"
  port: "5432"

# 3. deploy Standby Cluster
ansible-playbook deploy_pgcluster.yml

Example: https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/34#note_1089758878

vitabaks commented 1 year ago

@prensgold If you do the deployments in 1 time and with one patroni_cluster_name then it will be one cluster. \ If you follow the instructions above, then you will have two clusters (main and standby)

prensgold commented 1 year ago

When doing step 2, I also need to make the following parameter true, right?

postgresql_exists='true'

vitabaks commented 1 year ago

no, since it is assumed that there is no database for the standby cluster yet.

In my example, before deploying the standby cluster, you redefine the server addresses in the inventory file (remove the addresses of the main cluster servers and add new servers for the standby cluster)

Or you can create a separate inventory file

Example:

ansible-playbook -i inventory_standby deploy_pgcluster.yml
prensgold commented 1 year ago

thank you for the guidance

prensgold commented 1 year ago

I'm trying this architectural structure for the first time. I thought my questions were over, but I have to set up a logical replication between them so that this cluster can talk to each other, right? When I set up this replication between them, will the restrictions in the "31.6. Restrictions" section in the link below also apply? https://www.postgresql.org/docs/current/logical-replication-restrictions.html

how accurate do you think this architectural approach is to production? Do you think there is a more appropriate architectural approach in a 2 DC structure?

Can I also configure using your playbook to set up the replication between the two? ansible-playbook -i inventory/test.yml physical_to_logical.yml

vitabaks commented 1 year ago

A cluster that is deployed using playbooks from this repository uses physical replication and does not use logical replication. I do not recommend using logical replication due to the complexity of its maintenance (and also if you have a high write load). See https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/34#it-is-important-to-know

Standby cluster uses a physical replica, although standby leader is marked as leader, but in fact it is a read-only replica. Thus, the standby cluster is suitable only for Disaster Recovery scenario or for example if you plan to send read-only traffic to this cluster.

vitabaks commented 1 year ago

Can I also configure using your playbook to set up the replication between the two? ansible-playbook -i inventory/test.yml physical_to_logical.yml

The physical_to_logical playbook goes beyond the scope of this project (at the moment) and is used as part of the logic for major upgrade for PostgreSQL. Logical replication is configured not on a permanent basis but for several hours or days.

Can you use this playbook? Yes, you can. But I recommend using physical replication rather than logical if possible.

vitabaks commented 1 year ago

@prensgold If you need individual recommendations or support, please consider sponsorship levels - https://www.patreon.com/vitabaks

prensgold commented 1 year ago

no, since it is assumed that there is no database for the standby cluster yet.

In my example, before deploying the standby cluster, you redefine the server addresses in the inventory file (remove the addresses of the main cluster servers and add new servers for the standby cluster)

Or you can create a separate inventory file

Example:

ansible-playbook -i inventory_standby deploy_pgcluster.yml

here is how the two servers connect to each other. Because in step 1 we did not give the primary site an ip to recognize the standby site. In the same way, in step 2, we do not give ip to the standby site to recognize the primary site.

prensgold commented 1 year ago

@prensgold If you need individual recommendations or support, please consider sponsorship levels - https://www.patreon.com/vitabaks

I will mention you to my manager for consultancy recruitment in the company. But in order to receive consultancy from you, you need to establish a company in Turkey. Do you have a company that we can get consultancy from you?

vitabaks commented 1 year ago

here is how the two servers connect to each other. Because in step 1 we did not give the primary site an ip to recognize the standby site. In the same way, in step 2, we do not give ip to the standby site to recognize the primary site.

Maybe I didn't understand what exactly the problem is. In fact, deploying a standby cluster is very simple.

Do you have a company that we can get consultancy from you?

My individual consultations (1:1 with me) I provide only for sponsors on Patreon. \ For Enterprise-level support (with contract) You can contact https://postgres.ai/contact/

prensgold commented 1 year ago

Maybe I didn't understand what exactly the problem is. In fact, deploying a standby cluster is very simple.

When setting up standby cluster, we deploy with two steps. In both steps, I could not understand where we define the process for site1 and site2 servers to recognise each other.

My individual consultations (1:1 with me) I provide only for sponsors on Patreon. For Enterprise-level support (with contract) You can contact https://postgres.ai/contact/

I will examine this site.

There are many articles on this site in the disadvantages section of physical replication. https://linuxhint.com/setup-physical-replication-postgresql/ Do they really still exist as problems?

vitabaks commented 1 year ago

In both steps, I could not understand where we define the process for site1 and site2 servers to recognise each other.

at the second stage (site 2), we specify the IP address in the patroni_standby_cluster.host variable of one of the cluster servers (from site 1) that was deployed at the first stage.

prensgold commented 1 year ago

at the second stage (site 2), we specify the IP address in the patroni_standby_cluster.host variable of one of the cluster servers (from site 1) that was deployed at the first stage.

this line was really descriptive for me.

I will test the disadvantages in the link after installation.After your guidance, I start the test setup. https://linuxhint.com/setup-physical-replication-postgresql/

vitabaks commented 1 year ago

Do they really still exist as problems?

@prensgold OK let's take a look at these points. Here are my comments (briefly)толь

It replicates data asynchronously by default. That means WAL files that are not copied to the standby server may be removed when any changes are done in the primary server. It is better to set a higher value into the wal_keep_segments to prevent data loss.

There are actually two aspects involved here:

  1. Transactions with asynchronous replication and the risks of losing the latest changes in case of failure.

Due to performance reasons, a synchronous replication is disabled by default. To minimize the risk of losing data on autofailover, you can configure settings in the following way:

synchronous_mode: true  # or 'true' for enable synchronous database replication
synchronous_mode_strict: true  # if 'true' then block all client writes to the master, when a synchronous replica is not available
synchronous_node_count: 1  # number of synchronous standby databases. 

postgresql_parameters:
...
  - { option: "synchronous_commit", value: "remote_apply" }

Note that synchronous replication may be the reason for the increased time for some writing queries. Therefore, be sure to perform load testing for your database.\ If the write response time is not critical for your application then you can safely use synchronous replication or even use a quorum commit for greater reliability by specifying synchronous_node_count: 2 (if you have 3 or more replicas in the cluster).

details: https://postgresqlco.nf/doc/en/param/synchronous_commit/

  1. Risks of removing the WAL on the primary server and, as a result, stopping replication.

By default, Petroni creates replication slots for all replicas in the cluster. This saves us from the risk of WAL remove which may be necessary for a replicas, but at the same time there are risks of disk space saturation if the replication lag is too high, so be sure to take care of database monitoring.

I recommend setting the wal_keep_size or max_slot_wal_keep_size parameter to eliminate the risks associated with saturation of disk space on primary, and configure backup (for example) using pgbackrest so that WAL are available for replicas on the repository of archived WALs.

Next, monitor the archiving status in your monitoring system (to make sure that there are no archiving errors and WAL are not accumulated on the Primary database server)

It does not support replication between different PostgreSQL servers.

Yes, physical (streaming) replication is possible only within one major version. At the same time, it is possible within different minor versions (example 13.7 -> 13.9). But it is desirable that the minor versions on all database servers are identical.

The data is not more secure in this replication because the sensitive data can be extracted without user authentication.

Not in our case, because authorization is required to access the data (and for replication). And all passwords are encrypted with SCRAM-SHA-256

And of course, do not publish your databases on the Internet. Use only a secure private network for your database cluster (and applications).

It does not truncate the command. It does not support foreign tables, view, and partition root tables. It does not support large object replication. It does not support database schema and DDL commands.

All this rather refers to logical replication (although some statements are no longer true for logical replication fow new PostgreSQL versions). Physical replication is a byte-by-byte copy of the data, the data on the replicas is completely identical to the data on the primary.

Details: https://www.interdb.jp/pg/pgsql11.html

P.S.

I recommend always checking/tests, and not just trust the comments in the articles, which, by the way, may not be valid for now because PstgreSQL is developing rapidly. \ For example, try creating a view or FDW and check for the presence of these objects on the replica or standby cluster.

prensgold commented 1 year ago

How to deploy the Standby cluster:

  1. Deploy a Main Patroni Cluster (if it doesn't already exist)
# 0.Install Ansible on one control node (which could easily be a laptop)
sudo apt update && sudo apt install -y ansible git

# 1. clone postgresql_cluster repository
git clone https://github.com/vitabaks/postgresql_cluster.git

# 2. Go to the playbook directory
cd postgresql_cluster/

# 3. Edit the inventory file
nano inventory
## Specify (non-public) IP addresses and connection settings (`ansible_user`, `ansible_ssh_pass` or `ansible_ssh_private_key_file` for your environment

# 4. Edit the variable file vars/main.yml
nano vars/main.yml
## Specify the rules to postgresql_pg_hba for Standby Cluster

# example for all networks "0.0.0.0/0":
postgresql_pg_hba:
  - { type: "local", database: "all", user: "{{ patroni_superuser_username }}", address: "", method: "trust" }
  - { type: "local", database: "all", user: "all", address: "", method: "peer" }
  - { type: "host", database: "all", user: "all", address: "127.0.0.1/32", method: "{{ postgresql_password_encryption_algorithm }}" }
  - { type: "host", database: "all", user: "all", address: "::1/128", method: "{{ postgresql_password_encryption_algorithm }}" }
  - { type: "host", database: "all", user: "all", address: "0.0.0.0/0", method: "{{ postgresql_password_encryption_algorithm }}" }
  -  { type: "host", database: "replication", user: "{{ patroni_replication_username }}", address: "0.0.0.0/0", method: "{{ postgresql_password_encryption_algorithm }}" }

# 5. deploy Cluster
ansible-playbook deploy_pgcluster.yml
  1. Deploy a Standby Patroni Cluster
# 1. Specify (non-public) IP Addresses and connection settings for Standby cluster
# (ansible_user, ansible_ssh_private_key_file)
nano inventory

# 2. Change patroni_cluster_name and set patroni_standby_cluster.host
nano vars/main.yml

Example:

patroni_cluster_name: "standby-postgres-cluster"
patroni_standby_cluster:
  host: "10.128.64.140"
  port: "5432"

# 3. deploy Standby Cluster
ansible-playbook deploy_pgcluster.yml

Example: https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/34#note_1089758878

When deploying standby cluster, it gives the following error for the patroni step.

"FAILED - RETRYING: Check that the patroni is healthy on the replica server (1200 retries left). FAILED - RETRYING: Check that the patroni is healthy on the replica server (1200 retries left)."

I added standby cluster replica server ip's on the master side in pg_hba.conf content,error fixed. In standby cluster setup, standby replica ip should be added to main cluster pg_hba.conf content. This information can be added to the documentation.

but in the next steps I got error like below.

TASK [pgbouncer/config : Get users and password md5 from pg_shadow] ***** dict object' has no attribute 'master'"}

I could not find a solution on internet. Have you experienced this problem with standby cluster setup before?

vitabaks commented 1 year ago

In standby cluster setup, standby replica ip should be added to main cluster pg_hba.conf content. This information can be added to the documentation.

Ok, thanks! commit: https://github.com/vitabaks/postgresql_cluster/commit/ca77724a507b3e85cbaa8933ede78539e94a5afe

Have you experienced this problem with standby cluster setup before?

No I have not encountered a similar problem when preparing a standby cluster.

@prensgold What changes have you made to the inventory file? \ Is there a master group there and is the server listed in this group?

prensgold commented 1 year ago

No I have not encountered a similar problem when preparing a standby cluster.

@prensgold What changes have you made to the inventory file? Is there a master group there and is the server listed in this group?

I edited the inventory file as follows.


# "postgresql_exists='true'" if PostgreSQL is already exists and running
# "hostname=" variable is optional (used to change the server name)

# In this example, all components will be installed on PostgreSQL nodes.
# You can deploy the haproxy balancers and the etcd or consul cluster on other dedicated servers (recomended).

# if dcs_exists: false and dcs_type: "etcd"
[etcd_cluster]  # recommendation: 3, or 5-7 nodes
192.168.32.36
192.168.32.37
192.168.32.38

# if dcs_exists: false and dcs_type: "consul"
[consul_instances]  # recommendation: 3 or 5-7 nodes
#10.128.64.140 consul_node_role=server consul_bootstrap_expect=true consul_datacenter=dc1
#10.128.64.142 consul_node_role=server consul_bootstrap_expect=true consul_datacenter=dc1
#10.128.64.143 consul_node_role=server consul_bootstrap_expect=true consul_datacenter=dc1
#10.128.64.144 consul_node_role=client consul_datacenter=dc1
#10.128.64.145 consul_node_role=client consul_datacenter=dc2

# if with_haproxy_load_balancing: true
[balancers]
192.168.32.36
192.168.32.37
192.168.32.38

# PostgreSQL nodes
[replica]
192.168.32.36 hostname=ltstpgrhap01 postgresql_exists='false'
192.168.32.37 hostname=ltstpgrhap02 postgresql_exists='false'
192.168.32.38 hostname=ltstpgrhap03 postgresql_exists='false'

[postgres_cluster:children]
replica

# if pgbackrest_install: true and "repo_host" is set
[pgbackrest]  # optional (Dedicated Repository Host)

# Connection settings
.
.
.

Since the master server information is in "vars/main.yml", I did not add it to the inventory file. Do I need to add?

vitabaks commented 1 year ago

@prensgold Please return the inventory file to its original state because the code assumes the presence of the master group.

Example:

# if dcs_exists: false and dcs_type: "etcd"
[etcd_cluster]
192.168.32.36
192.168.32.37
192.168.32.38

# if dcs_exists: false and dcs_type: "consul"
[consul_instances]

# if with_haproxy_load_balancing: true
[balancers]
192.168.32.36
192.168.32.37
192.168.32.38

# PostgreSQL nodes
[master]
192.168.32.36 hostname=ltstpgrhap01 postgresql_exists='false'

[replica]
192.168.32.37 hostname=ltstpgrhap02 postgresql_exists='false'
192.168.32.38 hostname=ltstpgrhap03 postgresql_exists='false'

[postgres_cluster:children]
master
replica

# if pgbackrest_install: true and "repo_host" is set
[pgbackrest]

# Connection settings
[all:vars]
ansible_connection='ssh'
ansible_ssh_port='22'
ansible_user='root'
ansible_ssh_pass='secretpassword'  # "sshpass" package is required for use "ansible_ssh_pass"
#ansible_ssh_private_key_file=
#ansible_python_interpreter='/usr/bin/python3'  # is required for use python3
prensgold commented 1 year ago

@prensgold Please return the inventory file to its original state because the code assumes the presence of the master group.

Example:

# if dcs_exists: false and dcs_type: "etcd"
[etcd_cluster]
192.168.32.36
192.168.32.37
192.168.32.38

# if dcs_exists: false and dcs_type: "consul"
[consul_instances]

# if with_haproxy_load_balancing: true
[balancers]
192.168.32.36
192.168.32.37
192.168.32.38

# PostgreSQL nodes
[master]
192.168.32.36 hostname=ltstpgrhap01 postgresql_exists='false'

[replica]
192.168.32.37 hostname=ltstpgrhap02 postgresql_exists='false'
192.168.32.38 hostname=ltstpgrhap03 postgresql_exists='false'

[postgres_cluster:children]
master
replica

# if pgbackrest_install: true and "repo_host" is set
[pgbackrest]

# Connection settings
[all:vars]
ansible_connection='ssh'
ansible_ssh_port='22'
ansible_user='root'
ansible_ssh_pass='secretpassword'  # "sshpass" package is required for use "ansible_ssh_pass"
#ansible_ssh_private_key_file=
#ansible_python_interpreter='/usr/bin/python3'  # is required for use python3

Thanks for example inventory file,that's how I was able to install it.

(6h delay replica server)

after cluster deployment, add recovery_min_apply_delay parameter for this server.

alter system set recovery_min_apply_delay='6h';
select pg_reload_conf();

https://postgresqlco.nf/doc/en/param/recovery_min_apply_delay/

Unfortunately, I did not succeed in the above test in that way. I received a warning as follows.

2023-02-20 11:14:03 +03 [68944]: [126-1] db=,user=,app=,client= LOG: received SIGHUP, reloading configuration files 2023-02-20 11:14:03 +03 [68944]: [127-1] db=,user=,app=,client= LOG: parameter "recovery_min_apply_delay" removed from configuration file, reset to default

i could do this by setting recovery_min_apply_delay to the value 1h via patroni.yaml, postgresql.parameters on one of the standby nodes.

for another test scenario, I wanted to stop replication to analyse the wrong dml or ddl query run on the primary server on the delay server. I was not successful when I ran the following commands.

SELECT pg_wal_replay_pause(); SELECT pg_get_wal_replay_pause_state();

I seem to be missing something here too.

vitabaks commented 1 year ago

i could do this by setting recovery_min_apply_delay to the value 1h via patroni.yaml, postgresql.parameters on one of the standby nodes.

Yes that's right.

Thanks for example inventory file,that's how I was able to install it.

I am glad that you succeeded in deploying a standby cluster.