Closed JY-210 closed 1 month ago
I modified etcd_data_dir in vars/main.yml from /var/lib/etcd to /data/var/lib/etcd.
I recommend leaving the etcd on the system disk and the database data on a dedicated one
Avoid storing etcd or consul data on the same disk as other resource-intensive processes, like the database. Ensure that etcd/consul and PostgreSQL data are stored on separate disks (etcd_data_dir, consul_data_path variables), and use SSDs if possible. https://postgresql-cluster.org/docs/overview/recommendations#distributed-consensus-store-dcs
But the etcd will fail to start.
Please attach etcd logs. https://postgresql-cluster.org/docs/management/basic-commands#etcd
Beside, I tried to change postgresql_data_dir in vars/Debian.yml into /data/var/lib/postgresql/...., but it shows Failed to get information on remote file (/data/var/lib/postgresql/.pgpass): Permission denied.
Do not change the value of postgresql_home_dir
variable or provide the postgres
owner for this directory.
@JY-210 related PR for pgpass - https://github.com/vitabaks/postgresql_cluster/pull/791
hi @vitabaks ,
Thanks for the reply.
My ETCD and PostgresNode are on different servers, so their data wont be at the same disk. In this case, is changing etcd_data_dir to /data/var/lib/etcd should work?
I will try on the solution for pgpass that you mentioned.
Thank you.
In this case, is changing etcd_data_dir to /data/var/lib/etcd should work?
yes, it should work and if there is an error, then logs are needed for analysis.
I face similar issue on task Prepare PostgreSQL | generate default postgresql config files
I modified Debian.yml file to add in "/data", but postgresql_home_dir remain unchange.
# PostgreSQL variables
postgresql_cluster_name: "main"
# You can specify custom data dir path. Example: "/pgdata/{{ postgresql_version }}/main"
postgresql_data_dir: "/data/var/lib/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
# You can specify custom WAL dir path. Example: "/pgwal/{{ postgresql_version }}/pg_wal"
postgresql_wal_dir: "" # if defined, symlink will be created [optional]
postgresql_conf_dir: "/data/etc/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
postgresql_bin_dir: "/data/usr/lib/postgresql/{{ postgresql_version }}/bin"
postgresql_log_dir: "/data/var/log/postgresql"
postgresql_unix_socket_dir: "/data/var/run/postgresql"
postgresql_home_dir: "/var/lib/postgresql"
This is the error hit.
FAILED! => {"changed": true, "cmd": ["/usr/bin/pg_createcluster", "16", "main", "-d", "/data/var/lib/postgresql/16/main", "-p", "5432", "-e", "UTF8", "--locale", "en_US.UTF-8"], "delta": "0:00:01.422409", "end": "2024-10-15 18:40:26.585297", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2024-10-15 18:40:25.162888", "stderr": "initdb: error: could not access directory \"/data/var/lib/postgresql/16/main\": Permission denied\nError: initdb failed", "stderr_lines": ["initdb: error: could not access directory \"/data/var/lib/postgresql/16/main\": Permission denied", "Error: initdb failed"], "stdout": "Creating new PostgreSQL cluster 16/main ...\n/usr/lib/postgresql/16/bin/initdb -D /data/var/lib/postgresql/16/main --auth-local peer --auth-host scram-sha-256 --no-instructions --encoding UTF8 --locale en_US.UTF-8\nThe files belonging to this database system will be owned by user \"postgres\".\nThis user must also own the server process.\n\nThe database cluster will be initialized with locale \"en_US.UTF-8\".\nThe default text search configuration will be set to \"english\".\n\nData page checksums are disabled.", "stdout_lines": ["Creating new PostgreSQL cluster 16/main ...", "/usr/lib/postgresql/16/bin/initdb -D /data/var/lib/postgresql/16/main --auth-local peer --auth-host scram-sha-256 --no-instructions --encoding UTF8 --locale en_US.UTF-8", "The files belonging to this database system will be owned by user \"postgres\".", "This user must also own the server process.", "", "The database cluster will be initialized with locale \"en_US.UTF-8\".", "The default text search configuration will be set to \"english\".", "", "Data page checksums are disabled."]}
I tried same solution to run the task as root user by editing patroni main.yml file but no luck.
- name: Prepare PostgreSQL | generate default postgresql config files
become: true
become_user: root
ansible.builtin.command: >
/usr/bin/pg_createcluster {{ postgresql_version }} {{ postgresql_cluster_name }}
-d {{ postgresql_data_dir }}
-p {{ postgresql_port }}
-e {{ postgresql_encoding }}
--locale {{ postgresql_locale }}
register: pg_createcluster_result
failed_when: pg_createcluster_result.rc != 0
when: (ansible_os_family == "Debian" and
postgresql_packages|join(" ") is not search("postgrespro")) and
not postgresql_conf_file.stat.exists
Am I doing something wrong on the file?
I tried same solution to run the task as root user by editing patroni main.yml file but no luck.
Judging by the log, there is a problem at the pg_createcluster stage and not when creating the pgpass file, so this is not related.
"initdb: error: could not access directory \"/data/var/lib/postgresql/16/main\": Permission denied
But it is not yet clear why there is no access to the directory, although we first create a directory with the necessary rights.
ls -la /data
ls -l /data/var/lib/postgresql/16/main
P.S.
You've chosen a slightly strange data path, I usually use something like:
postgresql_data_dir: "/data/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
I am facing a weird issue now. No matter what path I specified for postgresql_data_dir, the path for master and replica nodes will be different and cause the cluster to fail.
This is the path I specify:
# PostgreSQL variables
postgresql_cluster_name: "main"
# You can specify custom data dir path. Example: "/pgdata/{{ postgresql_version }}/main"
postgresql_data_dir: "/data/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
# You can specify custom WAL dir path. Example: "/pgwal/{{ postgresql_version }}/pg_wal"
postgresql_wal_dir: "" # if defined, symlink will be created [optional]
postgresql_conf_dir: "/etc/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
postgresql_bin_dir: "/usr/lib/postgresql/{{ postgresql_version }}/bin"
postgresql_log_dir: "/var/log/postgresql"
postgresql_unix_socket_dir: "/var/run/postgresql"
postgresql_home_dir: "/var/lib/postgresql"
But on my pgnodes, the data path is different. Master node:
Replica nodes:
This is the error at replica nodes
Have you saw this before that the data path is inconsistent among the nodes?
This should not be the case, since the variable is common to all nodes, then the data path should be set the same for all nodes.
So that I can help you, please attach the following
patroni.yml
(from all nodes)Please find attached ansible log and configuration file patroni.yml (from all nodes). ansible log.txt pgnode01_patroni.txt pgnode02_patroni.txt pgnode03_patroni.txt
Sorry to say I can't attach the entire postgresql_cluster directory because it is on protected VM and I failed to scp out. Below are changes that I made apart from host ip and password.
# PostgreSQL variables
postgresql_cluster_name: "main"
# You can specify custom data dir path. Example: "/pgdata/{{ postgresql_version }}/main"
postgresql_data_dir: "/data/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
# You can specify custom WAL dir path. Example: "/pgwal/{{ postgresql_version }}/pg_wal"
postgresql_wal_dir: "" # if defined, symlink will be created [optional]
postgresql_conf_dir: "/etc/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
postgresql_bin_dir: "/usr/lib/postgresql/{{ postgresql_version }}/bin"
postgresql_log_dir: "/var/log/postgresql"
postgresql_unix_socket_dir: "/var/run/postgresql"
postgresql_home_dir: "/var/lib/postgresql"
For your information, I am setting up cluster with 3 pgnodes and 3 etcd without haproxy. And I am using version 1.11.0 branch.
hi @vitabaks ,
If I use console GUI to deploy the cluster, is it possible to change the postgresql_data_dir?
If I use console GUI to deploy the cluster, is it possible to change the postgresql_data_dir?
Not at the moment. This will be provided in one of the next releases where there will be an Expert Mode.
A strange situation, I checked the patroni configuration and there is one path for all nodes
data_dir: /data/postgresql/16/main
Try to remove PGDATA and cluster information from DCS and deploy the cluster again
pg_dropcluster 16 main
rm -rf /data/postgresql/16/main
patronictl remove postgres-cluster
I manage to recreate the cluster after uninstall all postgresql and patroni package.
My cluster now have 3 nodes - 1 master, 1 synchronous replica and 1 asynchronous replica. I had also configured a cluster VIP.
When I failover from master to synchronous replica, it works well. and the master node will become synchronous replica now.
But when failover to asynchronous replica, both replica become asynchronous and the master (initially asynchronous replica) and VIP are not working, can't ping and telnet. It also fail to failover to master node as well.
Do you have any idea on this?
If you are using vip-manager, make sure that the version is not lower than 2.6.0
Yes im using version 2.6.0.
These are the errors when trying to failover to original master
Are you sure you want to failover cluster postgres-cluster, demoting current leader pgnode03? [y/N]: y
2024-10-17 21:42:05,718 - WARNING - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2091f215a0>, 'Connection to 192.168.120.31 timed out. (connect timeout=<object object at 0x7f20942f8fc0>)')': /failover
2024-10-17 21:42:35,926 - WARNING - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2091f21630>, 'Connection to 192.168.120.31 timed out. (connect timeout=<object object at 0x7f20942f8fc0>)')': /failover
2024-10-17 21:43:06,134 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2091f217e0>, 'Connection to 192.168.120.31 timed out. (connect timeout=<object object at 0x7f20942f8fc0>)')': /failover
2024-10-17 21:43:36,342 - ERROR - None
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
conn = connection.create_connection(
File "/usr/local/lib/python3.10/dist-packages/patroni/dcs/etcd.py", line 608, in create_connection_patched
raise err
File "/usr/local/lib/python3.10/dist-packages/patroni/dcs/etcd.py", line 598, in create_connection_patched
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 700, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 395, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 234, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/lib/python3.10/http/client.py", line 1283, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
self.send(msg)
File "/usr/lib/python3.10/http/client.py", line 976, in send
self.connect()
File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 200, in connect
conn = self._new_conn()
File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 174, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPConnection object at 0x7f2091f21900>, 'Connection to 192.168.120.31 timed out. (connect timeout=<object object at 0x7f20942f8fc0>)')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/patroni/ctl.py", line 1327, in _do_failover_or_switchover
r = request_patroni(member, 'post', action, failover_value)
File "/usr/local/lib/python3.10/dist-packages/patroni/ctl.py", line 375, in request_patroni
return request_executor(member, method, endpoint, data)
File "/usr/local/lib/python3.10/dist-packages/patroni/request.py", line 171, in __call__
return self.request(method, url, data, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/patroni/request.py", line 152, in request
return self._pool.request(method.upper(), url, body=body, **kwargs)
File "/usr/lib/python3/dist-packages/urllib3/request.py", line 78, in request
return self.request_encode_body(
File "/usr/lib/python3/dist-packages/urllib3/request.py", line 170, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 375, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 784, in urlopen
return self.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 784, in urlopen
return self.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 784, in urlopen
return self.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='192.168.120.31', port=8008): Max retries exceeded with url: /failover (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2091f21900>, 'Connection to 192.168.120.31 timed out. (connect timeout=<object object at 0x7f20942f8fc0>)'))
2024-10-17 21:43:36,344 - WARNING - Failing over to DCS
2024-10-17 21:43:36.34490 Could not failover using Patroni api, falling back to DCS
The status of patroni and vip-manager looks healthy on current master (originally async replica).
However, the other replica (originally master) cant connect to current master.
hi @vitabaks ,
TASK [patroni : Start patroni service on Replica servers] **********************************************************************************************************************************************************************************
changed: [192.168.120.31]
changed: [192.168.7.32]
TASK [patroni : Wait for port 8008 to become open on the host] *****************************************************************************************************************************************************************************
ok: [192.168.120.31]
ok: [192.168.7.32]
FAILED - RETRYING: [192.168.120.31]: Check that the patroni is healthy on the replica server (1200 retries left).
TASK [patroni : Check that the patroni is healthy on the replica server] *******************************************************************************************************************************************************************
ok: [192.168.7.32]
ok: [192.168.120.31]
Would like to double check with you, is the line FAILED - RETRYING: [192.168.120.31]: Check that the patroni is healthy on the replica server (1200 retries left).
matters?
If you see this error, then the Patroni REST API for endpoint '/health
' did not return 200. Next, study the Patroni/Postgres logs to understand the reason.
Code: https://github.com/vitabaks/postgresql_cluster/blob/1.11.0/roles/patroni/tasks/main.yml#L948
Those issues seems due to my pgnodes are on different VLAN and subnets, they can't use the same VIP when failover. Is assigning 2 VIPs possible?
Is assigning 2 VIPs possible?
Yes, you can make two groups of HAProxy load balancers using keepalived. To do this, create two inventory files and specify the cluster_vip
variable on the armor of the balancers group. Next, use balancers.yml
playbook to deploy load balancers.
This method is often used when creating a cluster in several data centers/regions.
P.S. If you require individual support or assistance in implementing this solution, please consider joining our sponsorship program, which offers tailored support for sponsors https://postgresql-cluster.org/docs/sponsor
I had found another approach for my cluster. Appreciate and thanks for the assistance!
Hi @vitabaks ,
I am trying to deploy the cluster on /data path. I modified
etcd_data_dir
in vars/main.yml from/var/lib/etcd
to/data/var/lib/etcd
. But the etcd will fail to start. Beside, I tried to changepostgresql_data_dir
in vars/Debian.yml into/data/var/lib/postgresql/....
, but it showsFailed to get information on remote file (/data/var/lib/postgresql/.pgpass): Permission denied
.Please find below images:
May I know if have any other files that I need to change etcd and postgresql directory to make this work?
Thank you.