vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on Patroni). Automating with Ansible.
https://postgresql-cluster.org
MIT License
1.63k stars 407 forks source link

Unable to change path of postgresql_data_dir and etcd_data_dir #790

Open JY-210 opened 6 days ago

JY-210 commented 6 days ago

Hi @vitabaks ,

I am trying to deploy the cluster on /data path. I modified etcd_data_dir in vars/main.yml from /var/lib/etcd to/data/var/lib/etcd. But the etcd will fail to start. Beside, I tried to change postgresql_data_dir in vars/Debian.yml into /data/var/lib/postgresql/...., but it shows Failed to get information on remote file (/data/var/lib/postgresql/.pgpass): Permission denied.

Please find below images: image image

May I know if have any other files that I need to change etcd and postgresql directory to make this work?

Thank you.

vitabaks commented 5 days ago

I modified etcd_data_dir in vars/main.yml from /var/lib/etcd to /data/var/lib/etcd.

I recommend leaving the etcd on the system disk and the database data on a dedicated one

Avoid storing etcd or consul data on the same disk as other resource-intensive processes, like the database. Ensure that etcd/consul and PostgreSQL data are stored on separate disks (etcd_data_dir, consul_data_path variables), and use SSDs if possible. https://postgresql-cluster.org/docs/overview/recommendations#distributed-consensus-store-dcs

But the etcd will fail to start.

Please attach etcd logs. https://postgresql-cluster.org/docs/management/basic-commands#etcd

Beside, I tried to change postgresql_data_dir in vars/Debian.yml into /data/var/lib/postgresql/...., but it shows Failed to get information on remote file (/data/var/lib/postgresql/.pgpass): Permission denied.

Do not change the value of postgresql_home_dir variable or provide the postgres owner for this directory.

vitabaks commented 5 days ago

@JY-210 related PR for pgpass - https://github.com/vitabaks/postgresql_cluster/pull/791

JY-210 commented 5 days ago

hi @vitabaks ,

Thanks for the reply.

My ETCD and PostgresNode are on different servers, so their data wont be at the same disk. In this case, is changing etcd_data_dir to /data/var/lib/etcd should work?

I will try on the solution for pgpass that you mentioned.

Thank you.

vitabaks commented 5 days ago

In this case, is changing etcd_data_dir to /data/var/lib/etcd should work?

yes, it should work and if there is an error, then logs are needed for analysis.

JY-210 commented 5 days ago

I face similar issue on task Prepare PostgreSQL | generate default postgresql config files

I modified Debian.yml file to add in "/data", but postgresql_home_dir remain unchange.

# PostgreSQL variables
postgresql_cluster_name: "main"
# You can specify custom data dir path. Example: "/pgdata/{{ postgresql_version }}/main"
postgresql_data_dir: "/data/var/lib/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
# You can specify custom WAL dir path. Example: "/pgwal/{{ postgresql_version }}/pg_wal"
postgresql_wal_dir: ""  # if defined, symlink will be created [optional]
postgresql_conf_dir: "/data/etc/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
postgresql_bin_dir: "/data/usr/lib/postgresql/{{ postgresql_version }}/bin"
postgresql_log_dir: "/data/var/log/postgresql"
postgresql_unix_socket_dir: "/data/var/run/postgresql"
postgresql_home_dir: "/var/lib/postgresql"

This is the error hit. FAILED! => {"changed": true, "cmd": ["/usr/bin/pg_createcluster", "16", "main", "-d", "/data/var/lib/postgresql/16/main", "-p", "5432", "-e", "UTF8", "--locale", "en_US.UTF-8"], "delta": "0:00:01.422409", "end": "2024-10-15 18:40:26.585297", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2024-10-15 18:40:25.162888", "stderr": "initdb: error: could not access directory \"/data/var/lib/postgresql/16/main\": Permission denied\nError: initdb failed", "stderr_lines": ["initdb: error: could not access directory \"/data/var/lib/postgresql/16/main\": Permission denied", "Error: initdb failed"], "stdout": "Creating new PostgreSQL cluster 16/main ...\n/usr/lib/postgresql/16/bin/initdb -D /data/var/lib/postgresql/16/main --auth-local peer --auth-host scram-sha-256 --no-instructions --encoding UTF8 --locale en_US.UTF-8\nThe files belonging to this database system will be owned by user \"postgres\".\nThis user must also own the server process.\n\nThe database cluster will be initialized with locale \"en_US.UTF-8\".\nThe default text search configuration will be set to \"english\".\n\nData page checksums are disabled.", "stdout_lines": ["Creating new PostgreSQL cluster 16/main ...", "/usr/lib/postgresql/16/bin/initdb -D /data/var/lib/postgresql/16/main --auth-local peer --auth-host scram-sha-256 --no-instructions --encoding UTF8 --locale en_US.UTF-8", "The files belonging to this database system will be owned by user \"postgres\".", "This user must also own the server process.", "", "The database cluster will be initialized with locale \"en_US.UTF-8\".", "The default text search configuration will be set to \"english\".", "", "Data page checksums are disabled."]}

I tried same solution to run the task as root user by editing patroni main.yml file but no luck.

    - name: Prepare PostgreSQL | generate default postgresql config files
      become: true
      become_user: root
      ansible.builtin.command: >
        /usr/bin/pg_createcluster {{ postgresql_version }} {{ postgresql_cluster_name }}
        -d {{ postgresql_data_dir }}
        -p {{ postgresql_port }}
        -e {{ postgresql_encoding }}
        --locale {{ postgresql_locale }}
      register: pg_createcluster_result
      failed_when: pg_createcluster_result.rc != 0
      when: (ansible_os_family == "Debian" and
            postgresql_packages|join(" ") is not search("postgrespro")) and
            not postgresql_conf_file.stat.exists

Am I doing something wrong on the file?

vitabaks commented 5 days ago

I tried same solution to run the task as root user by editing patroni main.yml file but no luck.

Judging by the log, there is a problem at the pg_createcluster stage and not when creating the pgpass file, so this is not related.

"initdb: error: could not access directory \"/data/var/lib/postgresql/16/main\": Permission denied

But it is not yet clear why there is no access to the directory, although we first create a directory with the necessary rights.

  1. Attach the full ansible log
  2. Show the result of the command execution
ls -la /data
ls -l /data/var/lib/postgresql/16/main

P.S.

You've chosen a slightly strange data path, I usually use something like:

postgresql_data_dir: "/data/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
JY-210 commented 5 days ago

I am facing a weird issue now. No matter what path I specified for postgresql_data_dir, the path for master and replica nodes will be different and cause the cluster to fail.

This is the path I specify:

# PostgreSQL variables
postgresql_cluster_name: "main"
# You can specify custom data dir path. Example: "/pgdata/{{ postgresql_version }}/main"
postgresql_data_dir: "/data/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
# You can specify custom WAL dir path. Example: "/pgwal/{{ postgresql_version }}/pg_wal"
postgresql_wal_dir: ""  # if defined, symlink will be created [optional]
postgresql_conf_dir: "/etc/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
postgresql_bin_dir: "/usr/lib/postgresql/{{ postgresql_version }}/bin"
postgresql_log_dir: "/var/log/postgresql"
postgresql_unix_socket_dir: "/var/run/postgresql"
postgresql_home_dir: "/var/lib/postgresql"

But on my pgnodes, the data path is different. Master node: image

Replica nodes: image image

This is the error at replica nodes image

Have you saw this before that the data path is inconsistent among the nodes?

vitabaks commented 5 days ago

This should not be the case, since the variable is common to all nodes, then the data path should be set the same for all nodes.

So that I can help you, please attach the following

  1. full ansible log
  2. configuration file patroni.yml (from all nodes)
  3. entire postgresql_cluster directory (to investigate all changes)
JY-210 commented 4 days ago

Please find attached ansible log and configuration file patroni.yml (from all nodes). ansible log.txt pgnode01_patroni.txt pgnode02_patroni.txt pgnode03_patroni.txt

Sorry to say I can't attach the entire postgresql_cluster directory because it is on protected VM and I failed to scp out. Below are changes that I made apart from host ip and password.

# PostgreSQL variables
postgresql_cluster_name: "main"
# You can specify custom data dir path. Example: "/pgdata/{{ postgresql_version }}/main"
postgresql_data_dir: "/data/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
# You can specify custom WAL dir path. Example: "/pgwal/{{ postgresql_version }}/pg_wal"
postgresql_wal_dir: ""  # if defined, symlink will be created [optional]
postgresql_conf_dir: "/etc/postgresql/{{ postgresql_version }}/{{ postgresql_cluster_name }}"
postgresql_bin_dir: "/usr/lib/postgresql/{{ postgresql_version }}/bin"
postgresql_log_dir: "/var/log/postgresql"
postgresql_unix_socket_dir: "/var/run/postgresql"
postgresql_home_dir: "/var/lib/postgresql"

For your information, I am setting up cluster with 3 pgnodes and 3 etcd without haproxy. And I am using version 1.11.0 branch.

JY-210 commented 4 days ago

hi @vitabaks ,

If I use console GUI to deploy the cluster, is it possible to change the postgresql_data_dir?

vitabaks commented 4 days ago

If I use console GUI to deploy the cluster, is it possible to change the postgresql_data_dir?

Not at the moment. This will be provided in one of the next releases where there will be an Expert Mode.

vitabaks commented 3 days ago

A strange situation, I checked the patroni configuration and there is one path for all nodes

data_dir: /data/postgresql/16/main

Try to remove PGDATA and cluster information from DCS and deploy the cluster again

pg_dropcluster 16 main
rm -rf /data/postgresql/16/main
patronictl remove postgres-cluster
JY-210 commented 3 days ago

I manage to recreate the cluster after uninstall all postgresql and patroni package.

My cluster now have 3 nodes - 1 master, 1 synchronous replica and 1 asynchronous replica. I had also configured a cluster VIP.

When I failover from master to synchronous replica, it works well. and the master node will become synchronous replica now. image

But when failover to asynchronous replica, both replica become asynchronous and the master (initially asynchronous replica) and VIP are not working, can't ping and telnet. It also fail to failover to master node as well. image

Do you have any idea on this?

vitabaks commented 3 days ago

If you are using vip-manager, make sure that the version is not lower than 2.6.0

JY-210 commented 3 days ago

Yes im using version 2.6.0.

image

These are the errors when trying to failover to original master

Are you sure you want to failover cluster postgres-cluster, demoting current leader pgnode03? [y/N]: y
2024-10-17 21:42:05,718 - WARNING - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2091f215a0>, 'Connection to 192.168.120.31 timed out. (connect timeout=<object object at 0x7f20942f8fc0>)')': /failover
2024-10-17 21:42:35,926 - WARNING - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2091f21630>, 'Connection to 192.168.120.31 timed out. (connect timeout=<object object at 0x7f20942f8fc0>)')': /failover
2024-10-17 21:43:06,134 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2091f217e0>, 'Connection to 192.168.120.31 timed out. (connect timeout=<object object at 0x7f20942f8fc0>)')': /failover
2024-10-17 21:43:36,342 - ERROR - None
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.10/dist-packages/patroni/dcs/etcd.py", line 608, in create_connection_patched
    raise err
  File "/usr/local/lib/python3.10/dist-packages/patroni/dcs/etcd.py", line 598, in create_connection_patched
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 700, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 395, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 174, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPConnection object at 0x7f2091f21900>, 'Connection to 192.168.120.31 timed out. (connect timeout=<object object at 0x7f20942f8fc0>)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/ctl.py", line 1327, in _do_failover_or_switchover
    r = request_patroni(member, 'post', action, failover_value)
  File "/usr/local/lib/python3.10/dist-packages/patroni/ctl.py", line 375, in request_patroni
    return request_executor(member, method, endpoint, data)
  File "/usr/local/lib/python3.10/dist-packages/patroni/request.py", line 171, in __call__
    return self.request(method, url, data, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/patroni/request.py", line 152, in request
    return self._pool.request(method.upper(), url, body=body, **kwargs)
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 78, in request
    return self.request_encode_body(
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 170, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 784, in urlopen
    return self.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 784, in urlopen
    return self.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 784, in urlopen
    return self.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='192.168.120.31', port=8008): Max retries exceeded with url: /failover (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f2091f21900>, 'Connection to 192.168.120.31 timed out. (connect timeout=<object object at 0x7f20942f8fc0>)'))
2024-10-17 21:43:36,344 - WARNING - Failing over to DCS
2024-10-17 21:43:36.34490 Could not failover using Patroni api, falling back to DCS

The status of patroni and vip-manager looks healthy on current master (originally async replica). image

However, the other replica (originally master) cant connect to current master. image

JY-210 commented 2 days ago

hi @vitabaks ,

TASK [patroni : Start patroni service on Replica servers] **********************************************************************************************************************************************************************************
changed: [192.168.120.31]
changed: [192.168.7.32]

TASK [patroni : Wait for port 8008 to become open on the host] *****************************************************************************************************************************************************************************
ok: [192.168.120.31]
ok: [192.168.7.32]
FAILED - RETRYING: [192.168.120.31]: Check that the patroni is healthy on the replica server (1200 retries left).

TASK [patroni : Check that the patroni is healthy on the replica server] *******************************************************************************************************************************************************************
ok: [192.168.7.32]
ok: [192.168.120.31]

Would like to double check with you, is the line FAILED - RETRYING: [192.168.120.31]: Check that the patroni is healthy on the replica server (1200 retries left). matters?

vitabaks commented 2 days ago

If you see this error, then the Patroni REST API for endpoint '/health' did not return 200. Next, study the Patroni/Postgres logs to understand the reason.

Code: https://github.com/vitabaks/postgresql_cluster/blob/1.11.0/roles/patroni/tasks/main.yml#L948