vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.

Failing at task etcd: Enable and start etcd service #358

Closed: zuhataslan closed this issue 10 months ago

zuhataslan commented 11 months ago

Hi, when running the playbook I get the following error for each host:

FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}

Output of journalctl:

● etcd.service - Etcd Server
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2023-05-29 16:01:57 CEST; 17min ago
  Process: 340151 ExecStart=/bin/bash -c GOMAXPROCS=$(nproc) /usr/local/bin/etcd (code=exited, status=1/FAILURE)
 Main PID: 340151 (code=exited, status=1/FAILURE)

etcd.service: Service RestartSec=100ms expired, scheduling restart.
systemd[1]: etcd.service: Scheduled restart job, restart counter is at 5.
systemd[1]: Stopped Etcd Server.
systemd[1]: etcd.service: Start request repeated too quickly.
systemd[1]: etcd.service: Failed with result 'exit-code'.

I tried manually restarting the service, but got the same error. However, if I manually run /bin/bash -c "GOMAXPROCS=$(nproc) /usr/local/bin/etcd", I don't get any errors and it 'seems' to work:

{"level":"info","ts":"2023-05-29T21:45:14.443+0200","caller":"etcdserver/server.go:2062","msg":"published local member to cluster through raft","local-member-id":"8e9e05c52164694d","local-member-attributes":"{Name:default ClientURLs:[http://localhost:2379]}","request-path":"/0/members/8e9e05c52164694d/attributes","cluster-id":"cdf818194e3a8c32","publish-timeout":"7s"} {"level":"info","ts":"2023-05-29T21:45:14.444+0200","caller":"embed/serve.go:100","msg":"ready to serve client requests"} {"level":"info","ts":"2023-05-29T21:45:14.444+0200","caller":"etcdmain/main.go:44","msg":"notifying init daemon"} {"level":"info","ts":"2023-05-29T21:45:14.444+0200","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}

Any ideas?

Remote host:

vitabaks commented 11 months ago

Hi @zuhataslan

Please provide more data from the etcd log:

sudo journalctl -u etcd -n 100 --output=short-precise

zuhataslan commented 11 months ago

I think I found the problem, although I don't yet understand the cause. Every time I run the playbook, the hostname of node3 is set to the same hostname as node2.

May 30 00:54:29.892361 node02 bash[359834]: {"level":"fatal","ts":"2023-05-30T00:54:29.892+0200","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"--initial-cluster has node02=http://192.168.2.35:2380 but missing from --initial-advertise-peer-urls=http://192.168.2.36:2380 (len([\"http://192.168.2.36:2380\"]) != len([\"http://192.168.2.35:2380\" \"http://192.168.2.36:2380\"]))","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:32\nruntime.main\n\truntime/proc.go:255"}

vitabaks commented 11 months ago

Please check the hostname variables in the inventory file; they must be unique for each host.
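
One quick way to check (an ad-hoc command suggested here for illustration, not something from the playbook; adjust the inventory path to yours) is to print the actual hostname of every node in the group and look for duplicates:

ansible etcd_cluster -i inventory -m command -a hostname

If node2 and node3 print the same name, that matches the --initial-cluster mismatch in the etcd log above.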

vitabaks commented 11 months ago

@zuhataslan did you manage to deploy the etcd cluster?

Vikontrol commented 11 months ago

I have the same problem:

Jun 05 16:21:24.347675 Project-data-s1-v1 systemd[1]: Stopped Etcd Server.
Jun 05 16:21:24.349386 Project-data-s1-v1 systemd[1]: Starting Etcd Server...
Jun 05 16:21:24.372370 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ADVERTISE_CLIENT_URLS","variable-value":"http://Project-data-s1-v1:2379"}
Jun 05 16:21:24.372370 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_DATA_DIR","variable-value":"/var/lib/etcd"}
Jun 05 16:21:24.372370 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ELECTION_TIMEOUT","variable-value":"5000"}
Jun 05 16:21:24.372370 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_HEARTBEAT_INTERVAL","variable-value":"1000"}
Jun 05 16:21:24.372960 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_ADVERTISE_PEER_URLS","variable-value":"http://Project-data-s1-v1:2380"}
Jun 05 16:21:24.372960 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER","variable-value":"Project-data-s1-v1=http://Project-data-s1-v1:2380,Project-data-s2-v1=http://Project-data-s2-v1:2380,Project-data-s3-v1=http://Project-data-s3-v1:2380"}
Jun 05 16:21:24.372960 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_STATE","variable-value":"new"}
Jun 05 16:21:24.372960 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_TOKEN","variable-value":"etcd-postgres-cluster"}
Jun 05 16:21:24.372960 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_ELECTION_TICK_ADVANCE","variable-value":"false"}
Jun 05 16:21:24.372960 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_LISTEN_CLIENT_URLS","variable-value":"http://Project-data-s1-v1:2379,http://127.0.0.1:2379"}
Jun 05 16:21:24.372960 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_LISTEN_PEER_URLS","variable-value":"http://Project-data-s1-v1:2380"}
Jun 05 16:21:24.372960 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_NAME","variable-value":"Project-data-s1-v1"}
Jun 05 16:21:24.372960 Project-data-s1-v1 bash[3723593]: {"level":"info","ts":"2023-06-05T16:21:24.372+0300","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["/usr/local/bin/etcd"]}
Jun 05 16:21:24.373301 Project-data-s1-v1 bash[3723593]: {"level":"warn","ts":"2023-06-05T16:21:24.372+0300","caller":"etcdmain/etcd.go:75","msg":"failed to verify flags","error":"expected IP in URL for binding (http://Project-data-s1-v1:2380)"}
Jun 05 16:21:24.374122 Project-data-s1-v1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jun 05 16:21:24.374550 Project-data-s1-v1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jun 05 16:21:24.375251 Project-data-s1-v1 systemd[1]: Failed to start Etcd Server.
Jun 05 16:21:24.597020 Project-data-s1-v1 systemd[1]: etcd.service: Scheduled restart job, restart counter is at 5.
Jun 05 16:21:24.597520 Project-data-s1-v1 systemd[1]: Stopped Etcd Server.
Jun 05 16:21:24.597877 Project-data-s1-v1 systemd[1]: etcd.service: Start request repeated too quickly.
Jun 05 16:21:24.598096 Project-data-s1-v1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jun 05 16:21:24.598478 Project-data-s1-v1 systemd[1]: Failed to start Etcd Server.

vitabaks commented 11 months ago

What is "Project-data-s1-v1"?

Please make sure that you have specified IP addresses in the inventory file.

The specified IP addresses will be used by the cluster components for listening.

Example:

[etcd_cluster]
10.128.64.140
10.128.64.142
10.128.64.143

Vikontrol commented 11 months ago

Damn, I have it like this:

Project-data-s1-v1 ansible_ssh_host=123.123.123.121
Project-data-s2-v1 ansible_ssh_host=123.123.123.122
Project-data-s3-v1 ansible_ssh_host=123.123.123.123

vitabaks commented 11 months ago

An IP address or domain name must be specified.

Vikontrol commented 11 months ago

Okay, but how do I set it up so that Ansible connects over the external IP while configuring everything with the internal IP?

vitabaks commented 11 months ago

Good question, I'll think about how to implement it. In the meantime, specify only the internal IP, and run Ansible from a server in the same private network.

Vikontrol commented 11 months ago

You use the address specified in the inventory in the configuration files, but this is inconvenient in cases where the configured hosts are reachable only via external IP addresses while the cluster hosts communicate over an internal network.

Vikontrol commented 11 months ago

> Good question, I'll think about how to implement it. In the meantime, specify only the internal IP, and run Ansible from a server in the same private network.

Thank you, you're cool

vitabaks commented 11 months ago

You can create a PR to improve this part.

olwe0002 commented 11 months ago

Better solution below.

A possible solution, for people who need to use external IPs in the inventory, could look like this:

1. Use server names in your inventory.
2. On your Ansible server, add the external IPs for the inventory server names to your local /etc/hosts.
3. Define the local IPs in the etc_hosts variable in system.yaml (sketched below, after the template).
4. Use this etcd.conf.j2 content:

ETCD_NAME="{{ ansible_hostname }}"
ETCD_LISTEN_CLIENT_URLS="http://{{ ansible_default_ipv4.address }}:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://{{ ansible_default_ipv4.address }}:2379"
ETCD_LISTEN_PEER_URLS="http://{{ ansible_default_ipv4.address }}:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://{{ ansible_default_ipv4.address }}:2380"
ETCD_INITIAL_CLUSTER_TOKEN="{{ etcd_cluster_name }}"
ETCD_INITIAL_CLUSTER="{% for host in groups['etcd_cluster'] %}{{ hostvars[host]['ansible_hostname'] }}=http://{{ hostvars[host]['ansible_default_ipv4']['address'] }}:2380{% if not loop.last %},{% endif %}{% endfor %}"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_DATA_DIR="{{ etcd_data_dir }}"
ETCD_ELECTION_TIMEOUT="5000"
ETCD_HEARTBEAT_INTERVAL="1000"
ETCD_INITIAL_ELECTION_TICK_ADVANCE="false"
ETCD_ENABLE_V2="true"
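
To illustrate steps 2 and 3 (all names and addresses here are made up, and the exact format of the etc_hosts variable may differ from this sketch): the control node's /etc/hosts maps the inventory server names to their external IPs, while etc_hosts carries the internal IPs the cluster nodes use to resolve each other.

# /etc/hosts on the Ansible control node (external IPs, hypothetical)
203.0.113.11 pgnode01
203.0.113.12 pgnode02
203.0.113.13 pgnode03

# etc_hosts in system.yaml (internal IPs, hypothetical)
etc_hosts:
  - "10.0.0.11 pgnode01"
  - "10.0.0.12 pgnode02"
  - "10.0.0.13 pgnode03"
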
vitabaks commented 11 months ago

Use of Internal and External IP Addresses in Ansible Inventory

It has been identified that there may be some confusion when it comes to using both internal and external IP addresses within the Ansible inventory. Here is some clarification:

In Ansible, the inventory_hostname represents the hostname within your configuration. This value can be referenced within your Ansible playbooks and roles. On the other hand, ansible_host is used to specify the IP address or domain name where Ansible should establish a connection to the remote host.

When setting these values in the format private_ip_address ansible_host=public_ip_address, Ansible will:

- use the private_ip_address internally within its playbooks and roles (the IP addresses specified as inventory_hostname will be used by the cluster components for listening), and
- connect to the host via the public_ip_address.

Example:

[etcd_cluster]
10.128.64.140 ansible_host=34.72.80.145
10.128.64.142 ansible_host=35.123.45.67
10.128.64.143 ansible_host=36.192.89.10

This configuration is useful when the cluster components need to communicate over internal IP addresses, but Ansible commands need to be run over the public IP address.
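
To sanity-check the mapping, an ad-hoc debug call (suggested here for illustration, not part of the repo) prints both values for each host:

ansible etcd_cluster -i inventory -m debug -a 'msg="listen on {{ inventory_hostname }}, connect via {{ ansible_host }}"'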

vitabaks commented 11 months ago

Using inventory_hostname was a simple and fast way to implement the listening settings of the cluster components on the specified network.

It may be worth abandoning this method in favor of a bind_address variable (similar to consul_bind_address) for the interface designated in an interface variable (similar to vip_interface or consul_iface).
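
For illustration, a template driven by an interface variable might look like the sketch below; etcd_iface is a hypothetical variable name, and this only outlines the idea rather than an implemented change:

{# derive the bind address from the interface named in etcd_iface (hypothetical variable) #}
ETCD_LISTEN_PEER_URLS="http://{{ hostvars[inventory_hostname]['ansible_' + etcd_iface]['ipv4']['address'] }}:2380"
ETCD_LISTEN_CLIENT_URLS="http://{{ hostvars[inventory_hostname]['ansible_' + etcd_iface]['ipv4']['address'] }}:2379,http://127.0.0.1:2379"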

olwe0002 commented 11 months ago

> Use of Internal and External IP Addresses in Ansible Inventory [...]

Thanks, I just tested this proposal, and it worked fine without any further configuration needed!

fatmaAliGamal commented 10 months ago

[etcd_cluster]
10.0.4.28 ansible_host=18.205.150.32
10.0.4.180 ansible_host=18.208.144.178
10.0.5.139 ansible_host=18.207.156.126

[master]
10.0.4.28 ansible_host=18.205.150.32

[replica]
10.0.4.180 ansible_host=18.208.144.178
10.0.5.139 ansible_host=18.207.156.126

But the error still appears:

TASK [sysctl : Setting kernel parameters] ***
fatal: [10.0.4.28]: FAILED! => {"msg": "Failed to connect to the host via ssh: "}
...ignoring
fatal: [10.0.4.180]: FAILED! => {"msg": "Failed to connect to the host via ssh: "}
...ignoring
fatal: [10.0.5.139]: FAILED! => {"msg": "Failed to connect to the host via ssh: "}
...ignoring

TASK [etcd : Make sure the unzip/tar packages are present] **
fatal: [10.0.4.28]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ", "unreachable": true}
fatal: [10.0.4.180]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ", "unreachable": true}
fatal: [10.0.5.139]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ", "unreachable": true}

NO MORE HOSTS LEFT **

PLAY RECAP **
10.0.4.180 : ok=6 changed=2 unreachable=1 failed=0 skipped=31 rescued=0 ignored=1
10.0.4.28  : ok=6 changed=2 unreachable=1 failed=0 skipped=31 rescued=0 ignored=1
10.0.5.139 : ok=6 changed=2 unreachable=1 failed=0 skipped=31 rescued=0 ignored=1

I use AWS EC2 with Ubuntu 22.04.
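
Since all three hosts fail with "Failed to connect to the host via ssh", a first check (a generic suggestion, not advice given in this thread; the key path and remote user below are placeholders, ubuntu being the usual default on Ubuntu EC2 images) is whether Ansible can reach the nodes at all:

ansible all -i inventory -m ping -u ubuntu --private-key ~/.ssh/your-key.pem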