Unable to start service etcd

RodriMora commented 3 months ago

Hi!

I'm having this problem when trying to deploy the etcd cluster:

TASK [etcd : Enable and start etcd service] ********************************************************************************************************************************
fatal: [192.168.10.189]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xeu etcd.service\" for details.\n"}
fatal: [192.168.10.71]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xeu etcd.service\" for details.\n"}
fatal: [192.168.10.124]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xeu etcd.service\" for details.\n"}

Running "sudo journalctl -u etcd -n 100 --output=short-precise" on the nodes gives this output:

{"level":"fatal","ts":"2024-05-31T21:57:31.069579+0200","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"--initial-cluster has etcd=http://192.168.10.71:2380,etcd=http://192.168.10.124:2380 but missing from --initial-advertise-peer-urls=http://192.168.10.189:2380 (len([\"http://192.168.10.189:2380\"]) != len([\"http://192.168.10.124:2380\" \"http://192.168.10.189:2380\" \"http://192.168.10.71:2380\"]))","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"}

And these are the IPs in the inventory:

[etcd_cluster]  # recommendation: 3, or 5-7 nodes
192.168.10.189
192.168.10.71
192.168.10.124

What could be the cause?

vitabaks commented 3 months ago

That's strange. This is how the configuration should look:

https://github.com/vitabaks/postgresql_cluster/blob/master/roles/etcd/templates/etcd.conf.j2

Please provide an archive of your variables and inventory for analysis.

RodriMora commented 3 months ago

Fixed it. Seems like the hostnames need to be different. I added the hostname in the inventory and now it works:

[etcd_cluster]  # recommendation: 3, or 5-7 nodes
192.168.10.189 hostname=etcd1
192.168.10.71 hostname=etcd2
192.168.10.124 hostname=etcd3

It's not there by default, I might submit a PR to add it.

Edit: the servers were cloned to save time and they had the same hostname.
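
For anyone hitting the same error, here is a rough Python sketch of why identical hostnames break the check. It is not etcd's actual code, only an illustration under the assumption that etcd groups the peer URLs in --initial-cluster by member name and compares the local member's group with --initial-advertise-peer-urls. The helper name group_initial_cluster and the example strings are made up for the illustration; the member name "etcd" and the URLs come from the log above.

```python
from collections import defaultdict
from typing import Dict, List


def group_initial_cluster(initial_cluster: str) -> Dict[str, List[str]]:
    """Group peer URLs by member name (hypothetical helper, not part of etcd)."""
    members = defaultdict(list)
    for entry in initial_cluster.split(","):
        name, url = entry.split("=", 1)
        members[name.strip()].append(url.strip())
    return {name: sorted(urls) for name, urls in members.items()}


# Assumed value when every cloned node reports the same hostname "etcd".
cloned = (
    "etcd=http://192.168.10.189:2380,"
    "etcd=http://192.168.10.71:2380,"
    "etcd=http://192.168.10.124:2380"
)

# Assumed value after giving each node its own hostname.
unique = (
    "etcd1=http://192.168.10.189:2380,"
    "etcd2=http://192.168.10.71:2380,"
    "etcd3=http://192.168.10.124:2380"
)

# One name collects all three URLs, which cannot match a single advertise URL.
print(group_initial_cluster(cloned))
# Each name maps to exactly one URL.
print(group_initial_cluster(unique))
```

With the cloned hostnames the single name "etcd" ends up with three peer URLs, which lines up with the len(...) != len(...) comparison in the log.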

vitabaks commented 3 months ago

Make sure that the servers have different hostnames. You can specify the hostname variable for each node in the inventory to set a new hostname; it must be unique.
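
As a quick sanity check before running the playbook, something like the following Python sketch can flag repeated hostname values in the inventory lines. It is only a sketch: duplicate_hostnames is a hypothetical helper, it assumes the simple "IP hostname=name" line format shown in this thread, and it only looks at the inventory variable, not at the hostnames actually set on the servers.

```python
from collections import Counter
from typing import Iterable, List


def duplicate_hostnames(inventory_lines: Iterable[str]) -> List[str]:
    """Return hostname= values that appear more than once (hypothetical helper)."""
    names = []
    for line in inventory_lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("["):  # skip blanks and group headers
            continue
        for token in line.split()[1:]:  # tokens after the host address
            key, _, value = token.partition("=")
            if key == "hostname":
                names.append(value)
    return sorted(name for name, count in Counter(names).items() if count > 1)


inventory = [
    "[etcd_cluster]  # recommendation: 3, or 5-7 nodes",
    "192.168.10.189 hostname=etcd1",
    "192.168.10.71 hostname=etcd2",
    "192.168.10.124 hostname=etcd3",
]

# An empty list means no hostname value is repeated.
print(duplicate_hostnames(inventory))
```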

vitabaks / postgresql_cluster

Unable to start service etcd #670