kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0
16.22k stars 6.5k forks source link

Unable to start service etcd: Job for etcd.service - resolved urls do not match #11560

Open Uncurlhalo opened 2 months ago

Uncurlhalo commented 2 months ago

What happened?

While attempting to stand up a k8s cluster with 3 control-plane nodes and 3 workers, the playbook failed when ensuring that etcd is running.

What did you expect to happen?

The playbook to successfully complete, etcd to be running on the control plane.

How can we reproduce it (as minimally and precisely as possible)?

Install kubespray via ansible-galaxy per documentation and create a playbook, install-cluster.yml with the following contents.

- name: Install Kubernetes
  ansible.builtin.import_playbook: kubernetes_sigs.kubespray.cluster

Create 6 VM's, provision each VM with one public IP and one private IP (in my use case public meaning public to my LAN and not a host only network on my hypervisor).

Define these IP's as show in my ansible inventory.

Execute the playbook with the command ansible-playbook -i invetory/invetory.ini --become --become-user=root cluster-install.yml

OS

Linux 6.10.9-200.fc40.x86_64 x86_64 NAME="Fedora Linux" VERSION="40 (Workstation Edition)" ID=fedora VERSION_ID=40 VERSION_CODENAME="" PLATFORM_ID="platform:f40" PRETTY_NAME="Fedora Linux 40 (Workstation Edition)" ANSI_COLOR="0;38;2;60;110;180" LOGO=fedora-logo-icon CPE_NAME="cpe:/o:fedoraproject:fedora:40" DEFAULT_HOSTNAME="fedora" HOME_URL="https://fedoraproject.org/" DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f40/system-administrators-guide/" SUPPORT_URL="https://ask.fedoraproject.org/" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Fedora" REDHAT_BUGZILLA_PRODUCT_VERSION=40 REDHAT_SUPPORT_PRODUCT="Fedora" REDHAT_SUPPORT_PRODUCT_VERSION=40 SUPPORT_END=2025-05-13 VARIANT="Workstation Edition" VARIANT_ID=workstation

Version of Ansible

ansible [core 2.16.10] config file = /etc/ansible/ansible.cfg configured module search path = ['/home/jmelton/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules'] ansible python module location = /usr/lib/python3.12/site-packages/ansible ansible collection location = /home/jmelton/.ansible/collections:/usr/share/ansible/collections executable location = /usr/bin/ansible python version = 3.12.5 (main, Aug 23 2024, 00:00:00) [GCC 14.2.1 20240801 (Red Hat 14.2.1-1)] (/usr/bin/python3) jinja version = 3.1.4 libyaml = True

Version of Python

Python 3.12.5

Version of Kubespray (commit)

89ff0710e

Network plugin used

calico

Full inventory with variables

[all] k8s-control-plane-0 ansible_host=192.168.1.200 ansible_user=k8s-node ip=192.168.1.200 access_ip=10.0.1.100 k8s-control-plane-1 ansible_host=192.168.1.201 ansible_user=k8s-node ip=192.168.1.201 access_ip=10.0.1.101 k8s-control-plane-2 ansible_host=192.168.1.202 ansible_user=k8s-node ip=192.168.1.202 access_ip=10.0.1.102 k8s-worker-node-0 ansible_host=192.168.1.210 ansible_user=k8s-node ip=192.168.1.210 access_ip=10.0.1.200 k8s-worker-node-1 ansible_host=192.168.1.211 ansible_user=k8s-node ip=192.168.1.211 access_ip=10.0.1.201 k8s-worker-node-2 ansible_host=192.168.1.212 ansible_user=k8s-node ip=192.168.1.212 access_ip=10.0.1.202

[kube_control_plane] k8s-control-plane-0 k8s-control-plane-1 k8s-control-plane-2

[etcd] k8s-control-plane-0 k8s-control-plane-1 k8s-control-plane-2

[kube_node] k8s-worker-node-0 k8s-worker-node-1 k8s-worker-node-2

[k8s_cluster:children] kube_node kube_control_plane

Command used to invoke ansible

ansible-playbook -i inventory/inventory.ini --become --become-user=root cluster-install.yml

Output of ansible run

TASK [kubernetes_sigs.kubespray.etcd : Configure | Ensure etcd is running] *****************************************************************************************************************************************************************************************************************************************************************************************
fatal: [k8s-control-plane-2]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xeu etcd.service\" for details.\n"}
fatal: [k8s-control-plane-1]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xeu etcd.service\" for details.\n"}
fatal: [k8s-control-plane-0]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xeu etcd.service\" for details.\n"}

PLAY RECAP *********************************************************************************************************************************************************************************************************************************************************************************************************************************************************
k8s-control-plane-0        : ok=414  changed=82   unreachable=0    failed=1    skipped=422  rescued=0    ignored=1   
k8s-control-plane-1        : ok=403  changed=79   unreachable=0    failed=1    skipped=403  rescued=0    ignored=1   
k8s-control-plane-2        : ok=403  changed=79   unreachable=0    failed=1    skipped=403  rescued=0    ignored=1   
k8s-worker-node-0          : ok=352  changed=56   unreachable=0    failed=0    skipped=434  rescued=0    ignored=0   
k8s-worker-node-1          : ok=327  changed=56   unreachable=0    failed=0    skipped=377  rescued=0    ignored=0   
k8s-worker-node-2          : ok=327  changed=56   unreachable=0    failed=0    skipped=377  rescued=0    ignored=0   

Anything else we need to know

A similar issue was opened here but the user never triaged the problem. Additionally relevant information from the control-plane nodes.

Output of journalctl -xeu etcd.service

Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706328-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ADVERTISE_CLIENT_URLS","variable-value":"https://10.0.1.100:2379"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706572-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_RETENTION","variable-value":"8"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706607-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_CERT_FILE","variable-value":"/etc/ssl/etcd/ssl/member-k8s-control-plane-0.pem"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706627-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_CLIENT_CERT_AUTH","variable-value":"true"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706653-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_DATA_DIR","variable-value":"/var/lib/etcd"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706676-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ELECTION_TIMEOUT","variable-value":"5000"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706696-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ENABLE_V2","variable-value":"true"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706732-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_EXPERIMENTAL_INITIAL_CORRUPT_CHECK","variable-value":"True"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706761-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_EXPERIMENTAL_WATCH_PROGRESS_NOTIFY_INTERVAL","variable-value":"5s"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706785-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_HEARTBEAT_INTERVAL","variable-value":"250"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.70681-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_ADVERTISE_PEER_URLS","variable-value":"https://10.0.1.100:2380"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706827-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER","variable-value":"etcd1=https://192.168.1.200:2380,etcd2=https://192.168.1.201:2380,etcd3=https://192.168.1.202:2380"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706845-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_STATE","variable-value":"new"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706868-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_TOKEN","variable-value":"k8s_etcd"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706886-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_KEY_FILE","variable-value":"/etc/ssl/etcd/ssl/member-k8s-control-plane-0-key.pem"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706923-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_LISTEN_CLIENT_URLS","variable-value":"https://192.168.1.200:2379,https://127.0.0.1:2379"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706948-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_LISTEN_PEER_URLS","variable-value":"https://192.168.1.200:2380"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.706991-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_METRICS","variable-value":"basic"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707008-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_NAME","variable-value":"etcd1"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707028-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_PEER_CERT_FILE","variable-value":"/etc/ssl/etcd/ssl/member-k8s-control-plane-0.pem"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707045-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_PEER_CLIENT_CERT_AUTH","variable-value":"True"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707067-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_PEER_KEY_FILE","variable-value":"/etc/ssl/etcd/ssl/member-k8s-control-plane-0-key.pem"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707083-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_PEER_TRUSTED_CA_FILE","variable-value":"/etc/ssl/etcd/ssl/ca.pem"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707102-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_PROXY","variable-value":"off"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707163-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_SNAPSHOT_COUNT","variable-value":"10000"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707191-0400","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_TRUSTED_CA_FILE","variable-value":"/etc/ssl/etcd/ssl/ca.pem"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"warn","ts":"2024-09-21T13:09:42.7074-0400","caller":"embed/config.go:679","msg":"Running http and grpc server on single port. This is not recommended for production."}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707472-0400","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["/usr/local/bin/etcd"]}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707648-0400","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"warn","ts":"2024-09-21T13:09:42.70771-0400","caller":"embed/config.go:679","msg":"Running http and grpc server on single port. This is not recommended for production."}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707734-0400","caller":"embed/etcd.go:127","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.1.200:2380"]}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.707794-0400","caller":"embed/etcd.go:494","msg":"starting with peer TLS","tls-info":"cert = /etc/ssl/etcd/ssl/member-k8s-control-plane-0.pem, key = /etc/ssl/etcd/ssl/member-k8s-control-plane-0-key.pem, client-cert=, client-key=, trusted-ca = /etc/ssl/etcd/ssl/ca.pem, client-cert-auth = true, crl-file = ","cipher-suites":[]}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.709855-0400","caller":"embed/etcd.go:135","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.1.200:2379"]}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.710197-0400","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.12","git-sha":"e7b3bb6cc","go-version":"go1.20.13","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":false,"name":"etcd1","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"250ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":10000,"max-wals":5,"max-snapshots":5,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://10.0.1.100:2380"],"listen-peer-urls":["https://192.168.1.200:2380"],"advertise-client-urls":["https://10.0.1.100:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.1.200:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"etcd1=https://192.168.1.200:2380,etcd2=https://192.168.1.201:2380,etcd3=https://192.168.1.202:2380","initial-cluster-state":"new","initial-cluster-token":"k8s_etcd","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"8h0m0s","auto-compaction-interval":"8h0m0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.713733-0400","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"3.030533ms"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.715255-0400","caller":"netutil/netutil.go:112","msg":"resolved URL Host","url":"https://10.0.1.100:2380","host":"10.0.1.100:2380","resolved-addr":"10.0.1.100:2380"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.715354-0400","caller":"netutil/netutil.go:112","msg":"resolved URL Host","url":"https://192.168.1.200:2380","host":"192.168.1.200:2380","resolved-addr":"192.168.1.200:2380"}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.722485-0400","caller":"embed/etcd.go:375","msg":"closing etcd server","name":"etcd1","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://10.0.1.100:2380"],"advertise-client-urls":["https://10.0.1.100:2379"]}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"info","ts":"2024-09-21T13:09:42.722713-0400","caller":"embed/etcd.go:377","msg":"closed etcd server","name":"etcd1","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://10.0.1.100:2380"],"advertise-client-urls":["https://10.0.1.100:2379"]}
Sep 21 13:09:42 k8s-control-plane-0 etcd[11167]: {"level":"fatal","ts":"2024-09-21T13:09:42.722785-0400","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"--initial-cluster has etcd1=https://192.168.1.200:2380 but missing from --initial-advertise-peer-urls=https://10.0.1.100:2380 (resolved urls: \"https://10.0.1.100:2380\" != \"https://192.168.1.200:2380\")","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"}
Sep 21 13:09:42 k8s-control-plane-0 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE

The only thing of note seems to be resolved urls: \"https://10.0.1.100:2380\" != \"https://192.168.1.200:2380\. I'm going to attempt recreation of the cluster with only the public-ip's defined via ansible-host, but I'm pretty sure that was working fine the prior day.

Uncurlhalo commented 2 months ago

Just an update, if the cluster is created without access_ip defined everything works as expected.

pavlozt commented 1 month ago

I have the same problems when mixed using internal ip and real access_ip in cloud environment

kubespray commit 0b64ab1 ( tag v2.24.3) ansible core 2.15.12 python 3.11.3