kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

etcd, modprobe_conntrack_module, kubectl fail #11340

Open dennisTGC opened 4 months ago

dennisTGC commented 4 months ago

What happened?

etcd:

fatal: [k8s-m01]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcd --version", "msg": "[Errno 2] No such file or directory: b'/usr/local/bin/etcd'", "rc": 2, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

modprobe_conntrack_module:

fatal: [k8s-m01]: FAILED! => {"msg": "The conditional check '(modprobe_conntrack_module|default({'rc': 1})).rc != 0' failed. The error was: error while evaluating conditional ((modprobe_conntrack_module|default({'rc': 1})).rc != 0): 'dict object' has no attribute 'rc'. 'dict object' has no attribute 'rc'\n\nThe error appears to be in '/home/ansible/git/kubespray/roles/kubernetes/node/tasks/main.yml': line 126, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Modprobe conntrack module\n  ^ here\n"}

kubectl:

fatal: [k8s-m02]: FAILED! => {"changed": false, "cmd": ["/usr/local/bin/kubectl", "get", "nodes", "--selector=node-role.kubernetes.io/control-plane", "-o", "json"], "delta": "0:00:00.049756", "end": "2024-07-01 13:58:35.818375", "msg": "non-zero return code", "rc": 1, "start": "2024-07-01 13:58:35.768619", "stderr": "E0701 13:58:35.811809   90009 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused\nE0701 13:58:35.812208   90009 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused\nE0701 13:58:35.813474   90009 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused\nE0701 13:58:35.813605   90009 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused\nE0701 13:58:35.814814   90009 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused\nThe connection to the server localhost:8080 was refused - did you specify the right host or port?", "stderr_lines": ["E0701 13:58:35.811809   90009 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused", "E0701 13:58:35.812208   90009 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused", "E0701 13:58:35.813474   90009 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused", "E0701 13:58:35.813605   90009 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused", "E0701 13:58:35.814814   90009 memcache.go:265] couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused", "The connection to the server localhost:8080 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}
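kubectl falls back to `http://localhost:8080` when it finds no kubeconfig, so the errors above may point to a missing or unreadable configuration for the user running the command rather than to the API server itself. A minimal sketch of the usual lookup order (the `KUBECONFIG` environment variable, then `~/.kube/config`) follows. The helper names `candidate_kubeconfigs` and `existing_kubeconfigs` are hypothetical, and the sketch ignores an explicit `--kubeconfig` flag.

```python
import os
from pathlib import Path


def candidate_kubeconfigs(env=None, home=None):
    """Return the kubeconfig paths kubectl would consult, in order.

    Assumption: no --kubeconfig flag is passed on the command line.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    explicit = env.get("KUBECONFIG", "")
    if explicit:
        # KUBECONFIG may hold several paths joined by the OS path separator.
        return [Path(p) for p in explicit.split(os.pathsep) if p]
    return [home / ".kube" / "config"]


def existing_kubeconfigs(env=None, home=None):
    """Keep only the candidates that exist as regular files."""
    return [p for p in candidate_kubeconfigs(env, home) if p.is_file()]
```

If `existing_kubeconfigs()` returns an empty list for the account Ansible runs as, that would be consistent with the `localhost:8080` fallback. On kubeadm-based control-plane nodes the admin kubeconfig is typically `/etc/kubernetes/admin.conf`, which is normally readable only by root.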

What did you expect to happen?

That the errors would not happen.

How can we reproduce it (as minimally and precisely as possible)?

yes:

OS

NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy\n: No such file or directory

Version of Ansible

ansible [core 2.16.8]
  config file = None
  configured module search path = ['/home/ansible/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/ansible/git/kubespray-venv/lib/python3.10/site-packages/ansible
  ansible collection location = /home/ansible/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/ansible/git/kubespray-venv/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/home/ansible/git/kubespray-venv/bin/python3)
  jinja version = 3.1.4
  libyaml = True

Version of Python

Python 3.10.12

Version of Kubespray (commit)

2e0008c3f

Network plugin used

calico

Full inventory with variables

[all]
k8s-m01 ip=10.0.0.11 etcd_member_name=etcd1
k8s-m02 ip=10.0.0.12 etcd_member_name=etcd2
k8s-m03 ip=10.0.0.13 etcd_member_name=etcd3
k8s-node101 ip=10.0.0.101
k8s-node102 ip=10.0.0.102
k8s-node103 ip=10.0.0.103

[kube_control_plane]
k8s-m01
k8s-m02
k8s-m03

[etcd]
k8s-m01
k8s-m02
k8s-m03

[kube_node]
k8s-node101
k8s-node102
k8s-node103

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

Command used to invoke ansible

/home/ansible/git/kubespray-venv/bin/ansible

Output of ansible run

..

Anything else we need to know

There are several earlier bug reports for these errors, and they appear to have been resolved. If so, the problem has probably been reintroduced somewhere.

Issues I found:

tico88612 commented 1 month ago

Looks like you didn't use become?

lemada01 commented 2 weeks ago

Same problem here, even with the become option. Command: ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root --user= --ask-become-pass cluster.yml

When I log in to the node, /usr/local/bin/etcd is there and working.

Any clue?

Thanks

dennisTGC commented 1 week ago

Not sure what went wrong here. I did, however, encounter this issue: https://github.com/kubernetes-sigs/kubespray/issues/11338. I can't reproduce it anymore.

So I'm closing this issue. @lemada01, I suggest you submit a new issue with all your details if you still encounter this. If not, let us know how you fixed it.

FilipeNas commented 1 week ago

@dennisTGC I am using RHEL 9.4 and had firewalld enabled by default. After disabling it, I was able to complete the installation. (Reference: "How do I disable firewalld and use nftables service"; I did not enable the nftables service.)

Kubespray also provides a playbook to disable the firewall service: ansible-playbook -e "{disable_service_firewall: true}" -i inventory/mycluster/inventory.ini --become --become-user=root contrib/os-services/os-services.yml

I found the error when I looked into the logs with journalctl -xeu etcd.service and saw that etcd couldn't communicate with the other nodes.
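A quick way to test for this kind of firewall blockage is to probe the etcd ports from one node to the others. The sketch below assumes etcd's default ports (2379 for clients, 2380 for peers); the function names `port_open` and `unreachable` are hypothetical. A successful TCP connection only shows that the port is reachable, not that etcd is healthy.

```python
import socket

ETCD_PORTS = (2379, 2380)  # etcd's default client and peer ports (assumed)


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable(hosts, ports=ETCD_PORTS, timeout=2.0):
    """List every (host, port) pair that could not be reached."""
    return [(h, p) for h in hosts for p in ports if not port_open(h, p, timeout)]
```

For the inventory in this issue, running `unreachable(["10.0.0.11", "10.0.0.12", "10.0.0.13"])` on each etcd node would list the member ports that a firewall (or anything else) is blocking from that node.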