kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

Installation against Ubuntu 24.04 fails at `kubeadm token create` step #11626

Open spantaleev opened 1 day ago

spantaleev commented 1 day ago

What happened?

TASK [kubernetes/kubeadm : Create kubeadm token for joining nodes with 24h expiration (default)] *************************************************************************************************************************************
fatal: [worker-2 -> control-plane-0(131.186.62.86)]: FAILED! => {"changed": false, "cmd": ["/usr/local/bin/kubeadm", "token", "create"], "delta": "0:00:00.019550", "end": "2024-10-11 07:30:29.269963", "msg": "non-zero return code", "rc": 1, "start": "2024-10-11 07:30:29.250413", "stderr": "failed to load admin kubeconfig: open /root/.kube/config: no such file or directory\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["failed to load admin kubeconfig: open /root/.kube/config: no such file or directory", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "", "stdout_lines": []}

What did you expect to happen?

A successful completion of the run.

How can we reproduce it (as minimally and precisely as possible)?

I've used v2.26.0 (via the quay.io/kubespray/kubespray:v2.26.0 container image) against Ubuntu 24.04 Minimal hosts (x86_64) on Oracle Cloud.
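For reference, the container was started roughly like this, with the ansible-playbook command below run inside it (the paths are illustrative, not my exact ones):

docker run --rm -it \
  -v /path/to/inventory:/path/to/inventory \
  -v /path/to/ssh_key:/root/.ssh/id_rsa \
  quay.io/kubespray/kubespray:v2.26.0 bash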

OS

For the target nodes:

Linux 6.8.0-1013-oracle x86_64
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Version of Ansible

ansible [core 2.16.10]
  config file = /kubespray/ansible.cfg
  configured module search path = ['/kubespray/library']
  ansible python module location = /usr/local/lib/python3.10/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (/usr/bin/python3)
  jinja version = 3.1.4
  libyaml = True

Version of Python

Python 3.10.12

Version of Kubespray (commit)

v2.26.0

Network plugin used

calico

Full inventory with variables

Pretty much the sample inventory with a few overrides.

My hosts.yaml file looks like this:

all:
  hosts:
    control-plane-0:
      ansible_host: 1.1.1.1
      ip: 10.0.1.1
      access_ip: 10.0.1.1
      ansible_user: ubuntu
    worker-0:
      ansible_host: 2.2.2.2
      ip: 10.0.1.2
      access_ip: 10.0.1.2
      ansible_user: ubuntu
    worker-1:
      ansible_host: 3.3.3.3
      ip: 10.0.1.3
      access_ip: 10.0.1.3
      ansible_user: ubuntu
    worker-2:
      ansible_host: 4.4.4.4
      ip: 10.0.1.4
      access_ip: 10.0.1.4
      ansible_user: ubuntu
  children:
    kube_control_plane:
      hosts:
        control-plane-0:
    kube_node:
      hosts:
        worker-0:
        worker-1:
        worker-2:
    etcd:
      hosts:
        control-plane-0:
        worker-0:
        worker-1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Command used to invoke ansible

ansible-playbook -i /path/to/inventory/hosts.yaml --extra-vars @/path/to/vault.yaml --ask-vault-pass --become /kubespray/cluster.yml

Output of ansible run

https://gist.github.com/spantaleev/867461ebf1762fae3b51bf410ad148e0

Anything else we need to know

v2.26.0 with the same inventory configuration works flawlessly against Rocky Linux 9 hosts. I also manage another cluster (built on top of Ubuntu 22.04) with a very similar inventory and that one is also OK.

However, for some reason, targeting Ubuntu 24.04 fails.


The error seems to be coming from extra_playbooks/roles/kubernetes/kubeadm/tasks/main.yml. That said, extra_playbooks/roles/kubernetes/control-plane/tasks/kubeadm-setup.yml contains similar kubeadm token create tasks, which additionally pass --kubeconfig {{ kube_config_dir }}/admin.conf.
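Judging from the failed command in the output, the failing task boils down to something like this (a reconstruction from the error output, not a verbatim copy of the role):

- name: Create kubeadm token for joining nodes with 24h expiration (default)
  command: "{{ bin_dir }}/kubeadm token create"
  delegate_to: "{{ groups['kube_control_plane'] | first }}"

while the control-plane variant appears to be the same command with --kubeconfig {{ kube_config_dir }}/admin.conf appended.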

I am not sure I fully understand how things work, but it appears that kubeadm token create is invoked against the control-plane node before kubeadm init has had a chance to run, which may be causing the problem.
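One way to confirm that (just how I would verify it, not something from the playbooks) is to check on control-plane-0 for the artifacts kubeadm init normally produces:

ls -l /etc/kubernetes/admin.conf /etc/kubernetes/manifests/kube-apiserver.yaml
# if kubeadm init had already run, both files would exist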

In an effort to debug the problem, I have tried adding an untagged fail: msg="Stop" task at the beginning of extra_playbooks/roles/kubernetes/control-plane/tasks/kubeadm-setup.yml, but it doesn't seem to run at all. It looks like extra_playbooks/roles/kubernetes/kubeadm/tasks/main.yml runs, but extra_playbooks/roles/kubernetes/control-plane/tasks/kubeadm-setup.yml doesn't.
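For clarity, the task I added at the top of that file was simply:

- name: Stop
  fail:
    msg: Stop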


After this failure, re-running Kubespray hits different errors (seemingly related to the etcd cluster failing to form). Retrying again results in the same etcd errors.


I've tried master (currently b4768cfa9137b9e85fc786b6c5b0075d93ac2edb) locally (not via a container image) against brand new Ubuntu 24.04 hosts and it failed with another error:

TASK [kubernetes/kubeadm_common : Kubeadm | Copy kubeadm patches from inventory files] ***********************************************************************************************************************************************
fatal: [worker-2]: FAILED! => {"msg": "Invalid data passed to 'loop', it requires a list, got this instead: {'enabled': False, 'source_dir': '/kubespray-inventory/patches', 'dest_dir': '/etc/kubernetes/patches'}. Hint: If you passed a list/dict of just one element, try adding wantlist=True to your lookup invocation or use q/query instead of lookup."}
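From the error message, it looks like a dict (presumably the kubeadm_patches setting from the inventory) is being fed straight into loop, which expects a list. A minimal standalone playbook that reproduces the same Ansible error (hypothetical, just to illustrate the failure mode):

- hosts: localhost
  gather_facts: false
  vars:
    kubeadm_patches:
      enabled: false
      source_dir: /kubespray-inventory/patches
      dest_dir: /etc/kubernetes/patches
  tasks:
    - name: Kubeadm | Copy kubeadm patches from inventory files
      debug:
        msg: "{{ item }}"
      loop: "{{ kubeadm_patches }}"  # a dict, not a list -> "Invalid data passed to 'loop'"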

I suppose the master branch is currently broken in some other way.

tico88612 commented 1 day ago

The logs you provided contain many strange entries (e.g., etcd failed to start, the control-plane installation didn't run, etc.). CI also tests on Ubuntu 24.04, so this looks very much like an unstable network or a problem with the hosts' settings.
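If it is a network or host configuration issue, a quick sanity check is to verify that the nodes can reach each other on the relevant ports before re-running (illustrative, using the internal IPs from your inventory):

# from a worker node
nc -zv 10.0.1.1 6443   # kube-apiserver on control-plane-0
nc -zv 10.0.1.1 2379   # etcd client port
nc -zv 10.0.1.1 2380   # etcd peer port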

FYI (Ubuntu 24.04 CI test log): https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/8059956574/viewer

/remove-kind bug
/kind support