kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0
16.17k stars · 6.48k forks

Kubespray fails to upgrade to 1.31.1 from v1.30.4 #11571

Closed · ubersol closed this 1 month ago

ubersol commented 1 month ago

What happened?

Executed: ansible-playbook -i inventory/mycluster/inventory.ini --become-user=root upgrade-cluster.yml | tee -a upgrade_cluster_to_1-31-1.txt to upgrade the cluster to 1.31.1 after changing the variable kube_version: v1.31.1 in inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml, and the upgrade failed with the following messages:

RUNNING HANDLER [kubernetes/node : Kubelet | restart kubelet] ******************
changed: [myapp10]
Wednesday 25 September 2024  04:15:45 -0500 (0:00:00.729)       0:05:18.234 ***

TASK [kubernetes/node : Enable kubelet] ****************************************
ok: [myapp10]
Wednesday 25 September 2024  04:15:45 -0500 (0:00:00.357)       0:05:18.592 ***

TASK [kubernetes/kubeadm_common : Kubeadm | Create directory to store kubeadm patches] ***
changed: [myapp10]
Wednesday 25 September 2024  04:15:46 -0500 (0:00:00.223)       0:05:18.816 ***

TASK [kubernetes/kubeadm_common : Kubeadm | Copy kubeadm patches from inventory files] ***
fatal: [myapp10]: FAILED! => {"msg": "Invalid data passed to 'loop', it requires a list, got this instead: {'enabled': False, 'source_dir': '/root/Ansible/kubespray/inventory/mycluster/patches', 'dest_dir': '/etc/kubernetes/patches'}. Hint: If you passed a list/dict of just one element, try adding wantlist=True to your lookup invocation or use q/query instead of lookup."}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
myapp10            : ok=390  changed=21   unreachable=0    failed=1    skipped=672  rescued=0    ignored=0
myapp11            : ok=276  changed=10   unreachable=0    failed=0    skipped=223  rescued=0    ignored=0
myapp12            : ok=276  changed=10   unreachable=0    failed=0    skipped=223  rescued=0    ignored=0
myapp13            : ok=234  changed=6    unreachable=0    failed=0    skipped=189  rescued=0    ignored=0
myapp14            : ok=234  changed=6    unreachable=0    failed=0    skipped=184  rescued=0    ignored=0
myapp15            : ok=234  changed=6    unreachable=0    failed=0    skipped=184  rescued=0    ignored=0

Wednesday 25 September 2024  04:15:46 -0500 (0:00:00.015)       0:05:18.831 ***
===============================================================================
kubernetes/preinstall : Install packages requirements ------------------ 75.32s
download : Download_file | Download item ------------------------------- 10.60s
download : Download_file | Download item ------------------------------- 10.43s
download : Download_file | Download item -------------------------------- 9.94s
download : Download_file | Download item -------------------------------- 8.91s
upgrade/pre-upgrade : Drain node ---------------------------------------- 7.30s
download : Download_file | Download item -------------------------------- 5.51s
download : Download_file | Download item -------------------------------- 5.22s
Gathering Facts --------------------------------------------------------- 4.97s
container-engine/containerd : Containerd | Unpack containerd archive ---- 3.84s
container-engine/containerd : Download_file | Download item ------------- 3.40s
container-engine/validate-container-engine : Populate service facts ----- 3.36s
container-engine/crictl : Download_file | Download item ----------------- 3.35s
container-engine/runc : Download_file | Download item ------------------- 3.30s
container-engine/nerdctl : Download_file | Download item ---------------- 3.29s
container-engine/runc : Runc | Uninstall runc package managed by package manager --- 3.12s
container-engine/containerd : Containerd | Remove any package manager controlled containerd package --- 3.12s
download : Extract_file | Unpacking archive ----------------------------- 2.99s
container-engine/crictl : Extract_file | Unpacking archive -------------- 2.89s
container-engine/nerdctl : Extract_file | Unpacking archive ------------- 2.71s

After this listing the nodes:

04:08:28 # kubectl get nodes
NAME              STATUS                     ROLES           AGE   VERSION
myapp10   Ready,SchedulingDisabled   control-plane   61d   v1.31.1
myapp11   Ready                      control-plane   61d   v1.30.4
myapp12   Ready                      control-plane   61d   v1.30.4
myapp13   Ready                      <none>          61d   v1.30.4
myapp14   Ready                      <none>          61d   v1.30.4
myapp15   Ready                      <none>          61d   v1.30.4
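For context on the failure: the error message itself shows the cause. The failing task feeds `kubeadm_patches` to Ansible's `loop`, which only accepts a list, but the variable is still the old dict form. A minimal sketch (a hypothetical standalone playbook, not Kubespray's actual task) that reproduces the same error:

```yaml
# Hypothetical playbook illustrating the failure mode, not Kubespray's real task.
- hosts: localhost
  gather_facts: false
  vars:
    # Old-style dict: `loop` rejects this with "Invalid data passed to 'loop'"
    kubeadm_patches:
      enabled: false
      source_dir: "/root/Ansible/kubespray/inventory/mycluster/patches"
      dest_dir: "/etc/kubernetes/patches"
  tasks:
    - name: Copy kubeadm patches (fails because the variable is a dict, not a list)
      debug:
        msg: "{{ item }}"
      loop: "{{ kubeadm_patches }}"
```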

What did you expect to happen?

I expected the upgrade process not to fail immediately and to finish.

How can we reproduce it (as minimally and precisely as possible)?

Change the kube_version inventory variable and upgrade from v1.30.4 to v1.31.1.

OS

04:31:09 # printf "$(uname-srm)\n$(cat /etc/os-release)\n"
-bash: uname-srm: command not found

NAME="Oracle Linux Server"
VERSION="8.10"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.10"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:10:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"
ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8"
ORACLE_BUGZILLA_PRODUCT_VERSION=8.10
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=8.10

Version of Ansible

04:30:54 # ansible --version
ansible [core 2.16.11]
  config file = /root/Ansible/kubespray/ansible.cfg
  configured module search path = ['/root/Ansible/kubespray/library']
  ansible python module location = /root/py311env/lib64/python3.11/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /root/py311env/bin/ansible
  python version = 3.11.9 (main, Jul  2 2024, 17:31:52) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22.0.1)] (/root/py311env/bin/python3.11)
  jinja version = 3.1.4
  libyaml = True

Version of Python

04:32:17 # python --version
Python 3.11.9

Version of Kubespray (commit)

15bb5b078

Network plugin used

flannel

Full inventory with variables

Unable to get this directly

Command used to invoke ansible

ansible-playbook -i inventory/mycluster/inventory.ini --become-user=root upgrade-cluster.yml | tee -a upgrade_cluster_to_1-31-1.txt

Output of ansible run

https://gist.github.com/ubersol/82ecde969062b03df65daa89fcccc32c

Anything else we need to know

No response

ubersol commented 1 month ago

I reran pip3 install -U -r requirements.txt just to make sure I have the latest Ansible version:

ansible [core 2.16.11]
  config file = /root/Ansible/kubespray/ansible.cfg
  configured module search path = ['/root/Ansible/kubespray/library']
  ansible python module location = /root/py311env/lib64/python3.11/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /root/py311env/bin/ansible
  python version = 3.11.9 (main, Jul  2 2024, 17:31:52) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22.0.1)] (/root/py311env/bin/python3.11)
  jinja version = 3.1.4
  libyaml = True

and reran the upgrade process; it failed at the exact same spot with the same message.

VannTen commented 1 month ago

The kubeadm_patches format changed in #11521; you need to adjust your inventory accordingly.
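Concretely, the change is that the variable's type moved from a dict to a list. A before/after sketch (the empty-list fix appears later in this thread; see the PR for the full schema of non-empty entries):

```yaml
# Before #11521: a dict, which the upgraded task's `loop` now rejects
kubeadm_patches:
  enabled: false
  source_dir: "{{ inventory_dir }}/patches"
  dest_dir: "{{ kube_config_dir }}/patches"

# After #11521: a list; leave it empty if you don't use kubeadm patches
kubeadm_patches: []
```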

ubersol commented 1 month ago

Hi @VannTen, thank you for the quick response, but I am not sure I understand the format change mentioned in #11521. Currently my kubespray/inventory/myapp/inventory.ini looks like this:

[all]
myapp10 ansible_host=7.40.11.52 ansible_python_interpreter=/usr/bin/python3.6 ansible_become=true
myapp11 ansible_host=7.40.10.227 ansible_become=true
myapp12 ansible_host=7.40.11.14 ansible_become=true
myapp13 ansible_host=7.40.11.3  ansible_become=true
myapp14 ansible_host=7.40.11.51 ansible_become=true
myapp15 ansible_host=7.40.11.46 ansible_become=true

[kube_control_plane]
myapp10
myapp11
myapp12

[etcd]
myapp10
myapp11
myapp12

[kube_node]
myapp13
myapp14
myapp15

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

and from https://github.com/kubernetes-sigs/kubespray/blob/master/inventory/sample/inventory.ini, it looks like an etcd_member_name variable was added, like:

etcd_member_name=

Do I also need to place this inside the [all] section like below ?

[all]
myapp10 ansible_host=7.40.11.52 ansible_python_interpreter=/usr/bin/python3.6 ansible_become=true etcd_member_name=etcd1
myapp11 ansible_host=7.40.10.227 ansible_become=true etcd_member_name=etcd2
myapp12 ansible_host=7.40.11.14 ansible_become=true etcd_member_name=etcd3
myapp13 ansible_host=7.40.11.3  ansible_become=true
myapp14 ansible_host=7.40.11.51 ansible_become=true
myapp15 ansible_host=7.40.11.46 ansible_become=true

Could you be so kind to clarify this a little bit? Also, is there a documentation about this change ( I am sorry I looked, but can't find it ) ?

ubersol commented 1 month ago

Hi @VannTen ,

I tried the following in the inventory file:

[all]
myapp10 ansible_host=7.40.11.52 ansible_python_interpreter=/usr/bin/python3.6 ansible_become=true etcd_member_name=etcd1
myapp11 ansible_host=7.40.10.227 ansible_become=true etcd_member_name=etcd2
myapp12 ansible_host=7.40.11.14 ansible_become=true etcd_member_name=etcd3
myapp13 ansible_host=7.40.11.3  ansible_become=true
myapp14 ansible_host=7.40.11.51 ansible_become=true
myapp15 ansible_host=7.40.11.46 ansible_become=true

but again, it failed with:

TASK [kubernetes/kubeadm_common : Kubeadm | Copy kubeadm patches from inventory files] ***
fatal: [myapp10]: FAILED! => {"msg": "Invalid data passed to 'loop', it requires a list, got this instead: {'enabled': False, 'source_dir': '/root/Ansible/kubespray/inventory/mycluster/patches', 'dest_dir': '/etc/kubernetes/patches'}. Hint: If you passed a list/dict of just one element, try adding wantlist=True to your lookup invocation or use q/query instead of lookup."}
VannTen commented 1 month ago

You copied the sample inventory some time ago. It contains a kubeadm_patches variable whose format has changed with the above PR. If you don't need it, just delete that variable; otherwise, adapt it to the new format.

The docs are linked in the PR release notes, not sure what more I can do :shrug:

VannTen commented 1 month ago

If you need further support, please go to the kubespray slack channel which is for that purpose, thanks !

/close
/kind support

k8s-ci-robot commented 1 month ago

@VannTen: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/kubespray/issues/11571#issuecomment-2376536805):

> If you need further support, please go to the kubespray slack channel which is for that purpose, thanks !
>
> /close
> /kind support

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
ubersol commented 1 month ago

> You copied the sample inventory some time ago. It contains a kubeadm_patches variable whose format has changed with the above PR. If you don't need it, just delete that variable; otherwise, adapt it to the new format.
>
> The docs are linked in the PR release notes, not sure what more I can do 🤷

Ah yes, that's right, I did copy the sample inventory back in v1.29 something... I misunderstood what you said about the inventory; I thought you meant the inventory file :). Thank you for your help, I greatly appreciate it. I edited the file inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml with the following changes before rerunning it:

# kubeadm patches path
kubeadm_patches_dir: "{{ kube_config_dir }}/patches"
kubeadm_patches: []
#kubeadm_patches:
#  enabled: false
#  source_dir: "{{ inventory_dir }}/patches"
#  dest_dir: "{{ kube_config_dir }}/patches"

which I think is what you meant! And lastly, just out of curiosity, is it best practice to copy the sample inventory directory each time there is a new Kubespray version and then apply the customizations? Thanks again for your help and time. I was able to upgrade without any issues.

VannTen commented 1 month ago

> And lastly, just out of curiosity, is it best practice to copy the sample inventory directory each time there is a new Kubespray version and then apply the customizations? Thanks again for your help and time.

Nope, the best practice is to put in your inventory only what you need, and nothing more, and never copy the sample inventory at all.
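To illustrate that advice, a pared-down inventory (a sketch reusing the hosts from this thread) declares only the groups and host variables actually in use, with nothing copied from the sample. The [calico_rr] group is omitted here since this cluster uses flannel:

```ini
; Minimal sketch: only what this cluster actually needs, nothing from the sample inventory.
[kube_control_plane]
myapp10 ansible_host=7.40.11.52 ansible_python_interpreter=/usr/bin/python3.6 ansible_become=true
myapp11 ansible_host=7.40.10.227 ansible_become=true
myapp12 ansible_host=7.40.11.14 ansible_become=true

[etcd]
myapp10
myapp11
myapp12

[kube_node]
myapp13 ansible_host=7.40.11.3  ansible_become=true
myapp14 ansible_host=7.40.11.51 ansible_become=true
myapp15 ansible_host=7.40.11.46 ansible_become=true

[k8s_cluster:children]
kube_control_plane
kube_node
```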

/rant on This is a sore point of our documentation, and it causes the same kind of misunderstanding you had quite regularly, unfortunately ^ see #10645 or #10697 /rant off