kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster

Single node upgrades are broken by container_manager replacement logic #8609

Closed: cristicalin closed this issue 2 years ago

cristicalin commented 2 years ago

Environment:

Kubespray version (commit) (git rev-parse --short HEAD):

0fc453fe

Network plugin used:

calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

[all]
ubuntu-nuc-00.kaveman.intra ansible_connection=local local_as=64512

[all:vars]
upgrade_cluster_setup=True
force_certificate_regeneration=True
etcd_kubeadm_enabled=True
download_container=False
peer_with_router=False

[kube_control_plane]
ubuntu-nuc-00.kaveman.intra

[kube_control_plane:vars]

[etcd:children]
kube_control_plane

[kube_node]
ubuntu-nuc-00.kaveman.intra

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

[k8s_cluster:vars]
kube_version=v1.22.7
calico_version=v3.22.0
helm_enabled=True
metrics_server_enabled=True
ingress_nginx_enabled=True
cert_manager_enabled=False
metallb_enabled=True
metallb_speaker_enabled=False
metallb_protocol=bgp
metallb_controller_tolerations=[{'effect':'NoSchedule','key':'node-role.kubernetes.io/master'},{'effect':'NoSchedule','key':'node-role.kubernetes.io/control-plane'}]
metallb_ip_range=["10.5.0.0/16"]
kube_proxy_strict_arp=True
kube_encrypt_secret_data=True
container_manager=containerd
kubernetes_audit=True
calico_datastore="kdd"
calico_iptables_backend="NFT"
calico_advertise_cluster_ips=True
calico_felix_prometheusmetricsenabled=True
calico_ipip_mode=Never
calico_vxlan_mode=Never
calico_advertise_service_loadbalancer_ips=["10.5.0.0/16"]
calico_ip_auto_method="interface=eno1"
kube_network_plugin_multus=True
kata_containers_enabled=False
runc_version=v1.1.0
typha_enabled=True
nodelocaldns_external_zones=[{'cache': 30,'zones':['kaveman.intra'],'nameservers':['192.168.0.1']}]
nodelocaldns_bind_metrics_host_ip=True
csi_snapshot_controller_enabled=True
deploy_netchecker=True
krew_enabled=True

Command used to invoke ansible:

ansible-playbook -i ../inventory.ini cluster.yml -vvv

Output of ansible run:

The Ansible run breaks the single-node cluster by cordoning off the node, because the container-engine/validate-container-engine role calls the remove-node/pre-remove role on every run. There seems to be an issue in the container_manager detection logic that causes this role to trigger the removal action even though container_manager has not changed.
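
To illustrate what I would expect instead, here is a minimal sketch of the guard around that hand-off; detected_container_manager is a made-up variable name standing in for whatever the detection sets, so this is not the actual Kubespray task:

# Sketch only; "detected_container_manager" is a hypothetical variable name,
# not the one Kubespray's detection logic actually sets.
- name: Hand off to remove-node/pre-remove only when the manager really changed
  ansible.builtin.include_role:
    name: remove-node/pre-remove
  when:
    - detected_container_manager is defined
    - detected_container_manager != container_manager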

Anything else we need to know:

This container_manager replacement logic should be tested in CI to ensure it works properly and does not break existing deployments before we tag 2.19.

/cc @cyril-corbon

cristicalin commented 2 years ago

Upon further diagnosis, it looks like the container manager detection logic may be at fault when an old container_manager has not been properly cleaned up. In my case I had docker.service unit files lying around from a previous install. I think a check should also be performed to see whether the service is actually running.
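
Something along these lines is what I have in mind, using service facts; this is only an illustration of the idea, not a patch against the actual role:

# Illustration only, not Kubespray's actual tasks.
- name: Populate service facts
  ansible.builtin.service_facts:

- name: Treat docker as the previous manager only if its unit is actually running
  ansible.builtin.debug:
    msg: "docker.service is installed and running, treat it as the old container manager"
  when:
    - "'docker.service' in ansible_facts.services"
    - ansible_facts.services['docker.service'].state == 'running'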

Another symptom, which I was unable to reproduce, is that the cluster.yml run actually triggered a re-install of the Docker engine, which I cannot quite explain.