kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster

scale.yml race condition causing calico networking to malfunction #10928

Closed Rickkwa closed 8 months ago

Rickkwa commented 8 months ago

What happened?

When running scale.yml, we are hitting a race condition where /opt/cni/bin/calico sometimes ends up owned by the root user and sometimes by the kube user.

Because calico sets the suid bit on this binary, it runs as its owner; when that owner is the kube user it lacks the permissions it needs, and pods are unable to schedule on this node.

-rwsr-xr-x 1 kube root 59136424 Jan 18 16:21 /opt/cni/bin/calico
#  ^ suid bit

Kubelet logs will then complain with errors such as:

Jan 19 14:31:54 myhostname kubelet[3077785]: E0119 14:31:54.400547 3077785 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"16bce6ca-50d7-48cf-86a9-6783044c43b9\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"854e397957fc263ee551570388b32f33a00fe808935d11800bde6f5805715b90\\\": plugin type=\\\"calico\\\" failed (delete): error loading config file \\\"/etc/cni/net.d/calico-kubeconfig\\\": open /etc/cni/net.d/calico-kubeconfig: permission denied\"" pod="jaeger/jaeger-agent-daemonset-bmvtq" podUID=16bce6ca-50d7-48cf-86a9-6783044c43b9

See "Anything else we need to know" section below for even more details and investigation.

What did you expect to happen?

Pods to be scheduling on the new node.

/opt/cni/bin/calico to be owned by root.

How can we reproduce it (as minimally and precisely as possible)?

Not exactly sure, since this is a race condition, but to experience the failure behavior you can run the following on a worker node:

chown kube /opt/cni/bin/calico
chmod 4755 /opt/cni/bin/calico

Then check kubelet logs while you try to do some cluster scheduling operations.
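
To confirm the broken state on the node itself, a quick check (a sketch assuming the default CNI paths, and that /etc/cni/net.d/calico-kubeconfig is root-owned with mode 600 as in my environment):

ls -l /opt/cni/bin/calico                           # shows -rwsr-xr-x ... kube ...
sudo -u kube cat /etc/cni/net.d/calico-kubeconfig   # fails with "Permission denied" when the kubeconfig is 600 root

To undo the simulated breakage afterwards (calico's CNI install normally leaves the binary root-owned with the suid bit):

chown root /opt/cni/bin/calico
chmod 4755 /opt/cni/bin/calico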

I tried adding a sleep just before the owner change, but that doesn't quite reproduce it. There is some other factor at play, I think related to the calico-node pod's startup process. I have a theory in the "Anything else we need to know" section below.

OS

Kubernetes worker:

Linux 5.18.15-1.el8.elrepo.x86_64 x86_64
NAME="Rocky Linux"
VERSION="8.5 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8.5:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"

Ansible node: Alpine 3.14.2 docker container

Version of Ansible

ansible [core 2.14.14]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /root/.local/lib/python3.9/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /root/.local/bin//ansible
  python version = 3.9.6 (default, Aug 27 2021, 23:46:59) [GCC 10.3.1 20210424] (/usr/local/bin/python3)
  jinja version = 3.1.3
  libyaml = True

Version of Python

Python 3.9.6

Version of Kubespray (commit)

3f6567bba01fffa53f256a85c0fa52f57dc88840 (aka v2.23.3)

Network plugin used

calico

Full inventory with variables

Vars for a worker node; scrubbed some stuff:

"addon_resizer_limits_cpu": "200m",
"addon_resizer_limits_memory": "50Mi",
"addon_resizer_requests_cpu": "100m",
"addon_resizer_requests_memory": "25Mi",
"apiserver_loadbalancer_domain_name": "test-cluster-api.test.example.com",
"argocd_apps_chart_version": "1.4.1",
"argocd_chart_version": "5.53.1",
"argocd_values_filename": "test-cluster-values.yaml",
"audit_log_maxage": 30,
"audit_log_maxbackups": 1,
"audit_log_maxsize": 100,
"audit_policy_file": "{{ kube_config_dir }}/audit-policy/apiserver-audit-policy.yaml",
"bin_dir": "/usr/local/bin",
"calico_apiserver_enabled": true,
"calico_felix_prometheusmetricsenabled": true,
"calico_ipip_mode": "Always",
"calico_iptables_backend": "Auto",
"calico_loglevel": "warning",
"calico_network_backend": "bird",
"calico_node_cpu_limit": "600m",
"calico_node_cpu_requests": "600m",
"calico_node_extra_envs": {
    "FELIX_MTUIFACEPATTERN": "^((en|wl|ww|sl|ib)[opsx].*|(eth|wlan|wwan|em|bond|p1|p2).*)"
},
"calico_node_memory_limit": "650Mi",
"calico_node_memory_requests": "650Mi",
"calico_policy_controller_cpu_limit": "300m",
"calico_policy_controller_cpu_requests": "300m",
"calico_policy_controller_memory_limit": "3000Mi",
"calico_policy_controller_memory_requests": "3000Mi",
"calico_pool_blocksize": 24,
"calico_vxlan_mode": "Never",
"cluster_name": "cluster.local",
"container_manager": "containerd",
"containerd_debug_level": "warn",
"containerd_extra_args": "[plugins.\"io.containerd.grpc.v1.cri\".registry.configs.\"regustry.example.com:5000\".auth]\n  auth = \"{{ scrubbed }}\"\n",
"containerd_registries": {
    "docker.io": "https://registry-1.docker.io"
},
"coredns_k8s_external_zone": "k8s_external.local",
"credentials_dir": "{{ inventory_dir }}/credentials",
"dashboard_enabled": false,
"default_kubelet_config_dir": "{{ kube_config_dir }}/dynamic_kubelet_dir",
"deploy_netchecker": false,
"dns_domain": "{{ cluster_name }}",
"dns_memory_limit": "250Mi",
"dns_memory_requests": "250Mi",
"dns_mode": "coredns",
"docker_image_repo": "{{ registry_host }}",
"drain_fallback_enabled": true,
"drain_fallback_grace_period": 0,
"drain_grace_period": 600,
"dynamic_kubelet_configuration": false,
"dynamic_kubelet_configuration_dir": "{{ kubelet_config_dir | default(default_kubelet_config_dir) }}",
"enable_coredns_k8s_endpoint_pod_names": false,
"enable_coredns_k8s_external": false,
"enable_ipv4_forwarding": true,
"enable_nodelocaldns": true,
"etcd_backup_retention_count": 5,
"etcd_data_dir": "/var/lib/etcd",
"etcd_deployment_type": "host",
"etcd_kubeadm_enabled": false,
"etcd_metrics": "extensive",
"event_ttl_duration": "1h0m0s",
"flush_iptables": false,
"gcr_image_repo": "{{ registry_host }}",
"github_image_repo": "{{ registry_host }}",
"group_names": [
    "k8s_cluster",
    "kube_node",
    "kubernetes_clusters"
],
"helm_deployment_type": "host",
"helm_enabled": true,
"inventory_hostname": "test-cluster-w-9.win.example3.com",
"inventory_hostname_short": "test-cluster-w-9",
"k8s_image_pull_policy": "IfNotPresent",
"kata_containers_enabled": false,
"kernel_devel_package": "kernel-ml-devel",
"kernel_headers_package": "kernel-ml-headers",
"kernel_package": "kernel-ml-5.18.15-1.el8.elrepo.x86_64",
"kernel_release": "5.18.15",
"kube_api_anonymous_auth": true,
"kube_api_pwd": "{{ lookup('password', credentials_dir + '/kube_user.creds length=15 chars=ascii_letters,digits') }}",
"kube_apiserver_insecure_port": 0,
"kube_apiserver_ip": "{{ kube_service_addresses|ipaddr('net')|ipaddr(1)|ipaddr('address') }}",
"kube_apiserver_port": 6443,
"kube_cert_dir": "{{ kube_config_dir }}/ssl",
"kube_cert_group": "kube-cert",
"kube_config_dir": "/etc/kubernetes",
"kube_encrypt_secret_data": false,
"kube_image_repo": "{{ registry_host }}",
"kube_log_level": 2,
"kube_manifest_dir": "{{ kube_config_dir }}/manifests",
"kube_network_node_prefix": 24,
"kube_network_plugin": "calico",
"kube_network_plugin_multus": false,
"kube_oidc_auth": true,
"kube_oidc_client_id": "test-cluster",
"kube_oidc_groups_claim": "groups",
"kube_oidc_url": "https://test-cluster-dex.test.example.com",
"kube_oidc_username_claim": "preferred_username",
"kube_oidc_username_prefix": "-",
"kube_pods_subnet": "10.233.64.0/18",
"kube_proxy_metrics_bind_address": "0.0.0.0:10249",
"kube_proxy_mode": "iptables",
"kube_proxy_nodeport_addresses": "{%- if kube_proxy_nodeport_addresses_cidr is defined -%} [{{ kube_proxy_nodeport_addresses_cidr }}] {%- else -%} [] {%- endif -%}",
"kube_proxy_strict_arp": false,
"kube_script_dir": "{{ bin_dir }}/kubernetes-scripts",
"kube_service_addresses": "10.233.0.0/18",
"kube_token_dir": "{{ kube_config_dir }}/tokens",
"kube_users": {
    "kube": {
        "groups": [
            "system:masters"
        ],
        "pass": "{{kube_api_pwd}}",
        "role": "admin"
    }
},
"kube_users_dir": "{{ kube_config_dir }}/users",
"kube_version": "v1.27.9",
"kubeadm_certificate_key": "{{ lookup('password', credentials_dir + '/kubeadm_certificate_key.creds length=64 chars=hexdigits') | lower }}",
"kubeadm_control_plane": false,
"kubelet_deployment_type": "host",
"kubelet_secure_addresses": "{%- for host in groups['kube_control_plane'] -%}\n  {{ hostvars[host]['ip'] | default(fallback_ips[host]) }}{{ ' ' if not loop.last else '' }}\n{%- endfor -%}",
"kubernetes_audit": true,
"loadbalancer_apiserver": {
    "address": "X.X.X.X",
    "port": 443
},
"local_release_dir": "/tmp/releases",
"metrics_server_cpu": "500m",
"metrics_server_enabled": true,
"metrics_server_limits_cpu": 1,
"metrics_server_limits_memory": "500Mi",
"metrics_server_memory": "300Mi",
"metrics_server_replicas": 2,
"metrics_server_requests_cpu": "500m",
"metrics_server_requests_memory": "300Mi",
"ndots": 2,
"nerdctl_enabled": true,
"networking_restart": false,
"node_labels": {
    "node-role.kubernetes.io/candidate-control-plane": "",
    "storage-node": "false",
    "topology.kubernetes.io/region": "XXX",
    "topology.kubernetes.io/zone": "XXX-YYY"
},
"nodelocaldns_cpu_requests": "100m",
"nodelocaldns_health_port": 9254,
"nodelocaldns_ip": "169.254.25.10",
"nodelocaldns_memory_limit": "200Mi",
"nodelocaldns_memory_requests": "200Mi",
"persistent_volumes_enabled": false,
"podsecuritypolicy_enabled": false,
"quay_image_repo": "{{ registry_host }}",
"reboot_timeout": 600,
"registry_host": "registry.example.com:5000",
"retry_stagger": 5,
"sealed_secrets_crt": "***********",
"sealed_secrets_ingress_class": "external",
"sealed_secrets_ingress_host": "test-cluster-sealed-secrets.test.example.com",
"sealed_secrets_key": "**********",
"skydns_server": "{{ kube_service_addresses|ipaddr('net')|ipaddr(3)|ipaddr('address') }}",
"skydns_server_secondary": "{{ kube_service_addresses|ipaddr('net')|ipaddr(4)|ipaddr('address') }}",
"ssl_client_cert_path": "/etc/pki/tls/certs/client.cert.pem",
"ssl_host_cert_path": "/etc/pki/tls/certs/host.cert.pem",
"ssl_host_key_path": "/etc/pki/tls/private/host.key.pem",
"ssl_root_cert_path": "/etc/pki/tls/certs/ca.cert.pem",
"upstream_dns_servers": [
    "8.8.8.8",
    "1.1.1.1"
],
"volume_cross_zone_attachment": false

Command used to invoke ansible

ansible-playbook -i /path/to/inventory/hosts.txt scale.yml -b --vault-password-file /path/to/vault/password --limit=$WORKER_NODE

Output of ansible run

I don't think it's relevant given the info I provided below.

Anything else we need to know

When the suid bit is set and the owner is kube, my understanding is that the binary will always run as the kube user. When that happens, it cannot read /etc/cni/net.d/calico-kubeconfig because of its 600 permissions.

I believe the issue stems from this play in scale.yml, specifically these two roles: kubernetes/kubeadm and network_plugin.

- name: Target only workers to get kubelet installed and checking in on any new nodes(network)
  #...
  roles:
    #...
    - { role: kubernetes/kubeadm, tags: kubeadm }
    #...
    - { role: network_plugin, tags: network }

The kubernetes/kubeadm role issues a kubeadm join command. Then, asynchronously, the calico-node pod starts running and creates /opt/cni/bin/calico, which doesn't exist yet.

Then, in parallel, network_plugin/cni/tasks/main.yml does a recursive owner change against all of /opt/cni/bin/, setting it to the kube user.
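
For illustration, the effect is roughly equivalent to a task like this (a paraphrased sketch, not the exact Kubespray task; the task name and module usage here are my own):

- name: CNI | Ensure /opt/cni/bin ownership (paraphrased sketch)
  ansible.builtin.file:
    path: /opt/cni/bin
    state: directory
    owner: "{{ kube_owner }}"
    recurse: true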

There is one more factor at play, I think, because the owner change would also clear the suid bit, yet in the failure scenario I'm seeing both the suid bit AND the kube owner.

Theory:

When the binary is being created, it is first written to a temp file (/opt/cni/bin/calico.tmp) to stage it. I'm thinking it's possible the owner change happens at this point, affecting the temp file. The file then gets renamed, followed by a chmod to set the suid bit (reference), and the owner stays kube. This would explain how both the suid bit and the kube owner are present at the same time.
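
Put as a timeline, the interleaving I suspect looks roughly like this (an illustrative guess, not an observed trace):

# 1. calico's CNI install writes /opt/cni/bin/calico.tmp      (owner: root)
# 2. the cni role's recursive chown of /opt/cni/bin runs      (calico.tmp -> owner: kube)
# 3. calico renames calico.tmp -> calico                      (owner stays kube)
# 4. calico chmods the binary to 4755                         (suid bit set, owner still kube)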

Proposal Fix:

Would it be reasonable to allow for the /opt/cni/bin owner to be overridden? Something like owner: "{{ cni_bin_owner | default(kube_owner) }}" (or define the default in defaults/main.yml)?
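
A rough sketch of what I have in mind (the file locations and the default shown are my assumption):

# roles/network_plugin/cni/defaults/main.yml
cni_bin_owner: "{{ kube_owner }}"

# the ownership task in roles/network_plugin/cni/tasks/main.yml would then use:
#   owner: "{{ cni_bin_owner }}"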

VannTen commented 8 months ago

Would it be reasonable to allow for the /opt/cni/bin owner to be overridden? Something like owner: "{{ cni_bin_owner | default(kube_owner) }}" (or define the default in defaults/main.yml)?

I'd rather fix the underlying problem. If that's indeed the race condition you describe, it could come back to bite us in surprising and hard-to-diagnose ways.

Rickkwa commented 8 months ago

I agree. Would that be in the calico plugin?

I created an XS-sized PR that allows this, to unblock me. There is also another use case for it in #10499. I'm hoping the PR can be merged while keeping this issue open.

lanss315425 commented 2 weeks ago

Hello, I am using v2.23.3 and also encountered this issue when adding a worker node using scale.yml. The cluster is currently in production and I do not want to change the version. Can I fix this problem by applying the code changes from https://github.com/kubernetes-sigs/kubespray/pull/10929? Could you please let me know how to proceed? Thank you. @Rickkwa

Rickkwa commented 2 weeks ago

@lanss315425 If your issue is indeed the same as mine, then you should be able to apply the patch from my PR and then use a group_var to set cni_bin_owner: root
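
For example, something like this in your inventory (the exact group_vars file doesn't matter as long as it applies to the affected nodes, and this assumes you've applied the patch from the PR):

# inventory/<your-cluster>/group_vars/k8s_cluster/k8s-cluster.yml
cni_bin_owner: root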