kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster

OpenStack Cloud Provider Initialization Failure Due to DNSPolicy in DaemonSet Template #10914

Closed: kolovo closed this issue 3 months ago

kolovo commented 6 months ago

What happened?

When deploying Kubernetes using Kubespray with OpenStack as the external cloud provider, the cloud provider initialization fails with the following error:

```
W0212 09:05:21.997886 1 openstack.go:173] New openstack client created failed with config: Post "https://<redacted>:5000/v3/auth/tokens": dial tcp: lookup <redacted> on 10.233.0.3:53: write udp 10.233.0.3:48927->10.233.0.3:53: write: operation not permitted
F0212 09:05:21.998071 1 main.go:84] Cloud provider could not be initialized: could not init cloud provider "openstack": Post "https://<redacted>:5000/v3/auth/tokens": dial tcp: lookup <redacted> on 10.233.0.3:53: write udp 10.233.0.3:48927->10.233.0.3:53: write: operation not permitted
```

This issue appears to be a DNS resolution failure when the OpenStack cloud provider attempts to authenticate with the OpenStack API. The problem is linked to the dnsPolicy setting introduced in commit c440106 (link to the commit) in the external-openstack-cloud-controller-manager-ds.yml.j2 template (direct link to the affected line). That setting creates a circular dependency: the CoreDNS pod cannot be scheduled because of the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint, yet the cloud controller manager, which is responsible for removing that taint, now depends on CoreDNS for DNS resolution and therefore fails to initialize.
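For reference, the taint in question is the standard one applied by the kubelet when it runs with an external cloud provider; the cloud controller manager is expected to remove it once it initializes the node:

```yaml
# Relevant fragment of the Node spec: this taint is set by the kubelet when
# it starts with --cloud-provider=external, and is removed by the cloud
# controller manager once node initialization succeeds.
spec:
  taints:
    - key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"
      effect: NoSchedule
```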

Notably, this dnsPolicy setting does not match the default configurations in the official OpenStack cloud provider repository, neither in the Helm chart (link to chart) nor in the plain manifests (link to plain manifest).

What did you expect to happen?

I expected the OpenStack cloud provider to initialize successfully, without DNS resolution issues. The official configurations from the OpenStack cloud provider repository do not specify a dnsPolicy, allowing the pod to inherit the DNS settings of the host, which avoids this initialization problem.
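To illustrate the difference, here is a hypothetical excerpt of such a DaemonSet pod spec (a sketch, not the actual Kubespray template; it assumes the controller runs with hostNetwork: true, as in the official manifests). For a host-network pod, leaving dnsPolicy unset means name resolution goes through the host's /etc/resolv.conf, whereas ClusterFirstWithHostNet routes lookups through cluster DNS, i.e. CoreDNS:

```yaml
# Hypothetical excerpt of a cloud-controller-manager DaemonSet pod spec;
# the container name and image below are placeholders.
spec:
  template:
    spec:
      hostNetwork: true
      # Left unset, dnsPolicy defaults to ClusterFirst, which for a
      # host-network pod falls back to "Default" behavior: the pod resolves
      # names via the host's /etc/resolv.conf and does not depend on CoreDNS.
      # dnsPolicy: ClusterFirstWithHostNet   # would instead force lookups
      #                                      # through cluster DNS (CoreDNS)
      containers:
        - name: openstack-cloud-controller-manager
          image: example.invalid/openstack-cloud-controller-manager:latest  # placeholder
```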

How can we reproduce it (as minimally and precisely as possible)?

  1. Deploy a Kubernetes cluster using Kubespray with the OpenStack external cloud provider enabled (see the minimal variable sketch after this list).
  2. Observe that the OpenStack cloud controller manager fails to start, with logs indicating DNS resolution failures similar to the ones above.
  3. Note the presence of the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint, which prevents CoreDNS, on which the cloud controller now depends for DNS resolution, from being scheduled.
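The variables below, taken from the full inventory further down, are the ones that enable the external OpenStack cloud provider and therefore render the affected DaemonSet template:

```yaml
# Minimal inventory variables involved in this report (from all.yml below).
cloud_provider: external
external_cloud_provider: openstack
```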

OS

uname -srm

Linux 5.15.0-69-generic x86_64

cat /etc/os-release

PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Version of Ansible

ansible [core 2.15.9]
  config file = /home/ansible/kubespray_2240/ansible.cfg
  configured module search path = ['/home/ansible/kubespray_2240/library']
  ansible python module location = /home/ansible/python_venvs/kubespray_2231/lib/python3.10/site-packages/ansible
  ansible collection location = /home/ansible/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/ansible/python_venvs/kubespray_2231/bin/ansible
  python version = 3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0] (/home/ansible/python_venvs/kubespray_2231/bin/python3)
  jinja version = 3.1.2
  libyaml = True

Version of Python

Python 3.10.13

Version of Kubespray (commit)

64447e745

Network plugin used

cilium

Full inventory with variables

### addons.yml ###
metrics_server_enabled: true
metrics_server_replicas: 3
ingress_nginx_enabled: false

### etcd.yml ###
etcd_deployment_type: kubeadm

### k8s-cluster.yml ###
kube_version: v1.28.6
kube_network_plugin: cilium
enable_nodelocaldns: false
kubeconfig_localhost: true
supplementary_addresses_in_ssl_keys: ["redacted"]
# kube_proxy_remove: false

### all.yml ###
cloud_provider: external
external_cloud_provider: openstack

### openstack.yml ###
cinder_csi_enabled: true
cinder_topology: true
cinder_csi_ignore_volume_az: true

# kube_feature_gates:
# - CSIMigration=true
# - CSIMigrationOpenStack=true
# - ExpandCSIVolumes=true

external_openstack_lbaas_enabled: true
external_openstack_lbaas_floating_network_id: "88fbc66b-4946-469c-9848-8725d5014682"
#external_openstack_lbaas_floating_subnet_id: "Neutron subnet ID to get floating IP from"
external_openstack_lbaas_method: ROUND_ROBIN
external_openstack_lbaas_provider: amphora
external_openstack_lbaas_subnet_id: "6cf12127-41c1-4753-b61b-18a7d0098bf4"
#external_openstack_lbaas_network_id: "c896b852-21f4-472e-8dc4-fb3bf62b96bc"
external_openstack_lbaas_manage_security_groups: false
external_openstack_lbaas_create_monitor: true
external_openstack_lbaas_monitor_delay: '5s'
external_openstack_lbaas_monitor_max_retries: 1
external_openstack_lbaas_monitor_timeout: '3s'
external_openstack_lbaas_internal_lb: false

override_system_hostname: false

### k8s-net-cilium.yml ###
cilium_version: "v1.13.3"
cilium_cpu_limit: 1000m
cilium_memory_limit: 2000M
cilium_cpu_requests: 500m
cilium_memory_requests: 500M
cilium_enable_hubble: true
cilium_enable_hubble_metrics: true
cilium_hubble_metrics:
- dns
- drop
- tcp
- flow
- icmp
- http
cilium_hubble_install: true
cilium_hubble_tls_generate: true

The inventory itself was not modified; the default dynamic inventory script provided in contrib/terraform/terraform.py was used.

Command used to invoke ansible

ansible-playbook cluster.yml --become -i inventory/$K8S_CLUSTER_NAME/tf_state_kubespray.py -e @inventory/$K8S_CLUSTER_NAME/$K8S_CLUSTER_NAME.yaml -e @inventory/$K8S_CLUSTER_NAME/no_floating.yml -e "ansible_ssh_private_key_file=/home/ansible/keys/generic_vm_id_rsa" -e external_openstack_lbaas_floating_network_id=$KUBESPRAY_FLOATING_NETWORK_ID -e external_openstack_lbaas_subnet_id=$KUBESPRAY_PRIVATE_SUBNET_ID

Output of ansible run

PLAY RECAP *****
localhost : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
phnx-demo1-k8s-bastion-1 : ok=6 changed=1 unreachable=0 failed=0 skipped=4 rescued=0 ignored=0
phnx-demo1-k8s-k8s-master-1 : ok=736 changed=132 unreachable=0 failed=0 skipped=1135 rescued=0 ignored=2
phnx-demo1-k8s-k8s-master-2 : ok=668 changed=115 unreachable=0 failed=0 skipped=1039 rescued=0 ignored=2
phnx-demo1-k8s-k8s-master-3 : ok=670 changed=116 unreachable=0 failed=0 skipped=1037 rescued=0 ignored=2
phnx-demo1-k8s-k8s-node-worker-1 : ok=659 changed=103 unreachable=0 failed=0 skipped=787 rescued=0 ignored=1
phnx-demo1-k8s-k8s-node-worker-2 : ok=659 changed=103 unreachable=0 failed=0 skipped=777 rescued=0 ignored=1
phnx-demo1-k8s-k8s-node-worker-3 : ok=659 changed=103 unreachable=0 failed=0 skipped=777 rescued=0 ignored=1

Monday 12 February 2024 11:17:45 +0000 (0:00:01.315) 0:28:38.640 ***
===============================================================================
kubernetes/control-plane : Joining control plane node to the cluster. ------------------------------------------------------ 77.33s
container-engine/containerd : Download_file | Download item ---------------------------------------------------------------- 24.70s
kubernetes/preinstall : Update package management cache (APT) -------------------------------------------------------------- 23.79s
download : Download_container | Download image if required ----------------------------------------------------------------- 23.19s
network_plugin/cilium : Cilium | Wait for pods to run ---------------------------------------------------------------------- 21.59s
container-engine/crictl : Download_file | Download item -------------------------------------------------------------------- 21.21s
container-engine/runc : Download_file | Download item ---------------------------------------------------------------------- 20.93s
container-engine/nerdctl : Download_file | Download item ------------------------------------------------------------------- 20.56s
kubernetes/kubeadm : Join to cluster --------------------------------------------------------------------------------------- 19.61s
container-engine/crictl : Extract_file | Unpacking archive ----------------------------------------------------------------- 19.15s
kubernetes/preinstall : Install packages requirements ---------------------------------------------------------------------- 18.59s
container-engine/nerdctl : Download_file | Validate mirrors ---------------------------------------------------------------- 16.20s
container-engine/nerdctl : Extract_file | Unpacking archive ---------------------------------------------------------------- 14.19s
kubernetes/control-plane : Kubeadm | Initialize first master --------------------------------------------------------------- 13.76s
container-engine/containerd : Download_file | Validate mirrors ------------------------------------------------------------- 13.29s
container-engine/crictl : Download_file | Validate mirrors ----------------------------------------------------------------- 12.36s
container-engine/runc : Download_file | Validate mirrors ------------------------------------------------------------------- 12.24s
etcdctl_etcdutl : Download_file | Download item ---------------------------------------------------------------------------- 11.08s
network_plugin/cilium : Cilium | Create Cilium node manifests -------------------------------------------------------------- 10.79s
download : Download_container | Download image if required ----------------------------------------------------------------- 10.77s

Anything else we need to know

Removing the dnsPolicy parameter from the external-openstack-cloud-controller-manager-ds.yml.j2 template lets the OpenStack cloud controller pod resolve DNS queries using the host's DNS settings; with that change, the cloud controller manager starts without errors.
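As a sketch only (the actual dnsPolicy value is assumed here, since the issue does not quote the template), the change amounts to deleting one line from the DaemonSet template:

```yaml
# external-openstack-cloud-controller-manager-ds.yml.j2, heavily abbreviated.
spec:
  template:
    spec:
      hostNetwork: true
      # dnsPolicy: ClusterFirstWithHostNet   # <- the line removed as the fix
      #                                      #    (value assumed, not quoted)
```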

It may be beneficial to align Kubespray's configuration with the official OpenStack cloud provider templates by not specifying a dnsPolicy unless necessary, to prevent such issues in future deployments.

Payback159 commented 6 months ago

Hello @kolovo ,

Can you also post the values of upstream_dns_servers and resolvconf_mode?

We have set resolvconf_mode: host_resolvconf in our cluster and configured additional upstream DNS servers, and cluster provisioning works without any problems.
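For context, those variables would look something like this in the group vars (the server addresses below are illustrative placeholders, not values from either setup):

```yaml
# Kubespray group vars; resolvconf_mode and upstream_dns_servers are the
# variables referenced above. The IPs are placeholders (TEST-NET-3 range).
resolvconf_mode: host_resolvconf
upstream_dns_servers:
  - 203.0.113.10
  - 203.0.113.11
```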

Maybe I can recreate your setup and find the cause.

kolovo commented 6 months ago

Hello @Payback159

Thank you for your reply. I'll share the information soon. However, I am not overriding any parameters other than those mentioned above. If extra configuration is required for these changes to work, it should be documented. Before these changes, or if I remove them manually, or even when using the official manifests directly, everything works fine. Nevertheless, I suggest that changes like this one be made directly in the upstream OpenStack Cloud Controller repository (link), as it is the primary source. The modifications made in the Kubespray repo for the OpenStack Cloud Controller Manager should align with the official repository to maintain consistency.

Regards

tico88612 commented 3 months ago

I have the same problem and agree with @kolovo. This should be consistent with the OpenStack Cloud Controller Manager official repository settings.

UPD: #kubespray-dev discussion