kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster

fallback_ips.yml exits early when there is an unreachable host in the inventory #10993

Open Rickkwa opened 5 months ago

Rickkwa commented 5 months ago

What happened?

This is a continuation of #10313.

When roles/kubespray-defaults/tasks/fallback_ips.yml runs against an inventory that contains an unreachable host, it aborts the entire play right after the setup task with NO MORE HOSTS LEFT.
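For context, the task that fails is the delegated fact-gathering loop at the top of roles/kubespray-defaults/tasks/fallback_ips.yml. A rough reconstruction from the run output further down (the loop expression is paraphrased for illustration, not copied from the repo):

- name: Gather ansible_default_ipv4 from all hosts
  setup:
    gather_subset: '!all,network'
    filter: "ansible_default_ipv4"
  delegate_to: "{{ item }}"
  delegate_facts: yes
  run_once: yes
  ignore_unreachable: true
  # paraphrased: every host in the k8s_cluster, etcd, and calico_rr groups
  loop: "{{ ((groups['k8s_cluster'] | default([])) + (groups['etcd'] | default([])) + (groups['calico_rr'] | default([]))) | unique }}"

Because run_once makes a single host (k8s1.local here) iterate over every inventory member, one unreachable loop item turns that host's aggregated result into UNREACHABLE, and with the only executing host gone the play has no hosts left.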

What did you expect to happen?

I expected the entire kubespray-defaults role to finish running, but the play exits after that single task.

How can we reproduce it (as minimally and precisely as possible)?

Minimal inventory

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

[kube_control_plane]
k8s1.local  # reachable host

[etcd]
k8s1.local  # reachable host

[kube_node]
k8s3.local  # problematic unreachable host
k8s2.local  # reachable host

[calico_rr]

And then this minimal playbook

- name: Prepare nodes for upgrade
  hosts: k8s_cluster:etcd:calico_rr
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray-defaults }

Execute with ansible-playbook -i hosts.ini bug.yml
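
For comparison, the behavior can be sketched without Kubespray at all. A minimal hypothetical reproduction, assuming placeholder hosts up.local (reachable) and down.local (unreachable):

- name: Standalone reproduction (hypothetical host names)
  hosts: up.local
  gather_facts: false
  any_errors_fatal: true
  tasks:
    - name: Delegated fact gathering over all hosts
      setup:
        gather_subset: '!all,network'
        filter: "ansible_default_ipv4"
      delegate_to: "{{ item }}"
      delegate_facts: yes
      ignore_unreachable: true
      loop:
        - up.local    # reachable host
        - down.local  # unreachable host

Per this report, the delegated loop should still end the play early at the unreachable item, ignore_unreachable notwithstanding.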

OS

Linux 6.5.11-8-pve x86_64
NAME="AlmaLinux"
VERSION="9.3 (Shamrock Pampas Cat)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.3 (Shamrock Pampas Cat)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.3"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"

Version of Ansible

Tried both:

ansible [core 2.15.9]
  config file = /root/kubespray-test/kubespray/ansible.cfg
  configured module search path = ['/root/kubespray-test/kubespray/library']
  ansible python module location = /root/kubespray-test/venv-latest/lib64/python3.9/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /root/kubespray-test/venv-latest/bin/ansible
  python version = 3.9.18 (main, Jan  4 2024, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] (/root/kubespray-test/venv-latest/bin/python)
  jinja version = 3.1.3
  libyaml = True

and

ansible [core 2.14.14]
  config file = /root/kubespray-test/kubespray/ansible.cfg
  configured module search path = ['/root/kubespray-test/kubespray/library']
  ansible python module location = /root/kubespray-test/venv-2.14/lib64/python3.9/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /root/kubespray-test/venv-2.14/bin/ansible
  python version = 3.9.18 (main, Jan  4 2024, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] (/root/kubespray-test/venv-2.14/bin/python)
  jinja version = 3.1.3
  libyaml = True

Version of Python

Python 3.9.18

Version of Kubespray (commit)

66eaba377

Network plugin used

calico

Full inventory with variables

See "How can we reproduce it" section. Just that inventory, no variables.

Command used to invoke ansible

See "How can we reproduce it" section

Output of ansible run

PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************

TASK [kubespray-defaults : Gather ansible_default_ipv4 from all hosts] ******************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.29", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:41:88:12", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s1.local"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.30", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:be:42:a6", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s2.local"}]}
...ignoring

NO MORE HOSTS LEFT **********************************************************************************************************************************************************************************************

PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=1

Anything else we need to know

PR #10601 added ignore_unreachable: true to this task. That changed the Play Recap to report ignored=1 instead of unreachable=1, but it does not fix the underlying problem of the play exiting early.
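
One conceivable workaround (a sketch only, not a merged fix) is to probe TCP reachability from the Ansible controller first, since delegated connections originate there, and a failed wait_for is an ordinary failure that ignore_errors does swallow. The setup loop then delegates only to hosts that answered:

- name: Probe SSH reachability from the controller (hypothetical pre-check)
  wait_for:
    host: "{{ item }}"
    port: 22
    timeout: 5
  delegate_to: localhost
  loop: "{{ ansible_play_hosts_all }}"
  run_once: yes
  ignore_errors: true
  register: ssh_probe

- name: Gather ansible_default_ipv4 only from hosts that answered the probe
  setup:
    gather_subset: '!all,network'
    filter: "ansible_default_ipv4"
  delegate_to: "{{ item }}"
  delegate_facts: yes
  run_once: yes
  # registered loop results carry failed: true/false per item
  loop: "{{ ssh_probe.results | rejectattr('failed') | map(attribute='item') | list }}"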

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Rickkwa commented 2 months ago

/remove-lifecycle stale