kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster

Create kubeadm token for joining nodes with 24h expiration (default) Fails #9907

Closed sbbroot closed 7 months ago

sbbroot commented 1 year ago

Environment:

- **OS**:

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-8" ALMALINUX_MANTISBT_PROJECT_VERSION="8.5"

- **Version of Ansible** (`ansible --version`):

```
ansible [core 2.12.5]
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.8/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
  jinja version = 2.11.3
  libyaml = True
```

- **Version of Python** (`python --version`):

Python 3.8.10


**Kubespray version (commit) (`git rev-parse --short HEAD`):**
[1bed6c1dfefe](https://quay.io/repository/kubespray/kubespray/manifest/sha256:1bed6c1dfefed36951f10dcfb73f4db7bef83087a3bba3f57117fca44424329b)

**Network plugin used**:

Calico

**Full inventory with variables (`ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"`):**

```yaml
all:
  hosts:
    node1:
      ansible_host: 192.168.122.211
      ip: 192.168.122.211
      access_ip: 192.168.122.211
    node2:
      ansible_host: 192.168.122.212
      ip: 192.168.122.212
      access_ip: 192.168.122.212
    node3:
      ansible_host: 192.168.122.213
      ip: 192.168.122.213
      access_ip: 192.168.122.213
  children:
    kube_control_plane:
      hosts:
        node1:
        node2:
    kube_node:
      hosts:
        node1:
        node2:
        node3:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}
```

**Command used to invoke ansible**:

ansible-playbook -i inventory/onprem/hosts.yaml --become --user=root --become-user=root cluster.yml


**Output of ansible run**:

```
TASK [kubernetes/control-plane : kubeadm | Copy kubeadm patches from inventory files] *****
skipping: [node1]
skipping: [node2]

TASK [kubernetes/control-plane : kubeadm | Initialize first master] ***
skipping: [node1]
skipping: [node2]

TASK [kubernetes/control-plane : set kubeadm certificate key] *****

TASK [kubernetes/control-plane : Create hardcoded kubeadm token for joining nodes with 24h expiration (if defined)] ***
skipping: [node1]
skipping: [node2]

TASK [kubernetes/control-plane : Create kubeadm token for joining nodes with 24h expiration (default)] ****
FAILED - RETRYING: [node1]: Create kubeadm token for joining nodes with 24h expiration (default) (5 retries left).
FAILED - RETRYING: [node2 -> node1]: Create kubeadm token for joining nodes with 24h expiration (default) (5 retries left).
FAILED - RETRYING: [node2 -> node1]: Create kubeadm token for joining nodes with 24h expiration (default) (4 retries left).
FAILED - RETRYING: [node1]: Create kubeadm token for joining nodes with 24h expiration (default) (4 retries left).
FAILED - RETRYING: [node2 -> node1]: Create kubeadm token for joining nodes with 24h expiration (default) (3 retries left).
FAILED - RETRYING: [node1]: Create kubeadm token for joining nodes with 24h expiration (default) (3 retries left).
FAILED - RETRYING: [node2 -> node1]: Create kubeadm token for joining nodes with 24h expiration (default) (2 retries left).
FAILED - RETRYING: [node1]: Create kubeadm token for joining nodes with 24h expiration (default) (2 retries left).
FAILED - RETRYING: [node1]: Create kubeadm token for joining nodes with 24h expiration (default) (1 retries left).
FAILED - RETRYING: [node2 -> node1]: Create kubeadm token for joining nodes with 24h expiration (default) (1 retries left).
fatal: [node2 -> node1(192.168.122.201)]: FAILED! => {"attempts": 5, "changed": false, "cmd": ["/usr/local/bin/kubeadm", "--kubeconfig", "/etc/kubernetes/admin.conf", "token", "create"], "delta": "0:01:15.090228", "end": "2023-03-16 14:36:35.958446", "msg": "non-zero return code", "rc": 1, "start": "2023-03-16 14:35:20.868218", "stderr": "timed out waiting for the condition\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["timed out waiting for the condition", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "", "stdout_lines": []}
fatal: [node1]: FAILED! => {"attempts": 5, "changed": false, "cmd": ["/usr/local/bin/kubeadm", "--kubeconfig", "/etc/kubernetes/admin.conf", "token", "create"], "delta": "0:01:15.111611", "end": "2023-03-16 14:36:35.982683", "msg": "non-zero return code", "rc": 1, "start": "2023-03-16 14:35:20.871072", "stderr": "timed out waiting for the condition\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["timed out waiting for the condition", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT ****

PLAY RECAP ****
localhost : ok=3   changed=0  unreachable=0 failed=0 skipped=0   rescued=0 ignored=0
node1     : ok=569 changed=16 unreachable=0 failed=1 skipped=754 rescued=0 ignored=2
node2     : ok=529 changed=14 unreachable=0 failed=1 skipped=647 rescued=0 ignored=2
node3     : ok=474 changed=12 unreachable=0 failed=0 skipped=577 rescued=0 ignored=1
```

**Anything else we need to know**:

Got the same issue with Ubuntu 20.04.
Here is the output on nodes 1, 2, and 3:

```
root@node1:~# /usr/local/bin/kubeadm --kubeconfig /etc/kubernetes/admin.conf token create --v=8
I0316 15:00:22.062350  114713 token.go:119] [token] validating mixed arguments
I0316 15:00:22.062534  114713 token.go:128] [token] getting Clientsets from kubeconfig file
I0316 15:00:22.063512  114713 loader.go:374] Config loaded from file: /etc/kubernetes/admin.conf
I0316 15:00:22.064669  114713 token.go:243] [token] loading configurations
I0316 15:00:22.065165  114713 interface.go:432] Looking for default routes with IPv4 addresses
I0316 15:00:22.065289  114713 interface.go:437] Default route transits interface "enp1s0"
I0316 15:00:22.065514  114713 interface.go:209] Interface enp1s0 is up
I0316 15:00:22.065782  114713 interface.go:257] Interface "enp1s0" has 2 addresses :[192.168.122.201/24 fe80::5054:ff:fe5f:e352/64].
I0316 15:00:22.065935  114713 interface.go:224] Checking addr 192.168.122.201/24.
I0316 15:00:22.066041  114713 interface.go:231] IP found 192.168.122.201
I0316 15:00:22.066180  114713 interface.go:263] Found valid IPv4 address 192.168.122.201 for interface "enp1s0".
I0316 15:00:22.066247  114713 interface.go:443] Found active IP 192.168.122.201
I0316 15:00:22.066421  114713 kubelet.go:196] the value of KubeletConfiguration.cgroupDriver is empty; setting it to "systemd"
I0316 15:00:22.077310  114713 token.go:250] [token] creating token
I0316 15:00:22.077882  114713 round_trippers.go:463] GET https://192.168.122.201:6443/api/v1/namespaces/kube-system/secrets/bootstrap-token-xy5i84?timeout=10s
I0316 15:00:22.077991  114713 round_trippers.go:469] Request Headers:
I0316 15:00:22.078175  114713 round_trippers.go:473]     Accept: application/json, */*
I0316 15:00:22.078278  114713 round_trippers.go:473]     User-Agent: kubeadm/v1.25.6 (linux/amd64) kubernetes/ff2c119
I0316 15:00:22.081208  114713 round_trippers.go:574] Response Status:  in 2 milliseconds
I0316 15:00:22.081383  114713 round_trippers.go:577] Response Headers:
I0316 15:00:22.082563  114713 request.go:1172] Request Body: {"kind":"Secret","apiVersion":"v1","metadata":{"name":"bootstrap-token-xy5i84","namespace":"kube-system","creationTimestamp":null},"data":{"auth-extra-groups":"c3lzdGVtOmJvb3RzdHJhcHBlcnM6a3ViZWFkbTpkZWZhdWx0LW5vZGUtdG9rZW4=","expiration":"MjAyMy0wMy0xN1QxNTowMDoyMlo=","token-id":"eHk1aTg0","token-secret":"Mmh1Z210MDZlMjJuOWdqeg==","usage-bootstrap-authentication":"dHJ1ZQ==","usage-bootstrap-signing":"dHJ1ZQ=="},"type":"bootstrap.kubernetes.io/token"}
...
timed out waiting for the condition
```
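
The `timed out waiting for the condition` message above is kubeadm giving up on a call to the API server, so the token task is usually just the first task to notice an unhealthy control plane rather than the problem itself. As a rough diagnostic sketch (not part of the original report; it only uses standard kubectl/crictl commands and the `admin.conf` path from the failing task), checking the API server and control-plane pods directly on node1 helps narrow down whether the API server, etcd, or the container runtime is stuck:

```bash
# Sketch only: run on node1 to see why API writes are timing out.

# Overall API server health, including etcd connectivity:
kubectl --kubeconfig /etc/kubernetes/admin.conf get --raw='/readyz?verbose'

# Control-plane pods and recent events in kube-system:
kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-system get pods -o wide
kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-system get events --sort-by=.metadata.creationTimestamp | tail -n 20

# Containers as seen by containerd on the node (useful when kubectl itself hangs):
crictl ps -a
```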


offline.yml:
```yaml
---
## Global Offline settings
### Private Container Image Registry
registry_host: 192.168.122.14:5000
files_repo: "http://192.168.122.13/repository/files"

yum_repo: http://192.168.122.13/repository  ### If using RedHat or AlmaLinux

## Container Registry overrides
kube_image_repo: "{{ registry_host }}"
gcr_image_repo: "{{ registry_host }}"
github_image_repo: "{{ registry_host }}"
docker_image_repo: "{{ registry_host }}"
quay_image_repo: "{{ registry_host }}"

## Kubernetes components
kubeadm_download_url: "{{ files_repo }}/kubeadm"
kubectl_download_url: "{{ files_repo }}/kubectl"
kubelet_download_url: "{{ files_repo }}/kubelet"

## CNI Plugins
cni_download_url: "{{ files_repo }}/cni-plugins-linux-amd64-v1.1.1.tgz"

## cri-tools
crictl_download_url: "{{ files_repo }}/crictl-v1.25.0-linux-amd64.tar.gz"

## [Optional] etcd: only if you **DON'T** use etcd_deployment=host
etcd_download_url: "{{ files_repo }}/etcd-v3.5.6-linux-amd64.tar.gz"

## [Optional] Calico: If using Calico network plugin
calicoctl_download_url: "{{ files_repo }}/calicoctl-linux-amd64"
calicoctl_alternate_download_url: "{{ files_repo }}/calicoctl-linux-amd64"

## Containerd
# [Optional] runc,containerd: only if you set container_runtime: containerd
runc_download_url: "{{ files_repo }}/runc.amd64"
containerd_download_url: "{{ files_repo }}/containerd-1.6.15-linux-amd64.tar.gz"
nerdctl_download_url: "{{ files_repo }}/nerdctl-1.0.0-linux-amd64.tar.gz"
```

containerd.yml:

```yaml
---
# Please see roles/container-engine/containerd/defaults/main.yml for more configuration options

# containerd_storage_dir: "/var/lib/containerd"
# containerd_state_dir: "/run/containerd"
# containerd_oom_score: 0

# containerd_default_runtime: "runc"
# containerd_snapshotter: "native"

# containerd_runc_runtime:
#   name: runc
#   type: "io.containerd.runc.v2"
#   engine: ""
#   root: ""

# containerd_additional_runtimes:
# Example for Kata Containers as additional runtime:
#   - name: kata
#     type: "io.containerd.kata.v2"
#     engine: ""
#     root: ""

# containerd_grpc_max_recv_message_size: 16777216
# containerd_grpc_max_send_message_size: 16777216

# containerd_debug_level: "info"

# containerd_metrics_address: ""

# containerd_metrics_grpc_histogram: false

## An obvious use case is allowing insecure-registry access to self hosted registries.
## Can be ipaddress and domain_name.
## example define mirror.registry.io or 172.19.16.11:5000
## set "name": "url". insecure url must be started http://
## Port number is also needed if the default HTTPS port is not used.
containerd_insecure_registries:
  "localhost": "http://127.0.0.1"
  "192.168.122.14:5000": "http://192.168.122.14:5000"

# containerd_registries:
#   "docker.io": "https://registry-1.docker.io"

# containerd_max_container_log_line_size: -1

# containerd_registry_auth:
#   - registry: 10.0.0.2:5000
#     username: user
#     password: pass
```
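
Since every binary and image in this setup comes from the private endpoints defined in offline.yml, it can be worth confirming from each node that both endpoints respond before rerunning the playbook. The checks below are only a sketch, assuming the registry at 192.168.122.14:5000 exposes the standard Docker Registry v2 API; adjust the URLs to your environment.

```bash
# Sketch only: verify the offline endpoints from a cluster node.

# Files repo used for the kubeadm/kubelet/kubectl downloads:
curl -fsSI http://192.168.122.13/repository/files/kubeadm | head -n 1

# Private image registry (plain HTTP, matching containerd_insecure_registries):
curl -fs http://192.168.122.14:5000/v2/_catalog
```
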
lord0gnome commented 1 year ago

We had the same issue in an offline env, and fixed it by doing a kubeadm reset on each of the masters, then running the playbook again.
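
For anyone following this later, a sketch of that workaround. It is only safe on a cluster that is still being installed, because `kubeadm reset` wipes the node's existing cluster state (see the warning further down in this thread).

```bash
# CAUTION: only on a fresh install; kubeadm reset destroys existing cluster state on the node.
# Run on each control-plane node (node1 and node2 in this inventory):
/usr/local/bin/kubeadm reset -f

# Then rerun the deployment from the Ansible host, using the same invocation as above:
ansible-playbook -i inventory/onprem/hosts.yaml --become --user=root --become-user=root cluster.yml
```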

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 7 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/kubespray/issues/9907#issuecomment-2011701451):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
clayrisser commented 1 month ago

@lord0gnome terrible advice. kubeadm reset wiped my cluster.

lord0gnome commented 1 month ago

I'm very sorry, @clayrisser. I should have mentioned that we were installing a new cluster, so there was no consequence to doing a kubeadm reset. My understanding of what we were doing at the time was minimal, and I was just trying to help others who may have been stuck in similar situations.