kcalmond opened 3 years ago
Update: I figured I'd set the keepalived_vip value in the old variables section of cluster.yml to the same value I'm using in master.yml and try again. First I ran the nuke playbook on the existing deploy. Running all.yml failed again, this time because it found existing config inside /etc/kubernetes/manifests/. Maybe a problem with the nuke playbook here.
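For anyone hitting the same leftover-state error, a rough manual cleanup sketch (assuming the nuke playbook simply missed the static pod manifests; kubeadm reset removes most of this state on its own, and the paths below are the kubeadm defaults):
# run on each node that still reports leftover config
sudo kubeadm reset -f
sudo rm -rf /etc/kubernetes/manifests/* /var/lib/etcd
sudo systemctl restart containerd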
Update: I reimaged all the Pis back to stock Ubuntu per the first part of your install doc and ran the all.yml install again. This time I set the keepalived_vip value in the old variables section of cluster.yml to the same value I'm using in master.yml. The install moved past the keepalived_vip problem caused by the old variables setting, then hit another failure mode:
TASK [cluster : join | add node to cluster] ******************************************************************************************************************************************
Sunday 14 February 2021 12:17:44 -0800 (0:00:00.919) 0:09:29.725 *******
changed: [blackberry]
fatal: [strawberry]: FAILED! => changed=true
cmd:
- kubeadm
- join
- --config
- /etc/kubernetes/kubeadm-join.yaml
delta: '0:04:12.506743'
end: '2021-02-14 20:21:57.663006'
msg: non-zero return code
rc: 1
start: '2021-02-14 20:17:45.156263'
stderr: |2-
[WARNING SystemVerification]: missing optional cgroups: hugetlb
error execution phase kubelet-start: error uploading crisocket: timed out waiting for the condition
To see the stack trace of this error execute with --v=5 or higher
stderr_lines: <omitted>
stdout: |-
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [localhost strawberry] and IPs [192.168.0.52 127.0.0.1 ::1]
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [localhost strawberry] and IPs [192.168.0.52 127.0.0.1 ::1 192.168.0.50]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local strawberry] and IPs [10.144.0.1 192.168.0.52 192.168.0.50]
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "admin.conf" kubeconfig file
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
stdout_lines: <omitted>
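While the join on strawberry was hanging at the TLS bootstrap step, a few generic kubelet/containerd checks could be run on that node (a sketch, not part of the playbook):
sudo systemctl status kubelet
sudo journalctl -u kubelet --no-pager | tail -n 100
sudo crictl ps -a     # are the etcd/kube-apiserver containers actually starting?
cat /var/lib/kubelet/kubeadm-flags.env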
After about 10 minutes of no output on stdout, the play continued, completing a group of cni and storage tasks before finishing the play:
TASK [cni : preflight checks] ********************************************************************************************************************************************************
Sunday 14 February 2021 12:28:29 -0800 (0:10:45.179) 0:20:14.905 *******
included: /Users/chris/GH/raspbernetes/k8s-cluster-installation/ansible/roles/cni/tasks/pre_checks.yml for hackberry
TASK [cni : validate variable : cni_plugin] ******************************************************************************************************************************************
Sunday 14 February 2021 12:28:29 -0800 (0:00:00.099) 0:20:15.004 *******
ok: [hackberry] => changed=false
msg: All assertions passed
TASK [cni : validate variable : cni_main_master] *************************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.119) 0:20:15.124 *******
ok: [hackberry] => changed=false
msg: All assertions passed
TASK [cni : validate plugin support] *************************************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.115) 0:20:15.239 *******
ok: [hackberry] => changed=false
msg: All assertions passed
TASK [cni : validate pod subnet is used when using flannel] **************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.161) 0:20:15.401 *******
skipping: [hackberry]
TASK [cni : validate pod subnet is used when using cilium] ***************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.117) 0:20:15.518 *******
skipping: [hackberry]
TASK [cni : setup calico container network interface (cni)] **************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.117) 0:20:15.635 *******
skipping: [blueberry]
skipping: [blackberry]
included: /Users/chris/GH/raspbernetes/k8s-cluster-installation/ansible/roles/cni/tasks/calico.yml for hackberry
TASK [cni : applying calico] *********************************************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.202) 0:20:15.838 *******
changed: [hackberry]
TASK [storage : preflight checks] ****************************************************************************************************************************************************
Sunday 14 February 2021 12:28:41 -0800 (0:00:10.349) 0:20:26.188 *******
included: /Users/chris/GH/raspbernetes/k8s-cluster-installation/ansible/roles/storage/tasks/pre_checks.yml for hackberry
TASK [storage : check os_family support] *********************************************************************************************************************************************
Sunday 14 February 2021 12:28:41 -0800 (0:00:00.093) 0:20:26.281 *******
ok: [hackberry] => changed=false
msg: All assertions passed
TASK [storage : validate variable : openebs_enabled] *********************************************************************************************************************************
Sunday 14 February 2021 12:28:41 -0800 (0:00:00.146) 0:20:26.428 *******
ok: [hackberry] => changed=false
msg: All assertions passed
TASK [storage : include family specific tasks] ***************************************************************************************************************************************
Sunday 14 February 2021 12:28:41 -0800 (0:00:00.104) 0:20:26.532 *******
skipping: [hackberry]
skipping: [blueberry]
skipping: [blackberry]
PLAY RECAP ***************************************************************************************************************************************************************************
127.0.0.1 : ok=1 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
blackberry : ok=67 changed=38 unreachable=0 failed=0 skipped=14 rescued=0 ignored=0
blueberry : ok=85 changed=45 unreachable=0 failed=0 skipped=18 rescued=0 ignored=0
hackberry : ok=130 changed=51 unreachable=0 failed=0 skipped=22 rescued=0 ignored=0
strawberry : ok=86 changed=44 unreachable=0 failed=1 skipped=18 rescued=0 ignored=0
Sunday 14 February 2021 12:28:41 -0800 (0:00:00.180) 0:20:26.712 *******
===============================================================================
cluster : join | add node to cluster ---------------------------------------------------------------------------------------------------------------------------------------- 645.18s
cluster : initialize | execute kubeadm init on first control plane node ----------------------------------------------------------------------------------------------------- 141.82s
common : reboot hosts -------------------------------------------------------------------------------------------------------------------------------------------------------- 63.43s
kubernetes : install kubernetes packages (3/4) ------------------------------------------------------------------------------------------------------------------------------- 41.31s
keepalived : debian : install keepalived ------------------------------------------------------------------------------------------------------------------------------------- 36.21s
haproxy : debian : install haproxy ------------------------------------------------------------------------------------------------------------------------------------------- 28.68s
cri : containerd | install package from apt repository ----------------------------------------------------------------------------------------------------------------------- 21.78s
common : install common Kubernetes ansible module ---------------------------------------------------------------------------------------------------------------------------- 20.33s
common : install common packages --------------------------------------------------------------------------------------------------------------------------------------------- 20.03s
kubernetes : install helm package (3/3) -------------------------------------------------------------------------------------------------------------------------------------- 19.10s
kubernetes : adding apt repository for kubernetes (2/4) ---------------------------------------------------------------------------------------------------------------------- 17.17s
cri : containerd | ensure docker.io apt repository is enabled ---------------------------------------------------------------------------------------------------------------- 16.62s
kubernetes : adding apt repository for helm (2/3) ---------------------------------------------------------------------------------------------------------------------------- 16.31s
cni : applying calico -------------------------------------------------------------------------------------------------------------------------------------------------------- 10.35s
Gathering Facts --------------------------------------------------------------------------------------------------------------------------------------------------------------- 9.31s
Gathering Facts --------------------------------------------------------------------------------------------------------------------------------------------------------------- 8.38s
Gathering Facts --------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.81s
cri : crictl | unarchive crictl binary ---------------------------------------------------------------------------------------------------------------------------------------- 3.98s
common : apt-get upgrade ------------------------------------------------------------------------------------------------------------------------------------------------------ 3.55s
kubernetes : add apt signing key for kubernetes (1/4) ------------------------------------------------------------------------------------------------------------------------- 3.49s
Here is cluster node status at this point:
> kubectl get nodes -o wide --kubeconfig ./k8s-config.yaml
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
blackberry NotReady <none> 11m v1.20.2 192.168.0.54 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
blueberry NotReady control-plane,master 10m v1.20.2 192.168.0.53 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
hackberry NotReady control-plane,master 12m v1.20.2 192.168.0.51 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
strawberry NotReady <none> 72s v1.20.2 192.168.0.52 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
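NotReady right after the join and CNI steps is not unusual while the Calico and CoreDNS pods come up; a quick way to check them (assuming the same kubeconfig path used above):
kubectl get pods -n kube-system -o wide --kubeconfig ./k8s-config.yaml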
I decided to run the all.yml play again, this time with no fatal errors. However, my node status shows only two masters (not the three I spec'd in the inventory):
> kubectl get nodes -o wide --kubeconfig ansible/playbooks/output/k8s-config.yaml
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
blackberry Ready <none> 28m v1.20.2 192.168.0.54 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
blueberry Ready control-plane,master 26m v1.20.2 192.168.0.53 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
hackberry Ready control-plane,master 28m v1.20.2 192.168.0.51 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
strawberry Ready <none> 17m v1.20.2 192.168.0.52 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
More cluster state details related to the above. Note that one of my declared masters (strawberry) is not running etcd or kube-apiserver.
hackberry (master):
pi@hackberry:~$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
9a81d4f033456 35af1df8416d5 30 minutes ago Running calico-node 0 63f1cc1b988cd
a9db1cfcdd76c 95d99817fc335 30 minutes ago Running kube-apiserver 7 f420ec7436438
968b95dbae0c9 05b738aa1bc63 34 minutes ago Running etcd 2 321b103cca468
b2227ea8e2d97 3a1a2b528610a 40 minutes ago Running kube-controller-manager 1 d47890e1ba157
7a7753851d9ab 60d957e44ec8a 40 minutes ago Running kube-scheduler 1 f625fffc45bbf
aeb39c7ac3efd 788e63d07298d 42 minutes ago Running kube-proxy 0 ab9d064b9295e
blueberry (master):
pi@blueberry:~$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
ac54a7b16f731 3a1a2b528610a 30 minutes ago Running kube-controller-manager 1 9b185ac5454fd
5a22a7d8528a3 35af1df8416d5 30 minutes ago Running calico-node 0 58fa61fc45d9a
d0054c698ebf0 05b738aa1bc63 32 minutes ago Running etcd 0 e48cc0575773d
4099be2608d81 95d99817fc335 32 minutes ago Running kube-apiserver 6 be8677296d508
4ba5a75ffb0a6 60d957e44ec8a 40 minutes ago Running kube-scheduler 0 48dda9bf2a9e5
e0306a62e0e89 788e63d07298d 40 minutes ago Running kube-proxy 0 69a4c4384e835
strawberry (intended master):
pi@strawberry:~$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
6405e63c02be4 35af1df8416d5 30 minutes ago Running calico-node 0 034286a83df90
0e6c98f8a89df 788e63d07298d 31 minutes ago Running kube-proxy 0 0740ce290c2ff
30ba49df68752 60d957e44ec8a 40 minutes ago Running kube-scheduler 0 2dd66e00bd2be
4632943f39249 3a1a2b528610a 40 minutes ago Running kube-controller-manager 0 e32207bafed57
blackberry (worker):
pi@blackberry:~$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
2c6e7b609bacc acf4bc146ed19 29 minutes ago Running calico-kube-controllers 0 8888ae3c57ffc
3afbd3e2d83b7 db91994f4ee8f 30 minutes ago Running coredns 0 f5955392080c5
21f385bc19c1d db91994f4ee8f 30 minutes ago Running coredns 0 15e8ea135da08
c88c330892f02 35af1df8416d5 30 minutes ago Running calico-node 0 b4dd709f7fd86
4eb84272d3f4d 788e63d07298d 42 minutes ago Running kube-proxy 0 2cae842481891
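Since strawberry is missing the etcd and kube-apiserver containers, it would be worth confirming whether kubeadm ever wrote (or later removed) their static pod manifests; a sketch using the default manifest directory from the join output above:
# on strawberry, comparing against a working master such as hackberry
sudo ls -l /etc/kubernetes/manifests/
sudo journalctl -u kubelet --no-pager | grep -i -E 'apiserver|etcd' | tail -n 50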
In case this helps ^^:
> cat ansible/inventory
[all]
hackberry hostname=hackberry.almond.lan ansible_host=192.168.0.51 ansible_user=pi
strawberry hostname=strawberry.almond.lan ansible_host=192.168.0.52 ansible_user=pi
blueberry hostname=blueberry.almond.lan ansible_host=192.168.0.53 ansible_user=pi
blackberry hostname=blackberry.almond.lan ansible_host=192.168.0.54 ansible_user=pi
[cluster:children]
controlplane
nodes
[controlplane]
hackberry
strawberry
blueberry
[nodes]
blackberry
[docker_cache]
#registry hostname=registry ansible_host=192.168.1.120 ansible_user=pi
; These entries are here for backward compatibility as we transition away from the old names.
[k8s:children]
masters
workers
[masters]
hackberry
strawberry
blueberry
[workers]
blackberry
The old variables fragment in ansible/group_vars/cluster.yml:
## Old variables ####
# Role - keepalived ####
keepalived_vip: 192.168.91.240
Ok, so this looks like testing leftovers that snuck into the "defaults": keepalived_vip should normally be an empty string, like it is in master.yml.
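So the entry in ansible/group_vars/cluster.yml should presumably just be blanked out, something like:
# ansible/group_vars/cluster.yml (only the relevant variable shown)
keepalived_vip: ""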
The second issue is an odd one: the API server just doesn't seem to respond properly when another control-plane node attempts to join. There's a check built into the playbook to ensure the API server is available prior to attempting the join; sometimes it catches the delay, sometimes it doesn't.
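A rough manual equivalent of that availability check, for anyone who wants to verify the control-plane endpoint by hand before re-running the join (the VIP 192.168.0.50 is taken from the certificate SANs in the log above; port 6443 is the kubeadm default and is an assumption here):
until curl -k --silent --fail https://192.168.0.50:6443/healthz; do
  echo 'API server not answering on the VIP yet; retrying...'
  sleep 5
done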
If you execute the nuke playbook (ansible-playbook playbooks/nuke.yml), it will reset the nodes to the state they were in prior to creating the cluster. Executing the all playbook from that state should succeed; please let us know if it is successful.
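For reference, the full reset-and-reinstall sequence from the repository root would look roughly like this (the ANSIBLE_CONFIG prefix matches the command used elsewhere in this issue; adjust the playbook paths if you run from inside the ansible/ directory instead):
env ANSIBLE_CONFIG=ansible/ansible.cfg ansible-playbook ansible/playbooks/nuke.yml
env ANSIBLE_CONFIG=ansible/ansible.cfg ansible-playbook ansible/playbooks/all.yml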
Details
What steps did you take and what happened:
Trying out the updated Ansible installation playbook today on a freshly imaged Ubuntu 20.04, four-node Pi 4B cluster. I hit this failure mode when running
env ANSIBLE_CONFIG=ansible/ansible.cfg ansible-playbook ansible/playbooks/all.yml:
What did you expect to happen: All playbooks run to completion with a 4 node k8s cluster started up successfully.
Anything else you would like to add:
I searched for 192.168.91.240 in the local clone and found it inside the old variables section of ansible/group_vars/cluster.yml.
Additional Information: