kcalmond opened 3 years ago
Update: I figured I'd set the keepalived_vip value in the old variables section of cluster.yml to the same value I'm using in master.yml and try again. First I ran the nuke playbook on the existing deploy. Running all.yml failed again, this time because it found existing config inside /etc/kubernetes/manifests/. Maybe a problem with the nuke playbook here.
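For anyone hitting the same leftover-state error, a rough manual cleanup sketch (assuming the nuke playbook simply missed the static pod manifests; kubeadm reset removes most of this state on its own, and the paths below are the kubeadm defaults):
# run on each node that still reports leftover config
sudo kubeadm reset -f
sudo rm -rf /etc/kubernetes/manifests/* /var/lib/etcd
sudo systemctl restart containerd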
Update: I reimaged all the Pis back to stock Ubuntu per the first part of your install doc and ran the all.yml install again. This time I set the keepalived_vip value in the old variables section of cluster.yml to the same value I'm using in master.yml. The install moved past the keepalived_vip problem caused by the old variables setting, then hit another failure mode:
TASK [cluster : join | add node to cluster] ******************************************************************************************************************************************
Sunday 14 February 2021 12:17:44 -0800 (0:00:00.919) 0:09:29.725 *******
changed: [blackberry]
fatal: [strawberry]: FAILED! => changed=true
cmd:
- kubeadm
- join
- --config
- /etc/kubernetes/kubeadm-join.yaml
delta: '0:04:12.506743'
end: '2021-02-14 20:21:57.663006'
msg: non-zero return code
rc: 1
start: '2021-02-14 20:17:45.156263'
stderr: |2-
[WARNING SystemVerification]: missing optional cgroups: hugetlb
error execution phase kubelet-start: error uploading crisocket: timed out waiting for the condition
To see the stack trace of this error execute with --v=5 or higher
stderr_lines: <omitted>
stdout: |-
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [localhost strawberry] and IPs [192.168.0.52 127.0.0.1 ::1]
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [localhost strawberry] and IPs [192.168.0.52 127.0.0.1 ::1 192.168.0.50]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local strawberry] and IPs [10.144.0.1 192.168.0.52 192.168.0.50]
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "admin.conf" kubeconfig file
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
stdout_lines: <omitted>
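While the join on strawberry was hanging at the TLS bootstrap step, a few generic kubelet/containerd checks could be run on that node (a sketch, not part of the playbook):
sudo systemctl status kubelet
sudo journalctl -u kubelet --no-pager | tail -n 100
sudo crictl ps -a     # are the etcd/kube-apiserver containers actually starting?
cat /var/lib/kubelet/kubeadm-flags.env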
After about 10 minutes of no output on stdout, the play continued, completing a group of cni and storage tasks before finishing the play:
TASK [cni : preflight checks] ********************************************************************************************************************************************************
Sunday 14 February 2021 12:28:29 -0800 (0:10:45.179) 0:20:14.905 *******
included: /Users/chris/GH/raspbernetes/k8s-cluster-installation/ansible/roles/cni/tasks/pre_checks.yml for hackberry
TASK [cni : validate variable : cni_plugin] ******************************************************************************************************************************************
Sunday 14 February 2021 12:28:29 -0800 (0:00:00.099) 0:20:15.004 *******
ok: [hackberry] => changed=false
msg: All assertions passed
TASK [cni : validate variable : cni_main_master] *************************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.119) 0:20:15.124 *******
ok: [hackberry] => changed=false
msg: All assertions passed
TASK [cni : validate plugin support] *************************************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.115) 0:20:15.239 *******
ok: [hackberry] => changed=false
msg: All assertions passed
TASK [cni : validate pod subnet is used when using flannel] **************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.161) 0:20:15.401 *******
skipping: [hackberry]
TASK [cni : validate pod subnet is used when using cilium] ***************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.117) 0:20:15.518 *******
skipping: [hackberry]
TASK [cni : setup calico container network interface (cni)] **************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.117) 0:20:15.635 *******
skipping: [blueberry]
skipping: [blackberry]
included: /Users/chris/GH/raspbernetes/k8s-cluster-installation/ansible/roles/cni/tasks/calico.yml for hackberry
TASK [cni : applying calico] *********************************************************************************************************************************************************
Sunday 14 February 2021 12:28:30 -0800 (0:00:00.202) 0:20:15.838 *******
changed: [hackberry]
TASK [storage : preflight checks] ****************************************************************************************************************************************************
Sunday 14 February 2021 12:28:41 -0800 (0:00:10.349) 0:20:26.188 *******
included: /Users/chris/GH/raspbernetes/k8s-cluster-installation/ansible/roles/storage/tasks/pre_checks.yml for hackberry
TASK [storage : check os_family support] *********************************************************************************************************************************************
Sunday 14 February 2021 12:28:41 -0800 (0:00:00.093) 0:20:26.281 *******
ok: [hackberry] => changed=false
msg: All assertions passed
TASK [storage : validate variable : openebs_enabled] *********************************************************************************************************************************
Sunday 14 February 2021 12:28:41 -0800 (0:00:00.146) 0:20:26.428 *******
ok: [hackberry] => changed=false
msg: All assertions passed
TASK [storage : include family specific tasks] ***************************************************************************************************************************************
Sunday 14 February 2021 12:28:41 -0800 (0:00:00.104) 0:20:26.532 *******
skipping: [hackberry]
skipping: [blueberry]
skipping: [blackberry]
PLAY RECAP ***************************************************************************************************************************************************************************
127.0.0.1 : ok=1 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
blackberry : ok=67 changed=38 unreachable=0 failed=0 skipped=14 rescued=0 ignored=0
blueberry : ok=85 changed=45 unreachable=0 failed=0 skipped=18 rescued=0 ignored=0
hackberry : ok=130 changed=51 unreachable=0 failed=0 skipped=22 rescued=0 ignored=0
strawberry : ok=86 changed=44 unreachable=0 failed=1 skipped=18 rescued=0 ignored=0
Sunday 14 February 2021 12:28:41 -0800 (0:00:00.180) 0:20:26.712 *******
===============================================================================
cluster : join | add node to cluster ---------------------------------------------------------------------------------------------------------------------------------------- 645.18s
cluster : initialize | execute kubeadm init on first control plane node ----------------------------------------------------------------------------------------------------- 141.82s
common : reboot hosts -------------------------------------------------------------------------------------------------------------------------------------------------------- 63.43s
kubernetes : install kubernetes packages (3/4) ------------------------------------------------------------------------------------------------------------------------------- 41.31s
keepalived : debian : install keepalived ------------------------------------------------------------------------------------------------------------------------------------- 36.21s
haproxy : debian : install haproxy ------------------------------------------------------------------------------------------------------------------------------------------- 28.68s
cri : containerd | install package from apt repository ----------------------------------------------------------------------------------------------------------------------- 21.78s
common : install common Kubernetes ansible module ---------------------------------------------------------------------------------------------------------------------------- 20.33s
common : install common packages --------------------------------------------------------------------------------------------------------------------------------------------- 20.03s
kubernetes : install helm package (3/3) -------------------------------------------------------------------------------------------------------------------------------------- 19.10s
kubernetes : adding apt repository for kubernetes (2/4) ---------------------------------------------------------------------------------------------------------------------- 17.17s
cri : containerd | ensure docker.io apt repository is enabled ---------------------------------------------------------------------------------------------------------------- 16.62s
kubernetes : adding apt repository for helm (2/3) ---------------------------------------------------------------------------------------------------------------------------- 16.31s
cni : applying calico -------------------------------------------------------------------------------------------------------------------------------------------------------- 10.35s
Gathering Facts --------------------------------------------------------------------------------------------------------------------------------------------------------------- 9.31s
Gathering Facts --------------------------------------------------------------------------------------------------------------------------------------------------------------- 8.38s
Gathering Facts --------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.81s
cri : crictl | unarchive crictl binary ---------------------------------------------------------------------------------------------------------------------------------------- 3.98s
common : apt-get upgrade ------------------------------------------------------------------------------------------------------------------------------------------------------ 3.55s
kubernetes : add apt signing key for kubernetes (1/4) ------------------------------------------------------------------------------------------------------------------------- 3.49s
Here is cluster node status at this point:
> kubectl get nodes -o wide --kubeconfig ./k8s-config.yaml
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
blackberry NotReady <none> 11m v1.20.2 192.168.0.54 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
blueberry NotReady control-plane,master 10m v1.20.2 192.168.0.53 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
hackberry NotReady control-plane,master 12m v1.20.2 192.168.0.51 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
strawberry NotReady <none> 72s v1.20.2 192.168.0.52 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
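NotReady right after the join and CNI steps is not unusual while the Calico and CoreDNS pods come up; a quick way to check them (assuming the same kubeconfig path used above):
kubectl get pods -n kube-system -o wide --kubeconfig ./k8s-config.yaml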
I decided to run the all.yml play again, this time with no fatal errors. However, my node status shows only two masters (not the three I spec'd in the inventory):
> kubectl get nodes -o wide --kubeconfig ansible/playbooks/output/k8s-config.yaml
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
blackberry Ready <none> 28m v1.20.2 192.168.0.54 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
blueberry Ready control-plane,master 26m v1.20.2 192.168.0.53 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
hackberry Ready control-plane,master 28m v1.20.2 192.168.0.51 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
strawberry Ready <none> 17m v1.20.2 192.168.0.52 <none> Ubuntu 20.04.2 LTS 5.4.0-1028-raspi containerd://1.4.3
More cluster state details related to the above. Note that one of my declared masters (strawberry) is not running etcd or kube-apiserver.
hackberry (master):
pi@hackberry:~$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
9a81d4f033456 35af1df8416d5 30 minutes ago Running calico-node 0 63f1cc1b988cd
a9db1cfcdd76c 95d99817fc335 30 minutes ago Running kube-apiserver 7 f420ec7436438
968b95dbae0c9 05b738aa1bc63 34 minutes ago Running etcd 2 321b103cca468
b2227ea8e2d97 3a1a2b528610a 40 minutes ago Running kube-controller-manager 1 d47890e1ba157
7a7753851d9ab 60d957e44ec8a 40 minutes ago Running kube-scheduler 1 f625fffc45bbf
aeb39c7ac3efd 788e63d07298d 42 minutes ago Running kube-proxy 0 ab9d064b9295e
blueberry (master):
pi@blueberry:~$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
ac54a7b16f731 3a1a2b528610a 30 minutes ago Running kube-controller-manager 1 9b185ac5454fd
5a22a7d8528a3 35af1df8416d5 30 minutes ago Running calico-node 0 58fa61fc45d9a
d0054c698ebf0 05b738aa1bc63 32 minutes ago Running etcd 0 e48cc0575773d
4099be2608d81 95d99817fc335 32 minutes ago Running kube-apiserver 6 be8677296d508
4ba5a75ffb0a6 60d957e44ec8a 40 minutes ago Running kube-scheduler 0 48dda9bf2a9e5
e0306a62e0e89 788e63d07298d 40 minutes ago Running kube-proxy 0 69a4c4384e835
strawberry (intended master):
pi@strawberry:~$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
6405e63c02be4 35af1df8416d5 30 minutes ago Running calico-node 0 034286a83df90
0e6c98f8a89df 788e63d07298d 31 minutes ago Running kube-proxy 0 0740ce290c2ff
30ba49df68752 60d957e44ec8a 40 minutes ago Running kube-scheduler 0 2dd66e00bd2be
4632943f39249 3a1a2b528610a 40 minutes ago Running kube-controller-manager 0 e32207bafed57
blackberry (worker):
pi@blackberry:~$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
2c6e7b609bacc acf4bc146ed19 29 minutes ago Running calico-kube-controllers 0 8888ae3c57ffc
3afbd3e2d83b7 db91994f4ee8f 30 minutes ago Running coredns 0 f5955392080c5
21f385bc19c1d db91994f4ee8f 30 minutes ago Running coredns 0 15e8ea135da08
c88c330892f02 35af1df8416d5 30 minutes ago Running calico-node 0 b4dd709f7fd86
4eb84272d3f4d 788e63d07298d 42 minutes ago Running kube-proxy 0 2cae842481891
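Since strawberry is missing the etcd and kube-apiserver containers, it would be worth confirming whether kubeadm ever wrote (or later removed) their static pod manifests; a sketch using the default manifest directory from the join output above:
# on strawberry, comparing against a working master such as hackberry
sudo ls -l /etc/kubernetes/manifests/
sudo journalctl -u kubelet --no-pager | grep -i -E 'apiserver|etcd' | tail -n 50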
In case this helps ^^:
> cat ansible/inventory
[all]
hackberry hostname=hackberry.almond.lan ansible_host=192.168.0.51 ansible_user=pi
strawberry hostname=strawberry.almond.lan ansible_host=192.168.0.52 ansible_user=pi
blueberry hostname=blueberry.almond.lan ansible_host=192.168.0.53 ansible_user=pi
blackberry hostname=blackberry.almond.lan ansible_host=192.168.0.54 ansible_user=pi
[cluster:children]
controlplane
nodes
[controlplane]
hackberry
strawberry
blueberry
[nodes]
blackberry
[docker_cache]
#registry hostname=registry ansible_host=192.168.1.120 ansible_user=pi
; These entries are here for backward compatibility as we transition away from the old names.
[k8s:children]
masters
workers
[masters]
hackberry
strawberry
blueberry
[workers]
blackberry
The old variables fragment in ansible/group_vars/cluster.yml:
## Old variables ####
# Role - keepalived ####
keepalived_vip: 192.168.91.240
Ok, so this looks like testing leftovers that snuck into the "defaults": keepalived_vip should normally be an empty string, like it is in master.yml.
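So the entry in ansible/group_vars/cluster.yml should presumably just be blanked out, something like:
# ansible/group_vars/cluster.yml (only the relevant variable shown)
keepalived_vip: ""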
The second issue is an odd one: the API server just doesn't seem to respond properly when another control-plane node attempts to join. There's a check built into the playbook to ensure the API server is available prior to attempting the join; sometimes it catches the delay, sometimes it doesn't.
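A rough manual equivalent of that availability check, for anyone who wants to verify the control-plane endpoint by hand before re-running the join (the VIP 192.168.0.50 is taken from the certificate SANs in the log above; port 6443 is the kubeadm default and is an assumption here):
until curl -k --silent --fail https://192.168.0.50:6443/healthz; do
  echo 'API server not answering on the VIP yet; retrying...'
  sleep 5
done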
If you execute the nuke playbook (ansible-playbook playbooks/nuke.yml), it will reset the nodes to the state they were in prior to creating the cluster. Executing the all playbook from that state should succeed; please let us know if it is successful.
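For reference, the full reset-and-reinstall sequence from the repository root would look roughly like this (the ANSIBLE_CONFIG prefix matches the command used elsewhere in this issue; adjust the playbook paths if you run from inside the ansible/ directory instead):
env ANSIBLE_CONFIG=ansible/ansible.cfg ansible-playbook ansible/playbooks/nuke.yml
env ANSIBLE_CONFIG=ansible/ansible.cfg ansible-playbook ansible/playbooks/all.yml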
Details
What steps did you take and what happened:
Trying out the updated Ansible installation playbook today on a freshly imaged Ubuntu 20.04, four-node Pi 4B cluster. I hit this failure mode when running
env ANSIBLE_CONFIG=ansible/ansible.cfg ansible-playbook ansible/playbooks/all.yml:
What did you expect to happen: All playbooks run to completion with a 4 node k8s cluster started up successfully.
Anything else you would like to add:
I searched for 192.168.91.240 in the local clone and found it inside the old variables section of ansible/group_vars/cluster.yml.
Additional Information: