raspbernetes / k8s-cluster-installation

Bootstrap a k8s cluster with Ansible

https://raspbernetes.github.io/

Apache License 2.0

Feature/refactor cluster init join #101

Status: Closed (rkage closed 3 years ago)

rkage commented 3 years ago

Description

This PR refactors the cluster init and join role.

crutonjohn commented 3 years ago

My run on a 3-node (1x2) cluster failed when joining the two nodes to the single control plane node:

fatal: [k-test02]: FAILED! => changed=true 
  cmd:
  - kubeadm
  - join
  - --config
  - /etc/kubernetes/kubeadm-join.yaml
  delta: '0:05:06.400650'
  end: '2021-02-12 01:10:44.492383'
  msg: non-zero return code
  rc: 1
  start: '2021-02-12 01:05:38.091733'
  stderr: |2-
            [WARNING SystemVerification]: missing optional cgroups: hugetlb
    error execution phase preflight: couldn't validate the identity of the API Server: Get "https://192.168.91.240:8443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    To see the stack trace of this error execute with --v=5 or higher
  stderr_lines: <omitted>
  stdout: '[preflight] Running pre-flight checks'
  stdout_lines: <omitted>

I'm investigating
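For anyone hitting the same preflight timeout, a reachability check before the join can narrow things down. The task below is only a sketch and not part of this PR: it uses Ansible's built-in wait_for module, with the address and port taken from the error above, and the task name and 30-second timeout are assumptions.

```yaml
# Hypothetical pre-join check, run on the node that is about to join.
# Host and port are copied from the error message; adjust for your cluster.
- name: Check that the API server endpoint is reachable before kubeadm join
  ansible.builtin.wait_for:
    host: 192.168.91.240
    port: 8443
    timeout: 30   # assumed value, fail quickly instead of after ~5 minutes
    msg: API server endpoint did not accept a TCP connection
```

If this task fails, the problem is likely network or load balancer related rather than something in the kubeadm join configuration itself.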

crutonjohn commented 3 years ago

False alarm. PEBCAK

crutonjohn commented 3 years ago

The machine running Ansible locally possibly needs netaddr installed via pip/pip3, at least on MacBooks.
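One way to handle this could be a small play that installs netaddr on the control machine before anything else runs. This is a sketch under assumptions, not something in the PR: the play and task names are made up, and it assumes pip is available for the Python interpreter that Ansible itself uses.

```yaml
# Hypothetical play: make sure netaddr is present on the machine running Ansible.
- name: Prepare the control machine
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Install netaddr for the local Python interpreter
      ansible.builtin.pip:
        name: netaddr
```

Depending on how Python is installed locally, this may need a virtualenv or a user-level install instead of a system-wide one.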

crutonjohn commented 3 years ago

There are some weird issues with /etc/cni/* not getting removed, so running a nuke doesn't necessarily undo all changes.

After running all.yml, I was stuck with a dead cluster. Running nuke.yml followed by rm -rf /etc/cni/* allowed me to deploy successfully.

I'm not sure why, but it looks like the "remove cni net.d folder" task isn't running.
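For reference, a cleanup task along these lines would remove the leftover CNI config. This is a guess at what such a task could look like, not the actual task from the role: the name is made up, and it assumes the configuration lives in /etc/cni/net.d.

```yaml
# Hypothetical cleanup task; the real task in the role may differ.
- name: Remove leftover CNI configuration
  ansible.builtin.file:
    path: /etc/cni/net.d
    state: absent   # deletes the directory and everything inside it
  become: true
```

Unlike rm -rf /etc/cni/*, this removes the net.d directory itself rather than everything under /etc/cni, so it is slightly narrower than the manual workaround.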

crutonjohn commented 3 years ago

I know I kind of shoehorned a few fixes in here, but it looks like everything passes from all.yml to nuke.yml, at least for a 1x2 config with Calico. The bare minimum is working.