erictcgs opened 5 years ago
Note that this also means nodes can't be added to the cluster, since that requires the install playbook to run. The run must include the etcd nodes (so that the primary master has that variable set correctly), then primary_master (to get the kubeadm install token), then the nodes. However, the playbook fails when trying to install packages on the master, and then can't generate the kubeadm token.
It looks like the root of my original comment was a `kubernetes_cni_version: "0.6.0-00"` variable set in the ansible inventory from a previous version of kubernetes. The upgrade scripts seem to ignore this variable (so those had worked and installed cni 0.7.5), but the install scripts use it, and that caused the failure.
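For anyone else hitting this, a quick way to spot stale version pins in an ansible inventory is to grep for `_version` variables. The inventory written below is a hypothetical stand-in so the example is self-contained; substitute the path to your real wardroom inventory.

```shell
# Hypothetical inventory fragment, created only so this example runs as-is;
# point the grep at your actual wardroom inventory file instead.
cat > /tmp/sample-inventory.yml <<'EOF'
all:
  vars:
    kubernetes_version: "1.12.7"
    kubernetes_cni_version: "0.6.0-00"   # stale pin left over from an older cluster
EOF

# List every explicit version pin so stale ones stand out.
grep -n '_version' /tmp/sample-inventory.yml
```

Any pin that doesn't match what the current release of the playbooks expects is a candidate for removal.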
Unfortunately the playbook still can't be run - I'm running into this issue when adding a new node: https://github.com/kubernetes/kubeadm/issues/907:
# /usr/bin/kubeadm join api.hostname.com:6443 --token=6uouog.xxxx --discovery-token-unsafe-skip-ca-verification --ignore-preflight-errors=all
...
[kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.12" ConfigMap in the kube-system namespace
configmaps "kubelet-config-1.12" is forbidden: User "system:bootstrap:6uouog" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
From github issues this has usually been due to a version mismatch, but here everything was installed/upgraded via wardroom and the versions seem to match.
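Since version skew is the usual suspect in these reports, here is the comparison I did by eye, as a small self-contained check. The two variables are hard-coded with this cluster's values so the snippet is reproducible; on a live node you would populate them from the binaries as the comments show.

```shell
# On a real node these would come from the installed binaries, e.g.:
#   kubelet_ver=$(kubelet --version | awk '{print $2}')
#   kubeadm_ver=$(kubeadm version -o short)
# Hard-coded here with the values reported above.
kubelet_ver="v1.12.7"
kubeadm_ver="v1.12.7"

if [ "$kubelet_ver" = "$kubeadm_ver" ]; then
  echo "versions match: $kubelet_ver"
else
  echo "version skew: kubelet=$kubelet_ver kubeadm=$kubeadm_ver"
fi
```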
On master:
# apt list --installed | grep kuber
cri-tools/kubernetes-xenial,now 1.12.0-00 amd64 [installed,automatic]
kubeadm/kubernetes-xenial,now 1.12.7-00 amd64 [installed,upgradable to: 1.14.3-00]
kubectl/kubernetes-xenial,now 1.12.7-00 amd64 [installed,upgradable to: 1.14.3-00]
kubelet/kubernetes-xenial,now 1.12.7-00 amd64 [installed,upgradable to: 1.14.3-00]
kubernetes-cni/kubernetes-xenial,now 0.7.5-00 amd64 [installed]
# kubelet --version
Kubernetes v1.12.7
# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.7", GitCommit:"6f482974b76db3f1e0f5d24605a9d1d38fad9a2b", GitTreeState:"clean", BuildDate:"2019-03-25T02:49:02Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
On new node:
# apt list --installed | grep kuber
cri-tools/kubernetes-xenial,now 1.12.0-00 amd64 [installed,automatic]
kubeadm/kubernetes-xenial,now 1.12.7-00 amd64 [installed,upgradable to: 1.14.3-00]
kubectl/kubernetes-xenial,now 1.12.7-00 amd64 [installed,upgradable to: 1.14.3-00]
kubelet/kubernetes-xenial,now 1.12.7-00 amd64 [installed,upgradable to: 1.14.3-00]
kubernetes-cni/kubernetes-xenial,now 0.7.5-00 amd64 [installed]
# kubelet --version
Kubernetes v1.12.7
# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.7", GitCommit:"6f482974b76db3f1e0f5d24605a9d1d38fad9a2b", GitTreeState:"clean", BuildDate:"2019-03-25T02:49:02Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
What is the state of the scoped token you are trying to use during this run? Are you sure that it has not expired?
I'm using the token generated by wardroom on the master - it fails during the wardroom node install. If I run `kubeadm token list` immediately on the master, the token is listed and appears valid, and if I run `kubeadm join` on the node manually with that token (all within a minute or so of the initial ansible run), it fails with the same error wardroom got.
Is there a role/rolebinding being misconfigured that's supposed to allow group system:bootstrappers:kubeadm:default-node-token
to access those configmaps?
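For reference, this is roughly the Role/RoleBinding pair that kubeadm 1.12 is supposed to create during init so bootstrap tokens can read the kubelet config. This is reconstructed from memory of kubeadm's defaults, so treat the exact names as approximate and compare against `kubectl -n kube-system get role,rolebinding` on the master:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kubeadm:kubelet-config-1.12
  namespace: kube-system
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["kubelet-config-1.12"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubeadm:kubelet-config-1.12
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kubeadm:kubelet-config-1.12
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:bootstrappers:kubeadm:default-node-token
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes
```

If the Role or RoleBinding is missing (which can happen after a partial upgrade), the "forbidden" error above is exactly what a joining node sees.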
On the master:
root@arcadebackup-clus8-master1-9c250a:~# kubeadm token list
TOKEN TTL EXPIRES USAGES DESCRIPTION EXTRA GROUPS
kghrxz.jtfd32hksthjp7m9 23h 2019-09-04T14:57:01-04:00 authentication,signing <none> system:bootstrappers:kubeadm:default-node-token
On the node:
$ /usr/bin/kubeadm join api.c8....:6443 --token=kghrxz.jtfd32hksthjp7m9 --discovery-token-unsafe-skip-ca-verification --ignore-preflight-errors=all
[preflight] Running pre-flight checks
[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 18.09.2. Latest validated version: 18.06
[discovery] Trying to connect to API Server "api.c8....:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://api.c8....:6443"
[discovery] Cluster info signature and contents are valid and no TLS pinning was specified, will use API Server "api.c8....:6443"
[discovery] Successfully established connection with API Server "api.c8....:6443"
[join] Reading configuration from the cluster...
[join] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
unable to fetch the kubeadm-config ConfigMap: failed to get config map: configmaps "kubeadm-config" is forbidden: User "system:bootstrap:kghrxz" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
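One detail worth noticing in that error: the username `system:bootstrap:kghrxz` is derived from the token ID, i.e. the part of the bootstrap token before the dot. So the failing identity really is the freshly issued token from the listing above, not some stale credential:

```shell
token="kghrxz.jtfd32hksthjp7m9"
# Bootstrap tokens have the form <token-id>.<token-secret>;
# the API server authenticates the join as system:bootstrap:<token-id>.
token_id="${token%%.*}"
echo "system:bootstrap:${token_id}"
```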
/kind bug
What steps did you take and what happened:
Ran the upgrade script to move a 1.11.6 cluster to 1.12.7. The masters failed due to temporary api server unavailability and ansible aborted. `kubectl get nodes` showed that the masters were successfully upgraded, so I tried to re-run the script to make sure all plays were performed; the script now fails on package install.
What did you expect to happen:
Detect that no change is necessary on the masters for stages that were already successful, and only apply the needed changes.
Anything else you would like to add:
Environment:
- branch: 1.12
- /etc/os-release: ubuntu 18.04

@craigtracey