itskoko / kubecfn

Cloudformation based installer for reasonably secure multi-node kubeadm cluster.

Fix kubeadm race #4

Open discordianfish opened 6 years ago

discordianfish commented 6 years ago

Sometimes kubeadm fails, probably because it comes up before etcd has reached quorum (it can be restarted, though).

discordianfish commented 6 years ago

We have kubeadm.service set to run After=etcd-member.service, so it starts once etcd has been started for the first time. However, etcd may fail on its first one or two starts for various reasons (e.g. the SRV record has not been updated yet), which causes kubeadm to fail as well. Since kubeadm.service is Type=oneshot, systemd can't restart it.

Here is the log from a rolling upgrade showing the problem:

Jan 16 11:49:11 ip-172-20-181-150 etcd-wrapper[822]: 2018-01-16 11:49:11.362023 I | embed: listening for client requests on 0.0.0.0:2379
Jan 16 11:49:11 ip-172-20-181-150 etcd-wrapper[822]: 2018-01-16 11:49:11.388040 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-180-32.ec2.internal:2380
Jan 16 11:49:11 ip-172-20-181-150 etcd-wrapper[822]: 2018-01-16 11:49:11.388746 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-182-220.ec2.internal:2380
Jan 16 11:49:11 ip-172-20-181-150 etcd-wrapper[822]: 2018-01-16 11:49:11.411176 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-185-127.ec2.internal:2380
Jan 16 11:49:11 ip-172-20-181-150 etcd-wrapper[822]: 2018-01-16 11:49:11.442449 C | etcdmain: error setting up initial cluster: cannot find local etcd member "ip-172-20-181-150.ec2.internal" in SRV records
Jan 16 11:49:11 ip-172-20-181-150 systemd[1]: etcd-member.service: Main process exited, code=exited, status=1/FAILURE
Jan 16 11:49:11 ip-172-20-181-150 systemd[1]: Failed to start etcd (System Application Container).
Jan 16 11:49:11 ip-172-20-181-150 systemd[1]: etcd-member.service: Unit entered failed state.
Jan 16 11:49:11 ip-172-20-181-150 systemd[1]: etcd-member.service: Failed with result 'exit-code'.
Jan 16 11:49:11 ip-172-20-181-150 systemd[1]: Starting Kubeadm init...
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]: [kubeadm] WARNING: kubeadm is currently in beta
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]: [init] Using Kubernetes version: v1.8.4
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]: [init] Using Authorization modes: [Node RBAC]
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]: [init] WARNING: For cloudprovider integrations to work --cloud-provider must be set for all kubelets in the cluster.
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]:         (/etc/systemd/system/kubelet.service.d/10-kubeadm.conf should be edited for this purpose)
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]: [preflight] Running pre-flight checks.
Jan 16 11:49:15 ip-172-20-181-150 kubeadm[858]:         [WARNING KubeletVersion]: couldn't get kubelet version: exec: "kubelet": executable file not found in $PATH
Jan 16 11:49:15 ip-172-20-181-150 kubeadm[858]:         [WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'
Jan 16 11:49:15 ip-172-20-181-150 kubeadm[858]:         [WARNING FileExisting-socat]: socat not found in system path
Jan 16 11:49:15 ip-172-20-181-150 kubeadm[858]:         [WARNING FileExisting-crictl]: crictl not found in system path
Jan 16 11:49:15 ip-172-20-181-150 kubeadm[858]:         [WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
Jan 16 11:49:22 ip-172-20-181-150 systemd[1]: etcd-member.service: Service hold-off time over, scheduling restart.
Jan 16 11:49:22 ip-172-20-181-150 systemd[1]: Stopped etcd (System Application Container).
Jan 16 11:49:22 ip-172-20-181-150 systemd[1]: Starting etcd (System Application Container)...
Jan 16 11:49:22 ip-172-20-181-150 rkt[974]: rm: unable to resolve UUID from file: open /var/lib/coreos/etcd-member-wrapper.uuid: no such file or directory
Jan 16 11:49:22 ip-172-20-181-150 rkt[974]: rm: failed to remove one or more pods
Jan 16 11:49:22 ip-172-20-181-150 etcd-member-add[984]: Adding ourself to cluster
Jan 16 11:49:24 ip-172-20-181-150 etcd-member-add[984]: client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://ip-172-20-185-127.ec2.internal.:2379 has no leader
Jan 16 11:49:24 ip-172-20-181-150 etcd-member-add[984]: ; error #1: client: etcd member https://ip-172-20-182-220.ec2.internal.:2379 has no leader
Jan 16 11:49:24 ip-172-20-181-150 etcd-member-add[984]: ; error #2: client: endpoint https://ip-172-20-180-32.ec2.internal.:2379 exceeded header timeout
Jan 16 11:49:24 ip-172-20-181-150 etcd-wrapper[992]: ++ id -u etcd
Jan 16 11:49:24 ip-172-20-181-150 etcd-wrapper[992]: + exec /usr/bin/rkt run --volume etcd-ssl,kind=host,source=/etc/ssl/etcd --mount volume=etcd-ssl,target=/etc/ssl/etcd --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.520919 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.521409 I | pkg/flags: recognized and used environment variable ETCD_DISCOVERY_SRV=int2.example.com
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.521669 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.521911 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=int2.example.com
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.522186 I | pkg/flags: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.522432 W | pkg/flags: unrecognized environment variable ETCD_USER=etcd
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.522666 W | pkg/flags: unrecognized environment variable ETCD_IMAGE_TAG=v3.1.10
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.522916 I | etcdmain: etcd Version: 3.1.10
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.523144 I | etcdmain: Git SHA: 0520cb9
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.523368 I | etcdmain: Go Version: go1.8.3
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.523618 I | etcdmain: Go OS/Arch: linux/amd64
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.523841 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.524106 I | embed: peerTLS: cert = /etc/ssl/etcd/peer.crt, key = /etc/ssl/etcd/peer.key, ca = , trusted-ca = /etc/ssl/etcd/ca.crt, client-cert-auth = true
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.525221 I | embed: listening for peers on https://172.20.181.150:2380
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.525527 I | embed: listening for client requests on 0.0.0.0:2379
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.552553 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-185-127.ec2.internal:2380
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.554191 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-180-32.ec2.internal:2380
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.555779 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-182-220.ec2.internal:2380
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.557213 C | etcdmain: error setting up initial cluster: cannot find local etcd member "ip-172-20-181-150.ec2.internal" in SRV records
Jan 16 11:49:25 ip-172-20-181-150 systemd[1]: etcd-member.service: Main process exited, code=exited, status=1/FAILURE
Jan 16 11:49:25 ip-172-20-181-150 systemd[1]: Failed to start etcd (System Application Container).
Jan 16 11:49:25 ip-172-20-181-150 systemd[1]: etcd-member.service: Unit entered failed state.
Jan 16 11:49:25 ip-172-20-181-150 systemd[1]: etcd-member.service: Failed with result 'exit-code'.
Jan 16 11:49:30 ip-172-20-181-150 kubeadm[858]: [preflight] Some fatal errors occurred:
Jan 16 11:49:30 ip-172-20-181-150 kubeadm[858]:         [ERROR ExternalEtcdVersion]: couldn't parse external etcd version "": Version string empty
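
In the log above, kubeadm's preflight check fails with an empty etcd version because etcd had not come up yet. One possible way around the race is to gate kubeadm on an explicit readiness check instead of relying on unit ordering alone. The following is only a rough sketch of that idea, not what kubecfn does: `wait_until` and `etcd_is_healthy` are hypothetical helper names, and the URL, timeout and interval are placeholder assumptions. The cluster in the log serves etcd over TLS, so a real probe would also need the client certificates.

```python
import json
import time
import urllib.request
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it returns True or `timeout` seconds have elapsed.

    Returns True if the probe succeeded, False if the deadline passed first.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Any probe error (connection refused, timeout, bad JSON)
            # is treated as "not ready yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval)


def etcd_is_healthy(url: str = "http://127.0.0.1:2379/health") -> bool:
    """Hypothetical probe: ask etcd's /health endpoint whether it is healthy.

    Assumption: plain HTTP on localhost. A TLS-only cluster needs an
    ssl.SSLContext with client certificates passed to urlopen().
    """
    with urllib.request.urlopen(url, timeout=2) as resp:
        body = json.load(resp)
    # Accept both a JSON string "true" and a JSON boolean true.
    return str(body.get("health")).lower() == "true"
```

A wrapper script could call `wait_until(etcd_is_healthy)` and only run `kubeadm init` when it returns True, exiting non-zero otherwise so the failure stays visible in the journal.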