kinvolk / kube-spawn

A tool for creating multi-node Kubernetes clusters on a Linux machine using kubeadm & systemd-nspawn. Brought to you by the Kinvolk team.
https://kinvolk.io
Apache License 2.0

fails to start with a timeout with Kubernetes 1.11 #282

Open · alban opened this issue 6 years ago

alban commented 6 years ago

To Reproduce:

Workarounds

Disable SELinux enforcement:

sudo setenforce 0
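As an aside (not part of the original workaround): setenforce 0 only switches SELinux to permissive mode until the next reboot. Making the change persistent would typically mean editing /etc/selinux/config, for example:

# assumption: permanently switch SELinux from enforcing to permissive
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config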

Install dependencies
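The commands below reference $KUBERNETES_VERSION and $KUBE_SPAWN_VERSION without defining them; presumably they were exported beforehand, roughly like this (the values are assumptions, based on the issue title and the kubeadm log further down, not part of the original report):

export KUBERNETES_VERSION=v1.11.0   # assumed from the issue title and the kubeadm init output
export KUBE_SPAWN_VERSION=master    # assumption: any kube-spawn tag or branch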

sudo dnf install -y btrfs-progs git go iptables libselinux-utils polkit qemu-img systemd-container make docker
mkdir go
export GOPATH=$HOME/go
curl -fsSL -O https://github.com/containernetworking/plugins/releases/download/v0.6.0/cni-plugins-amd64-v0.6.0.tgz
sudo mkdir -p /opt/cni/bin
sudo tar -C /opt/cni/bin -xvf cni-plugins-amd64-v0.6.0.tgz
sudo curl -Lo /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${KUBERNETES_VERSION}/bin/linux/amd64/kubectl
sudo chmod +x /usr/local/bin/kubectl

Compile and install

mkdir -p $GOPATH/src/github.com/kinvolk
cd $GOPATH/src/github.com/kinvolk
git clone https://github.com/kinvolk/kube-spawn.git
cd kube-spawn/
git checkout $KUBE_SPAWN_VERSION
make DOCKERIZED=n
sudo make install

First attempt to use kube-spawn

cd
sudo -E kube-spawn create --kubernetes-version $KUBERNETES_VERSION
sudo -E kube-spawn start --nodes=3
sudo -E kube-spawn destroy

Workaround for "no space left on device": https://github.com/kinvolk/kube-spawn/issues/281

sudo umount /var/lib/machines
sudo qemu-img resize -f raw /var/lib/machines.raw $((10*1024*1024*1024))
sudo mount -t btrfs -o loop /var/lib/machines.raw /var/lib/machines
sudo btrfs filesystem resize max /var/lib/machines
sudo btrfs quota disable /var/lib/machines
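To confirm the pool actually grew (a quick check that is not part of the original workaround), the mount can be inspected afterwards; the reporter's df output further down shows the expected 10G pool:

df -h /var/lib/machines                       # should now show the resized pool
sudo btrfs filesystem show /var/lib/machines  # optional: show the underlying btrfs device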

Start kube-spawn

cd
sudo -E kube-spawn create --kubernetes-version $KUBERNETES_VERSION
sudo -E kube-spawn start --nodes=3


Then the error message:

Download of https://alpha.release.flatcar-linux.net/amd64-usr/current/flatcar_developer_container.bin.bz2 complete.
Created new local image 'flatcar'.
Operation completed successfully. Exiting.
nf_conntrack module is not loaded: stat /sys/module/nf_conntrack/parameters/hashsize: no such file or directory
Warning: nf_conntrack module is not loaded.
loading nf_conntrack module...
making iptables FORWARD chain defaults to ACCEPT...
setting iptables rule to allow CNI traffic...
Starting 3 nodes in cluster default ...
Waiting for machine kube-spawn-default-worker-fjxan9 to start up ...
Waiting for machine kube-spawn-default-master-5y7clq to start up ...
Waiting for machine kube-spawn-default-worker-2ujr2f to start up ...
Started kube-spawn-default-worker-2ujr2f
Bootstrapping kube-spawn-default-worker-2ujr2f ...
Started kube-spawn-default-master-5y7clq
Bootstrapping kube-spawn-default-master-5y7clq ...
Cluster "default" started
Failed to start machine kube-spawn-default-worker-fjxan9: timeout waiting for "kube-spawn-default-worker-fjxan9" to start
Note: kubeadm init can take several minutes

master-5y7clq
I0630 14:22:29.999557 380 feature_gate.go:230] feature gates: &{map[]}
[init] using Kubernetes version: v1.11.0
[preflight] running pre-flight checks
  [WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
  [WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
  [WARNING FileExisting-crictl]: crictl not found in system path
I0630 14:22:30.050775 380 kernel_validator.go:81] Validating kernel version
I0630 14:22:30.051083 380 kernel_validator.go:96] Validating kernel config
  [WARNING SystemVerification]: docker version is greater than the most recently validated version. Docker version: 18.05.0-ce. Max validated version: 17.03
  [WARNING Hostname]: hostname "kube-spawn-default-master-5y7clq" could not be reached
  [WARNING Hostname]: hostname "kube-spawn-default-master-5y7clq" lookup kube-spawn-default-master-5y7clq on 8.8.8.8:53: no such host
[preflight/images] Pulling images required for setting up a Kubernetes cluster
[preflight/images] This might take a minute or two, depending on the speed of your internet connection
[preflight/images] You can also perform this action in beforehand using 'kubeadm config images pull'
[kubelet] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[preflight] Activating the kubelet service
[certificates] Generated ca certificate and key.
[certificates] Generated apiserver certificate and key.
[certificates] apiserver serving cert is signed for DNS names [kube-spawn-default-master-5y7clq kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.22.0.3]
[certificates] Generated apiserver-kubelet-client certificate and key.
[certificates] Generated sa key and public key.
[certificates] Generated front-proxy-ca certificate and key.
[certificates] Generated front-proxy-client certificate and key.
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [kube-spawn-default-master-5y7clq localhost] and IPs [127.0.0.1 ::1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [kube-spawn-default-master-5y7clq localhost] and IPs [10.22.0.3 127.0.0.1 ::1]
[certificates] Generated etcd/healthcheck-client certificate and key.
[certificates] Generated apiserver-etcd-client certificate and key.
[certificates] valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/scheduler.conf"
[controlplane] wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
[controlplane] wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
[controlplane] wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
[init] waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests"
[init] this might take a minute or longer if the control plane images have to be pulled
[apiclient] All control plane components are healthy after 42.001677 seconds
[uploadconfig] storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.11" in namespace kube-system with the configuration for the kubelets in the cluster
[markmaster] Marking the node kube-spawn-default-master-5y7clq as master by adding the label "node-role.kubernetes.io/master=''"
[markmaster] Marking the node kube-spawn-default-master-5y7clq as master by adding the taints [node-role.kubernetes.io/master:NoSchedule]
[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "kube-spawn-default-master-5y7clq" as an annotation
[bootstraptoken] using token: 1o71nu.v7s48wncryhbdmm7
[bootstraptoken] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstraptoken] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstraptoken] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstraptoken] creating the "cluster-info" ConfigMap in the "kube-public" namespace
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of machines by running the following on each node as root:

  kubeadm join 10.22.0.3:6443 --token 1o71nu.v7s48wncryhbdmm7 --discovery-token-ca-cert-hash sha256:c8ac2337adc7ed01725bed7d78605661dc759257fce213838f1cb89486fe263c

I0630 14:23:47.569329 1140 feature_gate.go:230] feature gates: &{map[]}
aaaaaa.bbbbbbbbbbbbbbbb
serviceaccount/weave-net created
clusterrole.rbac.authorization.k8s.io/weave-net created
clusterrolebinding.rbac.authorization.k8s.io/weave-net created
daemonset.extensions/weave-net created

worker-2ujr2f
[preflight] running pre-flight checks
  [WARNING RequiredIPVSKernelModulesAvailable]: the IPVS proxier will not be used, because the following required kernel modules are not loaded: [ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh] or no builtin kernel ipvs support: map[ip_vs:{} ip_vs_rr:{} ip_vs_wrr:{} ip_vs_sh:{} nf_conntrack_ipv4:{}]
you can solve this problem with following methods:

  1. Run 'modprobe -- ' to load missing kernel modules;
  2. Provide the missing builtin kernel ipvs support
  [WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
  [WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
  [WARNING FileExisting-crictl]: crictl not found in system path
I0630 14:23:49.919029 449 kernel_validator.go:81] Validating kernel version
I0630 14:23:49.919338 449 kernel_validator.go:96] Validating kernel config
  [WARNING SystemVerification]: docker version is greater than the most recently validated version. Docker version: 18.05.0-ce. Max validated version: 17.03
  [WARNING Hostname]: hostname "kube-spawn-default-worker-2ujr2f" could not be reached
  [WARNING Hostname]: hostname "kube-spawn-default-worker-2ujr2f" lookup kube-spawn-default-worker-2ujr2f on 8.8.8.8:53: no such host
[discovery] Trying to connect to API Server "10.22.0.3:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.22.0.3:6443"
[discovery] Failed to connect to API Server "10.22.0.3:6443": token id "aaaaaa" is invalid for this cluster or it has expired. Use "kubeadm token create" on the master node to creating a new valid token
[discovery] Trying to connect to API Server "10.22.0.3:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.22.0.3:6443"
[discovery] Cluster info signature and contents are valid and no TLS pinning was specified, will use API Server "10.22.0.3:6443"
[discovery] Successfully established connection with API Server "10.22.0.3:6443"
[kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.11" ConfigMap in the kube-system namespace
[kubelet] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[preflight] Activating the kubelet service
[tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "kube-spawn-default-worker-2ujr2f" as an annotation

This node has joined the cluster:
    • Certificate signing request was sent to master and a response was received.
    • The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the master to see this node join the cluster.

Failed to start cluster: provisioning the worker nodes with kubeadm didn't succeed

More debug info:

$ kubectl get nodes
NAME                               STATUS    ROLES     AGE       VERSION
kube-spawn-default-master-5y7clq   Ready     master    1m        v1.11.0
kube-spawn-default-worker-2ujr2f   Ready     <none>    46s       v1.11.0
$ machinectl 
MACHINE                          CLASS     SERVICE        OS      VERSION  ADDRESSES
kube-spawn-default-master-5y7clq container systemd-nspawn flatcar 1814.0.0 10.22.0.3...
kube-spawn-default-worker-2ujr2f container systemd-nspawn flatcar 1814.0.0 10.22.0.2...

2 machines listed.
$ df -h /var/lib/machines
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0       10G  1.7G  7.8G  18% /var/lib/machines

The third machine does not exist anymore?
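As an aside (not from the original report), one way to dig into why the third machine vanished would be to check the machine list and the host journal right after the failure; the machine name here is simply the one from the log above:

machinectl list
sudo journalctl -b --no-pager | grep kube-spawn-default-worker-fjxan9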

alban commented 6 years ago

After a second attempt, it works.

arcolife commented 5 years ago

I get this timeout just as @alban described, except it's reproducible every time.

$ kube-spawn start
Warning: kube-proxy could crash due to insufficient nf_conntrack hashsize.
setting nf_conntrack hashsize to 131072... 
making iptables FORWARD chain defaults to ACCEPT...
new poolSize to be : 5490739200
Starting 3 nodes in cluster default ...
Waiting for machine kube-spawn-default-worker-naz6fc to start up ...
Waiting for machine kube-spawn-default-master-yz3twq to start up ...
Waiting for machine kube-spawn-default-worker-u5fu6n to start up ...
Failed to start machine kube-spawn-default-master-yz3twq: timeout waiting for "kube-spawn-default-master-yz3twq" to start
Failed to start machine kube-spawn-default-worker-naz6fc: timeout waiting for "kube-spawn-default-worker-naz6fc" to start
Failed to start cluster: starting the cluster didn't succeed

Note:

  1. I face the same timeout issue regardless of whether I destroy the cluster and start again, or mount a freshly formatted btrfs volume and redo it.
  2. The first time I launched kube-spawn it was with a manually formatted and mounted btrfs volume, and that's when it complained that "machine.raw" was not found. So I unmounted and re-ran, and systemd-nspawn did its job and created a machine.raw. When I re-spawned the cluster afterwards it obviously no longer complained about the .raw file, but it timed out regardless.
  3. Even though I've been through the troubleshooting.md guide, SELinux has been a pain; I've had to create about a dozen policies and semanage it all (roughly the workflow sketched after this list). Not the cake I was digging. pfft
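For context (an editorial sketch, not something posted in the thread; the module name is made up), a typical way to turn such SELinux denials into local policy modules is:

sudo ausearch -m avc -ts recent | audit2allow -M kube_spawn_local   # build a local policy module from recent AVC denials
sudo semodule -i kube_spawn_local.pp                                # install the generated module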

For debugging, is there any place this thing logs itself to?


Or, with the manually formatted btrfs partition mentioned in the notes above mounted at /var/lib/machines instead of the loopback:

/dev/sda4 btrfs 56G 1.7G 54G 4% /var/lib/machines

- `systemd-container-238-10.git438ac26.fc28.x86_64`
- `qemu-img-2.11.2-4.fc28.x86_64`
- machinectl limit set to 40G with the loopback mount (as evident in the df output above too; see the set-limit note after this list):

$ machinectl show
PoolPath=/var/lib/machines
PoolUsage=1866190848
PoolLimit=42949672960


- OS: `Linux 4.18.17-200.fc28.x86_64 GNU/Linux`
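
For reference (an editorial note, not from the comment), a 40G pool limit like the one shown above would typically have been set with machinectl's set-limit verb:

sudo machinectl set-limit 40G   # sets the overall /var/lib/machines pool limit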
arcolife commented 5 years ago

OK, never mind.

All I had to do was the following (combined into one sequence after the list):

  1. export KUBERNETES_VERSION=v1.12.0 (I hadn't done this earlier, before the create step)
  2. kube-spawn destroy
  3. kube-spawn create (this time it populated /var/lib/kube-spawn/clusters; earlier it had only left an empty trail of subdirectories)
  4. kube-spawn start

and it works. jeez
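Put together (keeping the sudo -E invocation and the --nodes flag from the original report's commands, which is an assumption about how the commenter actually ran them), the working sequence is roughly:

export KUBERNETES_VERSION=v1.12.0
sudo -E kube-spawn destroy
sudo -E kube-spawn create --kubernetes-version $KUBERNETES_VERSION
sudo -E kube-spawn start --nodes=3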

krnowak commented 5 years ago

Seems to be related to #325.

arcolife commented 5 years ago

> Seems to be related to #325.

Sure, except I didn't destroy it first. I got the timeout from start as per https://github.com/kinvolk/kube-spawn/issues/282#issuecomment-437786972 (that is, right after creating the cluster), then resolved the issue with https://github.com/kinvolk/kube-spawn/issues/282#issuecomment-437790311.

Apologies if the order in step 2 of the resolution comment created any confusion.

Also, I can't reproduce it now. :/