kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Cluster doesn't start when using Calico and etcd v3 #4039

Closed: tgrosinger closed this issue 6 years ago

tgrosinger commented 6 years ago
  1. What kops version are you running? 1.8.0
  2. What Kubernetes version are you running? 1.8.4
  3. What cloud provider are you using? AWS
  4. What commands did you run? What is the simplest way to reproduce this issue?
kops create cluster ... # See the cluster manifest below
kops update cluster --yes
  5. What happened after the commands executed? What did you expect to happen?

When the YAML below is used to create a cluster, the pods in the cluster end up in the following state:

core@ip-172-20-34-39 ~ $ kubectl --namespace kube-system get pods
NAME                                                                  READY     STATUS              RESTARTS   AGE
calico-kube-controllers-6b5f557d7d-pg5vn                              1/1       Running             0          4m
calico-node-7tlbf                                                     1/2       CrashLoopBackOff    5          4m
calico-node-dsb7g                                                     1/2       CrashLoopBackOff    4          2m
calico-node-twcqr                                                     1/2       CrashLoopBackOff    4          2m
calico-node-v4jhk                                                     1/2       CrashLoopBackOff    5          4m
calico-node-w7p2r                                                     1/2       CrashLoopBackOff    5          4m
calico-node-wxl56                                                     1/2       CrashLoopBackOff    3          2m
dns-controller-65f86fb6cf-9gjsr                                       1/1       Running             0          4m
etcd-server-events-ip-172-20-34-39.us-west-2.compute.internal         1/1       Running             0          4m
etcd-server-events-ip-172-20-34-50.us-west-2.compute.internal         1/1       Running             0          3m
etcd-server-events-ip-172-20-57-230.us-west-2.compute.internal        1/1       Running             0          3m
etcd-server-ip-172-20-34-39.us-west-2.compute.internal                1/1       Running             0          3m
etcd-server-ip-172-20-34-50.us-west-2.compute.internal                1/1       Running             0          3m
etcd-server-ip-172-20-57-230.us-west-2.compute.internal               1/1       Running             0          3m
kube-apiserver-ip-172-20-34-39.us-west-2.compute.internal             1/1       Running             0          3m
kube-apiserver-ip-172-20-34-50.us-west-2.compute.internal             1/1       Running             0          4m
kube-apiserver-ip-172-20-57-230.us-west-2.compute.internal            1/1       Running             0          3m
kube-controller-manager-ip-172-20-34-39.us-west-2.compute.internal    1/1       Running             0          3m
kube-controller-manager-ip-172-20-34-50.us-west-2.compute.internal    1/1       Running             0          4m
kube-controller-manager-ip-172-20-57-230.us-west-2.compute.internal   1/1       Running             0          3m
kube-dns-7f56f9f8c7-87vfn                                             0/3       ContainerCreating   0          4m
kube-dns-autoscaler-f4c47db64-hhr48                                   0/1       ContainerCreating   0          4m
kube-proxy-ip-172-20-32-217.us-west-2.compute.internal                1/1       Running             0          2m
kube-proxy-ip-172-20-34-39.us-west-2.compute.internal                 1/1       Running             0          4m
kube-proxy-ip-172-20-34-50.us-west-2.compute.internal                 1/1       Running             0          4m
kube-proxy-ip-172-20-47-122.us-west-2.compute.internal                1/1       Running             0          1m
kube-proxy-ip-172-20-52-129.us-west-2.compute.internal                1/1       Running             0          1m
kube-proxy-ip-172-20-57-230.us-west-2.compute.internal                1/1       Running             0          4m
kube-scheduler-ip-172-20-34-39.us-west-2.compute.internal             1/1       Running             0          3m
kube-scheduler-ip-172-20-34-50.us-west-2.compute.internal             1/1       Running             0          4m
kube-scheduler-ip-172-20-57-230.us-west-2.compute.internal            1/1       Running             0          3m
core@ip-172-20-34-39 ~ $ kubectl --namespace kube-system logs calico-node-7tlbf calico-node
Skipping datastore connection test
ERROR: Unable to access datastore to query node configuration
Terminating
Calico node failed to start
core@ip-172-20-34-39 ~ $ kubectl --namespace kube-system describe pod kube-dns-7f56f9f8c7-87vfn

...

Events:
  Type     Reason                  Age               From                                                  Message
  ----     ------                  ----              ----                                                  -------
  Warning  FailedScheduling        6m (x3 over 6m)   default-scheduler                                     no nodes available to schedule pods
  Warning  FailedScheduling        6m (x2 over 6m)   default-scheduler                                     No nodes are available that match all of the predicates: NodeNotReady (2), PodToleratesNodeTaints (2).
  Warning  FailedScheduling        6m (x4 over 6m)   default-scheduler                                     No nodes are available that match all of the predicates: NodeNotReady (3), PodToleratesNodeTaints (3).
  Warning  FailedScheduling        5m (x2 over 5m)   default-scheduler                                     No nodes are available that match all of the predicates: Insufficient cpu (3), NodeNotReady (1), PodToleratesNodeTaints (3).
  Warning  FailedScheduling        4m (x2 over 4m)   default-scheduler                                     No nodes are available that match all of the predicates: Insufficient cpu (3), NodeNotReady (3), PodToleratesNodeTaints (3).
  Normal   Scheduled               3m                default-scheduler                                     Successfully assigned kube-dns-7f56f9f8c7-87vfn to ip-172-20-47-122.us-west-2.compute.internal
  Normal   SuccessfulMountVolume   3m                kubelet, ip-172-20-47-122.us-west-2.compute.internal  MountVolume.SetUp succeeded for volume "kube-dns-config"
  Normal   SuccessfulMountVolume   3m                kubelet, ip-172-20-47-122.us-west-2.compute.internal  MountVolume.SetUp succeeded for volume "kube-dns-token-z5bzr"
  Warning  FailedCreatePodSandBox  3m                kubelet, ip-172-20-47-122.us-west-2.compute.internal  Failed create pod sandbox.
  Warning  FailedSync              1m (x11 over 3m)  kubelet, ip-172-20-47-122.us-west-2.compute.internal  Error syncing pod
  Normal   SandboxChanged          1m (x11 over 3m)  kubelet, ip-172-20-47-122.us-west-2.compute.internal  Pod sandbox changed, it will be killed and re-created.

However, when I remove the section that sets etcd to v3, the cluster starts just fine. I have compared the YAML for the three types of pods that are not starting, and other than the expected differences (such as pod names and IP addresses) the definitions are the same.

It seems that starting a cluster with Calico networking and etcd v3 does not currently work.
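The error above indicates that calico-node cannot reach its etcd datastore at all. One way to see which endpoints it was configured with is to inspect the addon's ConfigMap and DaemonSet; this is only a diagnostic sketch, and the calico-config ConfigMap name assumes the stock kops Calico addon:

# Show the etcd endpoints the kops Calico addon rendered into its config
kubectl --namespace kube-system get configmap calico-config -o yaml

# Look for ETCD_* environment variables on the calico-node DaemonSet
kubectl --namespace kube-system get daemonset calico-node -o yaml | grep -i -A1 etcd

# If the pod keeps restarting, check the previous container's logs as well
kubectl --namespace kube-system logs calico-node-7tlbf calico-node --previous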

  6. Please provide your cluster manifest.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-12-11T17:23:31Z
  name: my-cluster
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Project: hopcloud-cluster
  cloudProvider: aws
  configBase: s3://my-state-store/my-cluster
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-west-2a-1
      name: master-us-west-2a-1
      encryptedVolume: true
    - instanceGroup: master-us-west-2a-2
      name: master-us-west-2a-2
      encryptedVolume: true
    - instanceGroup: master-us-west-2a-3
      name: master-us-west-2a-3
      encryptedVolume: true
    enableEtcdTLS: true
    name: main
    version: 3.0.17
  - etcdMembers:
    - instanceGroup: master-us-west-2a-1
      name: master-us-west-2a-1
      encryptedVolume: true
    - instanceGroup: master-us-west-2a-2
      name: master-us-west-2a-2
      encryptedVolume: true
    - instanceGroup: master-us-west-2a-3
      name: master-us-west-2a-3
      encryptedVolume: true
    enableEtcdTLS: true
    name: events
    version: 3.0.17
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - 1.2.3.4/28
  kubernetesVersion: 1.8.4
  masterPublicName: api.my-cluster
  networkCIDR: 172.20.0.0/16
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 1.2.3.4/28
  subnets:
  - cidr: 172.20.32.0/19
    name: us-west-2a
    type: Private
    zone: us-west-2a
  - cidr: 172.20.0.0/22
    name: utility-us-west-2a
    type: Utility
    zone: us-west-2a
  topology:
    bastion:
      bastionPublicName: bastion.my-cluster
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-12-11T17:23:31Z
  labels:
    kops.k8s.io/cluster: my-cluster
  name: bastions
spec:
  image: 595879546273/CoreOS-stable-1576.4.0-hvm
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: bastions
  role: Bastion
  subnets:
  - utility-us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-12-11T17:23:31Z
  labels:
    kops.k8s.io/cluster: my-cluster
  name: master-us-west-2a-1
spec:
  image: 595879546273/CoreOS-stable-1576.4.0-hvm
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a-1
  role: Master
  subnets:
  - us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-12-11T17:23:31Z
  labels:
    kops.k8s.io/cluster: my-cluster
  name: master-us-west-2a-2
spec:
  image: 595879546273/CoreOS-stable-1576.4.0-hvm
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a-2
  role: Master
  subnets:
  - us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-12-11T17:23:31Z
  labels:
    kops.k8s.io/cluster: my-cluster
  name: master-us-west-2a-3
spec:
  image: 595879546273/CoreOS-stable-1576.4.0-hvm
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a-3
  role: Master
  subnets:
  - us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-12-11T17:23:31Z
  labels:
    kops.k8s.io/cluster: my-cluster
  name: nodes
spec:
  image: 595879546273/CoreOS-stable-1576.4.0-hvm
  machineType: t2.small
  maxSize: 3
  minSize: 3
  role: Node
  subnets:
  - us-west-2a
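
For reference, the working variant mentioned above is the same manifest with the version and enableEtcdTLS fields removed from both etcdClusters entries, i.e. roughly (abridged to the main cluster; the events cluster is trimmed the same way):

  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-west-2a-1
      name: master-us-west-2a-1
      encryptedVolume: true
    - instanceGroup: master-us-west-2a-2
      name: master-us-west-2a-2
      encryptedVolume: true
    - instanceGroup: master-us-west-2a-3
      name: master-us-west-2a-3
      encryptedVolume: true
    name: main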

tgrosinger commented 6 years ago

The etcd configuration in the cluster manifest was copied from the documentation updates made in 7c2ce19 by @gambol99

chrislovecnm commented 6 years ago

/cc @blakebarnett @caseydavenport

blakebarnett commented 6 years ago

I don't see anything about etcdv3 in here... ?

tgrosinger commented 6 years ago

Oops, I posted the working version of the manifest. I updated the original post with the non-working version that is configured for etcdv3.

blakebarnett commented 6 years ago

Ok, this is why: TLS is not supported. That's why I mentioned it in the Calico upgrade PR ;)

caseydavenport commented 6 years ago

Calico supports TLS to etcd, so naively we could just add that to the manifest? I'm a bit distant from the kops nitty-gritty details though. Are the certs etc. available through k8s secrets?
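For what it's worth, Calico's own hosted install manifests wire the TLS material in through environment variables on the calico-node container plus a mounted secret. A rough sketch of what that might look like here, where the secret name, mount path, and file names are assumptions rather than anything kops creates today:

        env:
        - name: ETCD_CA_CERT_FILE
          value: /calico-secrets/etcd-ca.crt
        - name: ETCD_CERT_FILE
          value: /calico-secrets/etcd-client.crt
        - name: ETCD_KEY_FILE
          value: /calico-secrets/etcd-client.key
        volumeMounts:
        - name: etcd-certs
          mountPath: /calico-secrets
          readOnly: true
      volumes:
      - name: etcd-certs
        secret:
          secretName: calico-etcd-secrets   # hypothetical secret holding the etcd client cert/key

That only works if the etcd client cert and key are actually published somewhere the pod can mount them, which is exactly the open question above.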

blakebarnett commented 6 years ago

Yeah, you could modify the YAML in the state store S3 bucket and it'd work until the next kops update.
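Concretely, that means pulling the rendered addon manifest out of the state store, editing it, and pushing it back. A sketch using the bucket and cluster name from the manifest above; the exact file name under the addon directory is a placeholder, and kops will overwrite it again on the next kops update cluster:

# List the Calico addon manifests kops rendered into the state store
aws s3 ls s3://my-state-store/my-cluster/addons/networking.projectcalico.org/

# Download one, edit it to add the etcd TLS settings, then upload it back
aws s3 cp s3://my-state-store/my-cluster/addons/networking.projectcalico.org/<manifest>.yaml calico.yaml
aws s3 cp calico.yaml s3://my-state-store/my-cluster/addons/networking.projectcalico.org/<manifest>.yaml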

tgrosinger commented 6 years ago

> Ok, this is why: TLS is not supported. That's why I mentioned it in the Calico upgrade PR ;)

Oh I missed that. I removed that setting and things seem to be coming up correctly. Let me do a little more validation and then I will close this issue. Thank you!
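In case it helps anyone else hitting this: removing the setting is just the normal kops edit/update flow, sketched below (assuming KOPS_STATE_STORE is set). Note that on a cluster whose etcd already holds data, changing the etcd version in place is not a safe operation, so recreating the cluster may be simpler.

kops edit cluster my-cluster            # delete the enableEtcdTLS: true and version: 3.0.17 lines
kops update cluster my-cluster --yes
kops rolling-update cluster my-cluster --yes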

martianturkey commented 6 years ago

Hi, I've hit the exact same issue after updating the YAML to use etcd v3 and enableEtcdTLS: true as per https://github.com/kubernetes/kops/blob/master/docs/cluster_spec.md

@blakebarnett, you mentioned modifying the YAML to allow Calico to work with TLS. Within addons/networking.projectcalico.org?

elisiano commented 6 years ago

I'm facing the same issue. If someone could give some sort of explanation of how to get this working (if it is possible at all), that would be greatly appreciated.

chrislovecnm commented 6 years ago

Calico and TLS together are not supported at this time. You can open a feature request if you like.