poseidon / typhoon

Minimal and free Kubernetes distribution with Terraform
https://typhoon.psdn.io/
MIT License

kube-flannel going into CrashLoop #129

Closed balchua closed 6 years ago

balchua commented 6 years ago

Bug

Kube-flannel goes into a crash loop. This does not happen on all flannel pods. The error is "unknown host"; it seems the pod could not contact the API server.

Environment

Problem

When I bootstrap a Kubernetes cluster on Digital Ocean, kube-flannel goes into a crash loop. This does not happen on all pods. For example, I bootstrapped 4 worker nodes and either 1 or 2 pods went into a crash loop. The error reported is "unknown host" when it tries to connect to the API server.

Due to this, the nginx addon does not work anymore. The other pods' status is Running.
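
A rough way to see which flannel pods are crash looping and to pull their logs, as a sketch (the pod name is a placeholder and will differ per cluster):

kubectl -n kube-system get pods -o wide | grep flannel
kubectl -n kube-system logs <kube-flannel-pod>
kubectl -n kube-system describe pod <kube-flannel-pod>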

Desired Behavior

Flannel pods in a stable, Running state.


Steps to Reproduce

I simply followed the steps described in the Digital Ocean docs for Typhoon.

dghubble commented 6 years ago

You'll have to do a bit more debugging. The DO test clusters for v1.9.3 don't reproduce this problem.

One thing to pay attention to on DO is that the default VMs are quite small compared with other cloud providers. Be sure this isn't related to resource contention.
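
To rule that out, a quick sketch for spotting resource pressure (kubectl top needs a metrics add-on such as heapster or metrics-server, which may not be installed):

kubectl describe nodes | grep -A 6 'Allocated resources'
kubectl top nodes
# or on the droplet itself over SSH:
free -m
uptime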

balchua commented 6 years ago

Sorry, I'm quite new to this. I will try bumping up the size of the droplet.
The strange thing, though, is that I haven't really put any app on it yet, just the basic components. I tried s-2vcpu-4gb for both the controller and the workers.

I didn't experience this strange behavior while using 1.8.6 (I think). Another weird thing is that sometimes when I do kubectl get nodes, I don't see all the nodes listed, but I can see flannel pods on all workers. Have you experienced such behavior too?
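
As a sketch, one way to compare which nodes actually registered against where the flannel pods landed:

kubectl get nodes -o wide
kubectl -n kube-system get pods -o wide | grep flannel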

dghubble commented 6 years ago

No, I don't see this. Can you post your module definition? Is it specific to certain regions (I see a few are under maintenance)? Is your DNS setup OK (you don't have some insanely long TTL that still points to something else)? Using coreos-stable? There is not enough info here to provide much real help.

Also please post the (formatted) flannel logs.
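
For example, a sketch (the pod name is a placeholder; --previous pulls the logs of the last crashed container):

kubectl -n kube-system logs <kube-flannel-pod> --previous
# or directly on the worker node:
docker ps -a | grep flannel
docker logs <flannel-container-id>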

balchua commented 6 years ago

Thanks for getting back on this one. Awesome work, by the way. Yes, I am using coreos-stable.

Here are the flannel logs; these are all I can grab directly from the docker logs.

{"log":"I0216` 01:16:07.561144       1 main.go:416] Searching for interface using 10.130.51.129\n","stream":"stderr","time":"2018-02-16T01:16:07.561601825Z"}
{"log":"I0216 01:16:07.561937       1 main.go:487] Using interface with name eth1 and address 10.130.51.129\n","stream":"stderr","time":"2018-02-16T01:16:07.562230174Z"}
{"log":"I0216 01:16:07.562070       1 main.go:504] Defaulting external address to interface address (10.130.51.129)\n","stream":"stderr","time":"2018-02-16T01:16:07.562254728Z"}
{"log":"E0216 01:16:10.662655       1 main.go:231] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-bzgks': Get https://10.3.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-bzgks: dial tcp 10.3.0.1:443: getsockopt: no route to host\n","stream":"stderr","time":"2018-02-16T01:16:10.667137665Z"}

Further logs from journalctl:

Feb 16 01:24:15 btc-worker-1 kubelet-wrapper[1248]: I0216 01:24:15.866831    1248 kuberuntime_manager.go:503] Container {Name:kube-flannel Image:quay.io/coreos/flannel:v0.9.1-amd64 Command:[/opt/bin/flanneld --ip-masq --kube-subnet-mgr --iface=$(POD_IP)] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:POD_NAME Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_NAMESPACE Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:run ReadOnly:false MountPath:/run SubPath: MountPropagation:<nil>} {Name:cni ReadOnly:false MountPath:/etc/cni/net.d SubPath: MountPropagation:<nil>} {Name:flannel-cfg ReadOnly:false MountPath:/etc/kube-flannel/ SubPath: MountPropagation:<nil>} {Name:default-token-vkvsc ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Feb 16 01:24:15 btc-worker-1 kubelet-wrapper[1248]: I0216 01:24:15.867137    1248 kuberuntime_manager.go:742] checking backoff for container "kube-flannel" in pod "kube-flannel-bzgks_kube-system(8939894d-12b4-11e8-8fc1-26d06e90ea57)"
Feb 16 01:24:15 btc-worker-1 kubelet-wrapper[1248]: I0216 01:24:15.867293    1248 kuberuntime_manager.go:752] Back-off 5m0s restarting failed container=kube-flannel pod=kube-flannel-bzgks_kube-system(8939894d-12b4-11e8-8fc1-26d06e90ea57)
Feb 16 01:24:15 btc-worker-1 kubelet-wrapper[1248]: E0216 01:24:15.867334    1248 pod_workers.go:182] Error syncing pod 8939894d-12b4-11e8-8fc1-26d06e90ea57 ("kube-flannel-bzgks_kube-system(8939894d-12b4-11e8-8fc1-26d06e90ea57)"), skipping: failed to "StartContainer" for "kube-flannel" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-flannel pod=kube-flannel-bzgks_kube-system(8939894d-12b4-11e8-8fc1-26d06e90ea57)"

Module definition

module "digital-ocean-btc" {
  source = "git::https://github.com/poseidon/typhoon//digital-ocean/container-linux/kubernetes?ref=v1.9.3"

  region = "sgp1"
  dns_zone = "geek.per.sg"

  cluster_name = "btc"
  image = "coreos-stable"
  controller_count = 1
  controller_type = "s-4vcpu-8gb"
  worker_count = 3
  worker_type = "s-4vcpu-8gb"
  ssh_fingerprints = ["${var.digitalocean_ssh_fingerprint}"]

  # output assets dir
  asset_dir = "/home/thor/.secrets/clusters/btc"
}
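
For completeness, a sketch of how the cluster is brought up and checked from this definition (the auth/kubeconfig path under asset_dir is an assumption about Typhoon's generated asset layout):

terraform init
terraform plan
terraform apply
export KUBECONFIG=/home/thor/.secrets/clusters/btc/auth/kubeconfig
kubectl get nodes -o wide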

Yes, DNS seems to be working fine. I didn't set the TTL explicitly; it took the default, which is 300. When I describe the kube-apiserver pod, I get this:

Events:
  Type     Reason                 Age                From                   Message
  ----     ------                 ----               ----                   -------
  Normal   SuccessfulMountVolume  30m                kubelet, 10.130.50.55  MountVolume.SetUp succeeded for volume "var-lock"
  Normal   SuccessfulMountVolume  30m                kubelet, 10.130.50.55  MountVolume.SetUp succeeded for volume "ssl-certs-host"
  Normal   SuccessfulMountVolume  30m                kubelet, 10.130.50.55  MountVolume.SetUp succeeded for volume "default-token-wxrsm"
  Normal   SuccessfulMountVolume  30m                kubelet, 10.130.50.55  MountVolume.SetUp succeeded for volume "secrets"
  Warning  BackOff                30m (x4 over 30m)  kubelet, 10.130.50.55  Back-off restarting failed container
  Normal   Pulled                 29m (x4 over 30m)  kubelet, 10.130.50.55  Container image "gcr.io/google_containers/hyperkube:v1.9.3" already present on machine
  Normal   Created                29m (x4 over 30m)  kubelet, 10.130.50.55  Created container
  Normal   Started                29m (x4 over 30m)  kubelet, 10.130.50.55  Started container
  Warning  FailedMount            29m (x2 over 29m)  kubelet, 10.130.50.55  MountVolume.SetUp failed for volume "default-token-wxrsm" : Get https://btc.geek.per.sg:443/api/v1/namespaces/kube-system/secrets/default-token-wxrsm: dial tcp 159.89.200.16:443: getsockopt: connection refused

Even though there is this warning, the kube-apiserver pod is running.
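
As a sketch, the DNS record and apiserver endpoint can be double-checked from outside the cluster (hostname is the one from this report; an unauthenticated request may return 401/403, which still proves connectivity):

dig +noall +answer btc.geek.per.sg
curl -k https://btc.geek.per.sg:443/version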

I moved to the nyc3 region and tried to recreate the cluster twice. So far it hasn't shown that strange flannel issue. I will try to observe this more.

Again many thanks for the help!

balchua commented 6 years ago

Closing this one. I haven't experienced the issue since moving to the nyc3 region.

dghubble commented 6 years ago

tl;dr: I believe this is a problem with Digital Ocean private networking in the sgp1 region. It's not related to Kubernetes or flannel as far as I can tell.

Typhoon Digital Ocean test clusters run in nyc3 and I can't reproduce the issue there, which aligns with what you're seeing. I spun up a cluster in sgp1 to check.

Investigating

Control plane bootstrapping completes successfully. Terraform apply completes successfully. The Kubernetes control plane is technically healthy, and flannel pods on controllers are healthy. However, all flannel pods on workers crash loop, which aligns with your initial report.

Evidence

Workers register with private IPs to use DO private networking. On a healthy DO cluster (say nyc3), you can SSH to a controller and curl the kubelet health endpoint of a worker.

# good
ssh core@controller.blah
curl http://127.0.0.1:10255
404 page not found

Only in sgp1 clusters, the controller can't route to the worker over private networking (you can SSH into the worker via its public IP and verify the kubelet is indeed running).

# bad
ssh core@controller.blah
curl http://10.130.74.71:10255                                    
curl: (7) Failed to connect to 10.130.74.71 port 10255: No route to host 
# controller
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000                                                                                         
    link/ether 32:c3:be:97:84:a0 brd ff:ff:ff:ff:ff:ff                                         
    inet 10.130.74.11/16 brd 10.130.255.255 scope global eth1                                  
       valid_lft forever preferred_lft forever 
    inet6 fe80::30c3:beff:fe97:84a0/64 scope link                                              
       valid_lft forever preferred_lft forever 
# worker
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 2e:10:ec:be:4a:f2 brd ff:ff:ff:ff:ff:ff                                         
    inet 10.130.74.71/16 brd 10.130.255.255 scope global eth1                                  
       valid_lft forever preferred_lft forever 
    inet6 fe80::2c10:ecff:febe:4af2/64 scope link                                              
       valid_lft forever preferred_lft forever 

An even simpler test is that you can normally SSH into a controller (with agent forwarding) and then SSH to a worker via its private IP. That doesn't work between droplets created in sgp1.
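
A sketch of that test with the addresses from the output above (-A forwards the SSH agent so the second hop can authenticate):

ssh -A core@<controller-public-ip>
ssh core@10.130.74.71
# in sgp1 the second hop fails with "no route to host"; in nyc3 it works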

It's clear this is a host-level private networking issue between droplets. Or perhaps the firewall rules we're creating aren't actually being applied in reality? As far as I can tell, this isn't related to flannel, Kubernetes, or anything Typhoon is doing wrong.
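
As a rough sanity check on the firewall side (this assumes the rules in question are DigitalOcean cloud firewalls managed by Terraform; if they are host-level rules instead, inspect iptables on the droplets):

doctl compute firewall list
# host-level rules, on a droplet:
sudo iptables -S | head -n 40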

I'm content to say: for now, don't pick Digital Ocean's sgp1 region.

dghubble commented 6 years ago

@joonas