You'll have to do a bit more debugging. The DO test clusters for v1.9.3 don't reproduce this problem.
One thing to pay attention to on DO is that the default VMs are quite small compared with other cloud providers. Be sure this isn't related to resource contention.
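A quick way to rule that out is to SSH into a droplet and glance at memory, CPU count, and load. A rough sketch; the hostname is a placeholder and it assumes the usual coreutils/procps tools on Container Linux:
# placeholder hostname; look for memory/CPU pressure on a worker
ssh core@worker.example.com
free -m     # available memory
nproc       # CPU count
uptime      # load averages
df -h /     # root disk usage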
Sorry, I'm quite new to this. I will try bumping up the size of the droplets.
The strange thing, though, is that I haven't really put any apps on it, just the basic components. I tried with s-2vcpu-4gb for both the controller and the workers.
I didn't experience this strange behavior with 1.8.6 (I think). Another weird thing is that sometimes when I do kubectl get nodes, I don't see all the nodes listed, but I can see that flannel pods are on all workers. Have you experienced such behavior too?
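Roughly, I compare what has registered against where the flannel pods landed like this (illustrative commands, not output from this cluster):
kubectl get nodes -o wide
kubectl -n kube-system get pods -o wide | grep flannel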
No, I don't see this. Can you post your module definition? Is it specific to certain regions (I see a few are under maintenance)? Is your DNS setup ok (you don't have some insane long TTL that still points to something else)? Using coreos-stable? There is not enough info here to provide much real help.
Also please post the (formatted) flannel logs.
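If kubectl can still reach the apiserver, the crash looping pod's current and previous logs can be pulled with something like the following (pod name is a placeholder); otherwise grabbing them from docker on the node works too.
kubectl -n kube-system logs <flannel-pod-name> -c kube-flannel               # current container
kubectl -n kube-system logs <flannel-pod-name> -c kube-flannel --previous    # last crashed container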
Thanks for getting back on this one. Awesome work, by the way. Yes, I am using coreos-stable.
Here are the flannel logs; this is all I can grab directly from the docker logs.
{"log":"I0216` 01:16:07.561144 1 main.go:416] Searching for interface using 10.130.51.129\n","stream":"stderr","time":"2018-02-16T01:16:07.561601825Z"}
{"log":"I0216 01:16:07.561937 1 main.go:487] Using interface with name eth1 and address 10.130.51.129\n","stream":"stderr","time":"2018-02-16T01:16:07.562230174Z"}
{"log":"I0216 01:16:07.562070 1 main.go:504] Defaulting external address to interface address (10.130.51.129)\n","stream":"stderr","time":"2018-02-16T01:16:07.562254728Z"}
{"log":"E0216 01:16:10.662655 1 main.go:231] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-bzgks': Get https://10.3.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-bzgks: dial tcp 10.3.0.1:443: getsockopt: no route to host\n","stream":"stderr","time":"2018-02-16T01:16:10.667137665Z"}
Further logs from journalctl:
Feb 16 01:24:15 btc-worker-1 kubelet-wrapper[1248]: I0216 01:24:15.866831 1248 kuberuntime_manager.go:503] Container {Name:kube-flannel Image:quay.io/coreos/flannel:v0.9.1-amd64 Command:[/opt/bin/flanneld --ip-masq --kube-subnet-mgr --iface=$(POD_IP)] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:POD_NAME Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_NAMESPACE Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:run ReadOnly:false MountPath:/run SubPath: MountPropagation:<nil>} {Name:cni ReadOnly:false MountPath:/etc/cni/net.d SubPath: MountPropagation:<nil>} {Name:flannel-cfg ReadOnly:false MountPath:/etc/kube-flannel/ SubPath: MountPropagation:<nil>} {Name:default-token-vkvsc ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Feb 16 01:24:15 btc-worker-1 kubelet-wrapper[1248]: I0216 01:24:15.867137 1248 kuberuntime_manager.go:742] checking backoff for container "kube-flannel" in pod "kube-flannel-bzgks_kube-system(8939894d-12b4-11e8-8fc1-26d06e90ea57)"
Feb 16 01:24:15 btc-worker-1 kubelet-wrapper[1248]: I0216 01:24:15.867293 1248 kuberuntime_manager.go:752] Back-off 5m0s restarting failed container=kube-flannel pod=kube-flannel-bzgks_kube-system(8939894d-12b4-11e8-8fc1-26d06e90ea57)
Feb 16 01:24:15 btc-worker-1 kubelet-wrapper[1248]: E0216 01:24:15.867334 1248 pod_workers.go:182] Error syncing pod 8939894d-12b4-11e8-8fc1-26d06e90ea57 ("kube-flannel-bzgks_kube-system(8939894d-12b4-11e8-8fc1-26d06e90ea57)"), skipping: failed to "StartContainer" for "kube-flannel" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-flannel pod=kube-flannel-bzgks_kube-system(8939894d-12b4-11e8-8fc1-26d06e90ea57)"
Module definition
module "digital-ocean-btc" {
source = "git::https://github.com/poseidon/typhoon//digital-ocean/container-linux/kubernetes?ref=v1.9.3"
region = "sgp1"
dns_zone = "geek.per.sg"
cluster_name = "btc"
image = "coreos-stable"
controller_count = 1
controller_type = "s-4vcpu-8gb"
worker_count = 3
worker_type = "s-4vcpu-8gb"
ssh_fingerprints = ["${var.digitalocean_ssh_fingerprint}"]
# output assets dir
asset_dir = "/home/thor/.secrets/clusters/btc"
}
Yes, DNS seems to be working fine. I didn't set the TTL explicitly; it took the default, which is 300. When I describe the kube-apiserver pod, I get this:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 30m kubelet, 10.130.50.55 MountVolume.SetUp succeeded for volume "var-lock"
Normal SuccessfulMountVolume 30m kubelet, 10.130.50.55 MountVolume.SetUp succeeded for volume "ssl-certs-host"
Normal SuccessfulMountVolume 30m kubelet, 10.130.50.55 MountVolume.SetUp succeeded for volume "default-token-wxrsm"
Normal SuccessfulMountVolume 30m kubelet, 10.130.50.55 MountVolume.SetUp succeeded for volume "secrets"
Warning BackOff 30m (x4 over 30m) kubelet, 10.130.50.55 Back-off restarting failed container
Normal Pulled 29m (x4 over 30m) kubelet, 10.130.50.55 Container image "gcr.io/google_containers/hyperkube:v1.9.3" already present on machine
Normal Created 29m (x4 over 30m) kubelet, 10.130.50.55 Created container
Normal Started 29m (x4 over 30m) kubelet, 10.130.50.55 Started container
Warning FailedMount 29m (x2 over 29m) kubelet, 10.130.50.55 MountVolume.SetUp failed for volume "default-token-wxrsm" : Get https://btc.geek.per.sg:443/api/v1/namespaces/kube-system/secrets/default-token-wxrsm: dial tcp 159.89.200.16:443: getsockopt: connection refused
Even though there is this warning, the kube-apiserver pod is running.
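For reference, the record and its TTL can be double-checked with dig from anywhere; the domain is from the module definition above, and the expected answer is the controller's public IP (159.89.200.16 in the events):
dig +short btc.geek.per.sg            # should print the controller's public IP
dig +noall +answer btc.geek.per.sg    # also shows the remaining TTL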
I moved to the nyc3 region and tried to recreate the cluster twice. So far it hasn't had that strange flannel issue. I will try to observe this more.
Again, many thanks for the help!
Closing this one. I haven't experienced the issue since moving to the nyc3 region.
tl;dr: I believe this is a problem with Digital Ocean private networking in the sgp1 region. It's not related to Kubernetes or flannel as far as I can tell.
Typhoon Digital Ocean test clusters run in nyc3 and I can't reproduce the issue there, which aligns with what you're seeing. Spinning up a cluster in sgp1:
Control plane bootstrapping completes successfully. Terraform apply completes successfully. The Kubernetes control plane is technically healthy. Flannel pods on controllers are healthy. However, all flannel pods on workers crash loop, which aligns with your initial report.
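The flannel error in the logs above follows from that: 10.3.0.1 is the apiserver's service VIP, which kube-proxy rewrites to the apiserver's advertised address (the controller's private IP here, as far as I can tell), so the request dies when private networking is broken. A rough check from a crash looping worker, using IPs that appear elsewhere in this thread; any HTTP error in the response still proves reachability, while "no route to host" does not:
# from a worker in the sgp1 cluster
curl -k https://10.3.0.1/          # apiserver service VIP
curl -k https://10.130.74.11/      # controller private IP directly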
Workers register with private IPs to use DO private networking. On a healthy DO cluster (say nyc3), you can SSH to a controller and curl the kubelet health endpoint of a worker.
# good
ssh core@controller.blah
curl http://127.0.0.1:10255
404 page not found
Only in clusters in sgp1, the controller can't route to the worker over private networking (you can SSH into the worker via its public IP and verify kubelet is indeed running).
# bad
ssh core@controller.blah
curl http://10.130.74.71:10255
curl: (7) Failed to connect to 10.130.74.71 port 10255: No route to host
Yet eth1 (DO private networking) is configured on both droplets:
# controller
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 32:c3:be:97:84:a0 brd ff:ff:ff:ff:ff:ff
inet 10.130.74.11/16 brd 10.130.255.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::30c3:beff:fe97:84a0/64 scope link
valid_lft forever preferred_lft forever
# worker
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 2e:10:ec:be:4a:f2 brd ff:ff:ff:ff:ff:ff
inet 10.130.74.71/16 brd 10.130.255.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::2c10:ecff:febe:4af2/64 scope link
valid_lft forever preferred_lft forever
An even simpler test is that you can normally SSH into a controller (with agent forwarding) and then SSH to a worker via its private IP. That doesn't work between droplets created in sgp1.
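Concretely, that test is just the following (controller.blah follows the earlier examples; 10.130.74.71 is the worker's private IP from the output above):
# from your machine, with agent forwarding
ssh -A core@controller.blah
# then, on the controller, hop to the worker over private networking
ssh core@10.130.74.71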
It's clear this is a host-level private networking issue between droplets. Or perhaps the firewall rules we're creating aren't actually being applied? As far as I can tell, this isn't related to flannel or Kubernetes or anything Typhoon is doing wrong.
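If you have doctl configured, one way to eyeball that is to list the firewalls and inspect the cluster's rules and droplet tags; a hedged sketch, since I haven't checked the exact output here:
doctl compute firewall list
doctl compute firewall get <firewall-id>    # inbound/outbound rules and tagged droplets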
I'm content to say for now: don't pick Digital Ocean sgp1.
@joonas
Bug
kube-flannel goes into a crash loop. This does not happen on all flannel pods. The error is "unknown host"; it seems like it could not contact the API server.
Environment
0.11.1
Problem
When I bootstrap a Kubernetes cluster on Digital Ocean, kube-flannel goes into a crash loop. This does not happen on all pods. For example, I bootstrapped 4 worker nodes, and either 1 or 2 pods go into a crash loop. The error reported is "unknown host" when it is trying to connect to the API server.
Due to this, the nginx add-on does not work anymore. Other pods' status is Running.
Desired Behavior
Flannel in a stable, running state.
Steps to Reproduce
I simply followed the steps described in the Digital Ocean distribution of Typhoon.
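For anyone reproducing, a minimal sketch of those steps, assuming the module definition above and the standard Terraform workflow from the Typhoon docs (the kubeconfig path follows the module's asset_dir):
# from the directory containing the module definition
terraform init
terraform plan
terraform apply
# once bootstrapping finishes, use the generated kubeconfig
export KUBECONFIG=/home/thor/.secrets/clusters/btc/auth/kubeconfig
kubectl get nodes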