kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

GCE CNI is not working #2087

Open chrislovecnm opened 7 years ago

chrislovecnm commented 7 years ago

This is an upstream issue, I am opening it here in order to track.

cc: @thockin @justinsb

Dirbaio commented 7 years ago

I've had success running a cluster on GCE with Calico with some workarounds. Pod-to-pod networking works, load balancers work!

Here's what I did:

chrislovecnm commented 7 years ago

Thanks, @Dirbaio for the workaround! You saved me a bunch of time.

sjezewski commented 7 years ago

Hmmm ... I can't seem to get the workaround working for me.

The symptoms I'm seeing are:

a) kubectl get nodes only returns the master, so when you say 'wait for the kubelet to be up and detect all nodes', that never seems to happen for me.

b) some of the pods (including calico) under the kube-system namespace don't come up:

17-05-18[18:16:58]:pachyderm:0$kubectl --namespace=kube-system get all
NAME                                                READY     STATUS    RESTARTS   AGE
po/calico-node-crp1q                                1/2       Error     3          3m
po/calico-policy-controller-811246363-b70w7         0/1       Pending   0          3m
po/dns-controller-3881114374-glrrs                  0/1       Pending   0          3m
po/etcd-server-events-master-us-west1-a-dr4r        1/1       Running   0          3m
po/etcd-server-master-us-west1-a-dr4r               1/1       Running   0          3m
po/kube-apiserver-master-us-west1-a-dr4r            1/1       Running   0          2m
po/kube-controller-manager-master-us-west1-a-dr4r   1/1       Running   0          3m
po/kube-dns-1321724180-r90gh                        0/3       Pending   0          3m
po/kube-dns-autoscaler-265231812-74rmf              0/1       Pending   0          3m
po/kube-proxy-master-us-west1-a-dr4r                1/1       Running   0          3m
po/kube-scheduler-master-us-west1-a-dr4r            1/1       Running   0          3m

NAME           CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
svc/kube-dns   100.64.0.10   <none>        53/UDP,53/TCP   3m

NAME                    DESIRED   SUCCESSFUL   AGE
jobs/configure-calico   1         0            3m

NAME                              DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/calico-policy-controller   1         1         1            0           3m
deploy/dns-controller             1         1         1            0           3m
deploy/kube-dns                   1         1         1            0           3m
deploy/kube-dns-autoscaler        1         1         1            0           3m

NAME                                    DESIRED   CURRENT   READY     AGE
rs/calico-policy-controller-811246363   1         1         0         3m
rs/dns-controller-3881114374            1         1         0         3m
rs/kube-dns-1321724180                  1         1         0         3m
rs/kube-dns-autoscaler-265231812        1         1         0         3m

In particular ... the calico pod reports an error connecting to etcd:

WARNING: $CALICO_NETWORKING will be deprecated: use $CALICO_NETWORKING_BACKEND instead
time="2017-05-18T23:05:17Z" level=info msg="NODENAME environment not specified - check HOSTNAME" 
time="2017-05-18T23:05:17Z" level=info msg="Loading config from environment" 
Skipping datastore connection test
time="2017-05-18T23:05:47Z" level=info msg="Unhandled error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://etcd-a.internal.pach-af13f1b4.k8s.com:4001 exceeded header timeout
" 
time="2017-05-18T23:05:47Z" level=info msg="Unable to query node configuration" Name=master-us-west1-a-pl8c error="client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://etcd-a.internal.pach-af13f1b4.k8s.com:4001 exceeded header timeout
" 
ERROR: Unable to access datastore to query node configuration
Terminating
Calico node failed to start

In the syslog on the k8s master node, I see errors preventing the pods from coming up:

May 18 21:30:41 nodes-8lfc kubelet[6176]: I0518 21:30:41.914002    6176 kubelet_node_status.go:77] Attempting to register node nodes-8lfc
May 18 21:30:41 nodes-8lfc kubelet[6176]: E0518 21:30:41.953717    6176 eviction_manager.go:214] eviction manager: unexpected err: failed GetNode: node 'nodes-8lfc' not found
May 18 21:31:11 nodes-8lfc kubelet[6176]: E0518 21:31:11.914799    6176 kubelet_node_status.go:101] Unable to register node "nodes-8lfc" with API server: Post https://api.internal.pach-bf8b2e74.k8s.com/api/v1/nodes: dial tcp 208.73.210.202:443: i/o timeout
May 18 21:31:11 nodes-8lfc kubelet[6176]: E0518 21:31:11.936020    6176 kubelet.go:2067] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR

Which makes sense I think ... if calico isn't coming up.

Looking at the etcd pods ... they're healthy in every way that I can tell (describe / logs / manual attempt at healthcheck).

The other pods in the namespace that are stuck in pending ... report that there are 'no nodes' to schedule them on.

This doesn't quite make sense. Describing the master node ... it's only at ~20% utilization for CPU resources. So unless those pods have anti-affinity rules? And are blocked waiting for the other nodes to get registered to the cluster?

What remains unclear is why calico can't connect to etcd. That seems to be the underlying issue here, but I don't have much in the way of clues as to why that might be.
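For anyone debugging the same symptom, a quick way to separate "etcd is broken" from "the node can't reach etcd" is to probe the client endpoint directly from the failing node. A rough sketch, reusing the hostname and port from the error message above (adjust for your own cluster):

# run on the node whose calico pod is crashing
dig +short etcd-a.internal.pach-af13f1b4.k8s.com                      # does the etcd name resolve at all?
curl -m 5 http://etcd-a.internal.pach-af13f1b4.k8s.com:4001/health    # etcd health endpoint on the client port
# a hang/timeout here (rather than "connection refused") usually means a firewall
# rule is dropping traffic to the etcd client port between instances, which would
# also explain the "exceeded header timeout" errors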

I'd love it if kops worked for GCE out of the box. We could really use multi-cloud support under a single tool (and I love kops ... I'd love it to be this tool). But right now ... it seems like kops is only well supported on AWS. That's a great start ... I hope support for other cloud providers (incl. Digital Ocean!) comes soon.

KeithTt commented 6 years ago

@sjezewski I got the same error as you:

Nov 17 15:39:51 uy08-08 kubelet[24455]: 2017-11-17 15:39:51.799 [INFO][3937] client.go 202: Loading config from environment
Nov 17 15:39:59 uy08-08 kubelet[24455]: I1117 15:39:59.398876   24455 kuberuntime_manager.go:499] Container {Name:calico-node Image:quay.io/calico/node:v2.6.2 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:ETCD_ENDPOINTS Value: ValueFrom:&EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:&ConfigMapKeySelector{LocalObjectReference:LocalObjectReference{Name:calico-config,},Key:etcd_endpoints,Optional:nil,},SecretKeyRef:nil,}} {Name:CALICO_NETWORKING_BACKEND Value: ValueFrom:&EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:&ConfigMapKeySelector{LocalObjectReference:LocalObjectReference{Name:calico-config,},Key:calico_backend,Optional:nil,},SecretKeyRef:nil,}} {Name:CLUSTER_TYPE Value:kubeadm,bgp ValueFrom:nil} {Name:CALICO_DISABLE_FILE_LOGGING Value:true ValueFrom:nil} {Name:FELIX_DEFAULTENDPOINTTOHOSTACTION Value:ACCEPT ValueFrom:nil} {Name:CALICO_IPV4POOL_CIDR Value:192.168.122.0/24 ValueFrom:nil} {Name:CALICO_IPV4POOL_IPIP Value:always ValueFrom:nil} {Name:FELIX_IPV6SUPPORT Value:false ValueFrom:nil} {Name:FELIX_IPINIPMTU Value:1440 ValueFrom:nil} {Name:FELIX_LOGSEVERITYSCREEN Value:info ValueFrom:nil} {Name:IP Value: ValueFrom:nil} {Name:FELIX_HEALTHENABLED Value:true ValueFrom:nil}] Resources:{Limits:map[] Requests:map[cpu:{i:{value:250 scale:-3} d:{Dec:<nil>} s:250m Format:DecimalSI}]} VolumeMounts:[{Name:lib-modules ReadOnly:true MountPath:/lib/modules SubPath: MountPropagation:<nil>} {Name:var-run-calico ReadOnly:false MountPath:/var/run/calico SubPath: MountPropagation:<nil>} {Name:calico-cni-plugin-token-5qtk2 ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/liveness,Port:9099,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:10,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:6,} ReadinessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/readiness,Port:9099,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDe
Nov 17 15:39:59 uy08-08 kubelet[24455]: laySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Nov 17 15:39:59 uy08-08 kubelet[24455]: I1117 15:39:59.399155   24455 kuberuntime_manager.go:738] checking backoff for container "calico-node" in pod "calico-node-2sb8q_kube-system(5351b070-cbc4-11e7-9fbc-34e6d7899e5d)"
Nov 17 15:39:59 uy08-08 kubelet[24455]: I1117 15:39:59.399373   24455 kuberuntime_manager.go:748] Back-off 5m0s restarting failed container=calico-node pod=calico-node-2sb8q_kube-system(5351b070-cbc4-11e7-9fbc-34e6d7899e5d)
Nov 17 15:39:59 uy08-08 kubelet[24455]: E1117 15:39:59.399406   24455 pod_workers.go:182] Error syncing pod 5351b070-cbc4-11e7-9fbc-34e6d7899e5d ("calico-node-2sb8q_kube-system(5351b070-cbc4-11e7-9fbc-34e6d7899e5d)"), skipping: failed to "StartContainer" for "calico-node" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=calico-node pod=calico-node-2sb8q_kube-system(5351b070-cbc4-11e7-9fbc-34e6d7899e5d)"
Nov 17 15:39:59 uy08-08 kubelet[24455]: 2017-11-17 15:39:59.862 [INFO][3899] etcd.go 373: Unhandled error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.96.232.136:6666 exceeded header timeout
Nov 17 15:39:59 uy08-08 kubelet[24455]: E1117 15:39:59.864160   24455 cni.go:319] Error deleting network: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.96.232.136:6666 exceeded header timeout
Nov 17 15:39:59 uy08-08 kubelet[24455]: E1117 15:39:59.864927   24455 remote_runtime.go:115] StopPodSandbox "319ff7d36d67170d9c9f088c825d87096572ac83cdc8d3054da5c3163e358d3a" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "kube-dns-545bc4bfd4-xb5z9_kube-system" network: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.96.232.136:6666 exceeded header timeout
Nov 17 15:39:59 uy08-08 kubelet[24455]: E1117 15:39:59.864967   24455 kuberuntime_manager.go:780] Failed to stop sandbox {"docker" "319ff7d36d67170d9c9f088c825d87096572ac83cdc8d3054da5c3163e358d3a"}
Nov 17 15:39:59 uy08-08 kubelet[24455]: E1117 15:39:59.865041   24455 kuberuntime_manager.go:580] killPodWithSyncResult failed: failed to "KillPodSandbox" for "bfdfe510-cbcd-11e7-9258-f8db8846245c" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"kube-dns-545bc4bfd4-xb5z9_kube-system\" network: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.96.232.136:6666 exceeded header timeout\n"
Nov 17 15:39:59 uy08-08 kubelet[24455]: E1117 15:39:59.865079   24455 pod_workers.go:182] Error syncing pod bfdfe510-cbcd-11e7-9258-f8db8846245c ("kube-dns-545bc4bfd4-xb5z9_kube-system(bfdfe510-cbcd-11e7-9258-f8db8846245c)"), skipping: failed to "KillPodSandbox" for "bfdfe510-cbcd-11e7-9258-f8db8846245c" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"kube-dns-545bc4bfd4-xb5z9_kube-system\" network: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.96.232.136:6666 exceeded header timeout\n"
Nov 17 15:40:00 uy08-08 kubelet[24455]: W1117 15:40:00.922455   24455 docker_sandbox.go:343] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "kube-dns-545bc4bfd4-xb5z9_kube-system": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "319ff7d36d67170d9c9f088c825d87096572ac83cdc8d3054da5c3163e358d3a"
Nov 17 15:40:00 uy08-08 kubelet[24455]: W1117 15:40:00.923423   24455 cni.go:265] CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "319ff7d36d67170d9c9f088c825d87096572ac83cdc8d3054da5c3163e358d3a"
Nov 17 15:40:01 uy08-08 kubelet[24455]: 2017-11-17 15:40:01.041 [INFO][3969] calico.go 315: Extracted identifiers ContainerID="319ff7d36d67170d9c9f088c825d87096572ac83cdc8d3054da5c3163e358d3a" Node="uy08-08" Orchestrator="k8s" Workload="kube-system.kube-dns-545bc4bfd4-xb5z9"
Nov 17 15:40:01 uy08-08 kubelet[24455]: 2017-11-17 15:40:01.041 [INFO][3969] utils.go 250: Configured environment: [LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin INVOCATION_ID=01c46102bc694a43a623f768273b5a1c JOURNAL_STREAM=8:91880 KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt KUBELET_CADVISOR_ARGS=--cadvisor-port=0 KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki CNI_COMMAND=DEL CNI_CONTAINERID=319ff7d36d67170d9c9f088c825d87096572ac83cdc8d3054da5c3163e358d3a CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=kube-dns-545bc4bfd4-xb5z9;K8S_POD_INFRA_CONTAINER_ID=319ff7d36d67170d9c9f088c825d87096572ac83cdc8d3054da5c3163e358d3a CNI_IFNAME=eth0 CNI_PATH=/opt/calico/bin:/opt/cni/bin ETCD_ENDPOINTS=http://10.96.232.136:6666 KUBECONFIG=/etc/cni/net.d/calico-kubeconfig K8S_API_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJjYWxpY28tY25pLXBsdWdpbi10b2tlbi01cXRrMiIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJjYWxpY28tY25pLXBsdWdpbiIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjE4NmZlYmI5LWNhM2ItMTFlNy05ZmJjLTM0ZTZkNzg5OWU1ZCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlLXN5c3RlbTpjYWxpY28tY25pLXBsdWdpbiJ9.TzNAVry1HBgJa3SH93mLlSxv5927MywAp_02dR1zf_Pht_F4AWXPMASUSF1rmMVuzEprp3drgIeYg1G-oN3oPIy3tnryAtmZbsmjBCylpLntjgZaJcXneCYgk8G8I0WfWO6H6jcG46cVoRB-3FQjKQKzedbgnURUA2EOE4sN2oLOSp5R0LMyh4GZQIEm1zW
Nov 17 15:40:01 uy08-08 kubelet[24455]: Xn8OSXQ3qh9iehXm9xpep3krkf5uoBcPfe-XrHjfPETyVSTS6oADdcO3RsIQDlQOtEGKy0WnJbIRcHQiIcIVVDf1MT5Yo6gWzD7dlEynLsvo4tvEuAi0IgCsO7k34PwuMjys-FQWcPUwF1uOOgD-XAA]

Also, my etcd cluster is healthy:

# export ETCDCTL_API=2
# etcdctl cluster-health
member 93ac7045b7c80fe2 is healthy: got healthy result from http://192.168.5.105:2379
member cceea3802386922f is healthy: got healthy result from http://192.168.5.104:2379
member e1f394bfa58b2a7f is healthy: got healthy result from http://192.168.5.42:2379
cluster is healthy

chrislovecnm commented 6 years ago

/cc @bboreham @caseydavenport

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

chrislovecnm commented 6 years ago

/lifecycle frozen /remove-lifecycle stale

ae-v commented 5 years ago

Flannel doesn't work either. Only kubenet works out of the box on GCE.

christianh814 commented 5 years ago

Adding my +1 as I just ran into this today testing on GCE. (CNI read only). Going to try the workaround later.

I got the same issue of "Master nodes are up but nodes never come up".

christianh814 commented 5 years ago

So I got the following in the kubelet logs

root@nodes-3lbm:~# systemctl status kubelet.service  -l --no-pager
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/lib/systemd/system/kubelet.service; static; vendor preset: enabled)
   Active: active (running) since Mon 2019-02-18 17:17:18 UTC; 12min ago
     Docs: https://github.com/kubernetes/kubernetes
 Main PID: 6144 (kubelet)
    Tasks: 15
   Memory: 42.8M
      CPU: 11.959s
   CGroup: /system.slice/kubelet.service
           └─6144 /usr/local/bin/kubelet --allow-privileged=true --anonymous-auth=false --cgroup-root=/ --client-ca-file=/srv/kubernetes/ca.crt --cloud-provider=gce --cluster-dns=100.64.0.10 --cluster-domain=cluster.local --enable-debugging-handlers=true --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5% --feature-gates=ExperimentalCriticalPodAnnotation=true --hairpin-mode=promiscuous-bridge --kubeconfig=/var/lib/kubelet/kubeconfig --network-plugin=cni --node-labels=kops.k8s.io/instancegroup=nodes,kubernetes.io/role=node,node-role.kubernetes.io/node= --non-masquerade-cidr=100.64.0.0/10 --pod-infra-container-image=k8s.gcr.io/pause-amd64:3.0 --pod-manifest-path=/etc/kubernetes/manifests --register-schedulable=true --v=2 --cloud-config=/etc/kubernetes/cloud.config --cni-bin-dir=/opt/cni/bin/ --cni-conf-dir=/etc/cni/net.d/

Feb 18 17:29:27 nodes-3lbm kubelet[6144]: W0218 17:29:27.251404    6144 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/
Feb 18 17:29:27 nodes-3lbm kubelet[6144]: E0218 17:29:27.251544    6144 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Feb 18 17:29:28 nodes-3lbm kubelet[6144]: E0218 17:29:28.692316    6144 event.go:212] Unable to write event: 'Post https://api.internal.k8s.chx.cloud/api/v1/namespaces/default/events: dial tcp 203.0.113.123:443: i/o timeout' (may retry after sleeping)
Feb 18 17:29:32 nodes-3lbm kubelet[6144]: E0218 17:29:32.052379    6144 eviction_manager.go:243] eviction manager: failed to get get summary stats: failed to get node info: node "nodes-3lbm" not found
Feb 18 17:29:32 nodes-3lbm kubelet[6144]: I0218 17:29:32.139680    6144 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "nodes-3lbm"
Feb 18 17:29:32 nodes-3lbm kubelet[6144]: I0218 17:29:32.144031    6144 cloud_request_manager.go:108] Node addresses from cloud provider for node "nodes-3lbm" collected
Feb 18 17:29:32 nodes-3lbm kubelet[6144]: W0218 17:29:32.253006    6144 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/
Feb 18 17:29:32 nodes-3lbm kubelet[6144]: E0218 17:29:32.253173    6144 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Feb 18 17:29:37 nodes-3lbm kubelet[6144]: W0218 17:29:37.254504    6144 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/
Feb 18 17:29:37 nodes-3lbm kubelet[6144]: E0218 17:29:37.255133    6144 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

This always happens when using GCE.

Only the masters ever come up:

# kubectl get nodes
NAME                     STATUS    ROLES     AGE       VERSION
master-us-east1-b-m2v0   Ready     master    14m       v1.11.6
master-us-east1-c-35mz   Ready     master    14m       v1.11.6
master-us-east1-d-27w5   Ready     master    14m       v1.11.6
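The "No networks found in /etc/cni/net.d/" warnings above are usually a symptom rather than the cause: the CNI add-on's DaemonSet (calico-node's install-cni container, in Calico's case) is what writes the config into that directory, so if that pod never starts cleanly the directory stays empty and the kubelet keeps reporting the network plugin as not ready. A rough check on an affected node (file names are illustrative and vary by CNI and version):

# on a node that refuses to register
ls -l /etc/cni/net.d/    # stays empty until the CNI pod has installed its config (e.g. 10-calico.conf)
ls -l /opt/cni/bin/      # the CNI binaries the kubelet is pointed at via --cni-bin-dir
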
bboreham commented 5 years ago

Which CNI implementation did you select?

christianh814 commented 5 years ago

I used Calico.

christianh814 commented 5 years ago

Using flannel also doesn't work

Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.075611    7822 server.go:986] Started kubelet
Feb 18 20:42:15 nodes-g86f kubelet[7822]: W0218 20:42:15.076407    7822 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/
Feb 18 20:42:15 nodes-g86f kubelet[7822]: E0218 20:42:15.076568    7822 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.077666    7822 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "nodes-g86f"
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.077827    7822 desired_state_of_world_populator.go:130] Desired state populator starts to run
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.079473    7822 cloud_request_manager.go:108] Node addresses from cloud provider for node "nodes-g86f" collected
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.111010    7822 factory.go:356] Registering Docker factory
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.113141    7822 factory.go:54] Registering systemd factory
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.113507    7822 factory.go:86] Registering Raw factory
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.113885    7822 manager.go:1205] Started watching for new ooms in manager
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.115803    7822 manager.go:356] Starting recovery of all containers
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.179219    7822 kubelet_node_status.go:269] Setting node annotation to enable volume controller attach/detach
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.179385    7822 kubelet.go:1771] skipping pod synchronization - [container runtime is down]
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.180767    7822 manager.go:361] Recovery completed
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.209020    7822 kubelet_node_status.go:317] Adding node label from cloud provider: beta.kubernetes.io/instance-type=n1-standard-2
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.209558    7822 kubelet_node_status.go:328] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=us-east1-b
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.209874    7822 kubelet_node_status.go:332] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=us-east1
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.212009    7822 kubelet_node_status.go:441] Recording NodeHasSufficientDisk event message for node nodes-g86f
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.212463    7822 kubelet_node_status.go:441] Recording NodeHasSufficientMemory event message for node nodes-g86f
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.212807    7822 kubelet_node_status.go:441] Recording NodeHasNoDiskPressure event message for node nodes-g86f
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.213151    7822 kubelet_node_status.go:441] Recording NodeHasSufficientPID event message for node nodes-g86f
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.213491    7822 kubelet_node_status.go:79] Attempting to register node nodes-g86f
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.248387    7822 kubelet_node_status.go:269] Setting node annotation to enable volume controller attach/detach
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.252353    7822 kubelet_node_status.go:317] Adding node label from cloud provider: beta.kubernetes.io/instance-type=n1-standard-2
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.252381    7822 kubelet_node_status.go:328] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=us-east1-b
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.252388    7822 kubelet_node_status.go:332] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=us-east1
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.254741    7822 kubelet_node_status.go:441] Recording NodeHasSufficientDisk event message for node nodes-g86f
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.254775    7822 kubelet_node_status.go:441] Recording NodeHasSufficientMemory event message for node nodes-g86f
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.254789    7822 kubelet_node_status.go:441] Recording NodeHasNoDiskPressure event message for node nodes-g86f
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.254799    7822 kubelet_node_status.go:441] Recording NodeHasSufficientPID event message for node nodes-g86f
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.254828    7822 cpu_manager.go:155] [cpumanager] starting with none policy
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.254836    7822 cpu_manager.go:156] [cpumanager] reconciling every 10s
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.254845    7822 policy_none.go:42] [cpumanager] none policy: Start
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.255596    7822 manager.go:201] Starting Device Plugin manager
Feb 18 20:42:15 nodes-g86f kubelet[7822]: Starting Device Plugin manager
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.255841    7822 manager.go:237] Serving device plugin registration server on "/var/lib/kubelet/device-plugins/kubelet.sock"
Feb 18 20:42:15 nodes-g86f kubelet[7822]: E0218 20:42:15.255931    7822 eviction_manager.go:243] eviction manager: failed to get get summary stats: failed to get node info: node "nodes-g86f" not found
Feb 18 20:42:15 nodes-g86f kubelet[7822]: I0218 20:42:15.256317    7822 container_manager_linux.go:428] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Feb 18 20:42:20 nodes-g86f kubelet[7822]: W0218 20:42:20.256959    7822 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/
Feb 18 20:42:20 nodes-g86f kubelet[7822]: E0218 20:42:20.257260    7822 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Feb 18 20:42:25 nodes-g86f kubelet[7822]: I0218 20:42:25.079657    7822 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "nodes-g86f"
Feb 18 20:42:25 nodes-g86f kubelet[7822]: I0218 20:42:25.082336    7822 cloud_request_manager.go:108] Node addresses from cloud provider for node "nodes-g86f" collected
Feb 18 20:42:25 nodes-g86f kubelet[7822]: E0218 20:42:25.256077    7822 eviction_manager.go:243] eviction manager: failed to get get summary stats: failed to get node info: node "nodes-g86f" not found
Feb 18 20:42:25 nodes-g86f kubelet[7822]: W0218 20:42:25.258214    7822 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/
Feb 18 20:42:25 nodes-g86f kubelet[7822]: E0218 20:42:25.258673    7822 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Feb 18 20:42:30 nodes-g86f kubelet[7822]: W0218 20:42:30.259980    7822 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/
Feb 18 20:42:30 nodes-g86f kubelet[7822]: E0218 20:42:30.260564    7822 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Masters are up though

$ kubectl get nodes
NAME                     STATUS    ROLES     AGE       VERSION
master-us-east1-b-bph2   Ready     master    22m       v1.11.6
master-us-east1-c-2ww4   Ready     master    22m       v1.11.6
master-us-east1-d-hw6r   Ready     master    22m       v1.11.6

Flannel "seems" to be up

 kubectl get pods -n kube-system
NAME                                             READY     STATUS    RESTARTS   AGE
dns-controller-85ffbb4fb-tm4ff                   0/1       Pending   0          23m
etcd-server-events-master-us-east1-b-bph2        1/1       Running   1          22m
etcd-server-events-master-us-east1-c-2ww4        1/1       Running   0          22m
etcd-server-events-master-us-east1-d-hw6r        1/1       Running   0          22m
etcd-server-master-us-east1-b-bph2               1/1       Running   1          23m
etcd-server-master-us-east1-c-2ww4               1/1       Running   0          22m
etcd-server-master-us-east1-d-hw6r               1/1       Running   0          23m
kube-apiserver-master-us-east1-b-bph2            1/1       Running   2          22m
kube-apiserver-master-us-east1-c-2ww4            1/1       Running   0          22m
kube-apiserver-master-us-east1-d-hw6r            1/1       Running   0          22m
kube-controller-manager-master-us-east1-b-bph2   1/1       Running   0          22m
kube-controller-manager-master-us-east1-c-2ww4   1/1       Running   0          22m
kube-controller-manager-master-us-east1-d-hw6r   1/1       Running   0          23m
kube-dns-6b4f4b544c-4sztd                        0/3       Pending   0          23m
kube-dns-autoscaler-6b658bd4d5-gtfcp             0/1       Pending   0          23m
kube-flannel-ds-g6hqp                            1/1       Running   0          23m
kube-flannel-ds-glfhq                            1/1       Running   0          23m
kube-flannel-ds-hlzxd                            1/1       Running   0          23m
kube-proxy-master-us-east1-b-bph2                1/1       Running   0          22m
kube-proxy-master-us-east1-c-2ww4                1/1       Running   0          22m
kube-proxy-master-us-east1-d-hw6r                1/1       Running   0          23m
kube-scheduler-master-us-east1-b-bph2            1/1       Running   0          23m
kube-scheduler-master-us-east1-c-2ww4            1/1       Running   0          22m
kube-scheduler-master-us-east1-d-hw6r            1/1       Running   0          23m

bboreham commented 5 years ago

I ask because I fixed the reported issue in Weave Net at https://github.com/weaveworks/weave/pull/3307; it would be interesting to know whether this lets kops work.

christianh814 commented 5 years ago

@bboreham

I'll test it with the weave CNI plugin tomorrow.

christianh814 commented 5 years ago

So using weave works.

For reference, this is what I ran

kops create cluster \
    --node-count 3 \
    --zones us-east1-b,us-east1-c,us-east1-d \
    --master-zones us-east1-b,us-east1-c,us-east1-d \
    --dns-zone k8s.example.com \
    --node-size n1-standard-2 \
    --master-size n1-standard-2 \
    --networking weave \
    --project $(gcloud config get-value project) \
    --ssh-public-key ~/.ssh/id_rsa.pub \
    --state gs://example-obj-store/ \
    --api-loadbalancer-type public \
    --image "ubuntu-os-cloud/ubuntu-1604-xenial-v20170202" \
    k8s.example.com
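(Side note in case anyone copies this: kops create cluster only writes the spec to the state store; in the usual workflow you then apply and verify it with something like the following, names being placeholders.)

kops update cluster k8s.example.com --state gs://example-obj-store/ --yes   # actually provision the GCE resources
kops validate cluster --state gs://example-obj-store/                        # check that masters and nodes join and go Ready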

Everything came up fine.

$ kubectl get pods --all-namespaces 
NAMESPACE     NAME                                             READY     STATUS    RESTARTS   AGE
kube-system   dns-controller-85ffbb4fb-4b5wg                   1/1       Running   0          13m
kube-system   etcd-server-events-master-us-east1-b-7d52        1/1       Running   0          12m
kube-system   etcd-server-events-master-us-east1-c-k5pw        1/1       Running   0          12m
kube-system   etcd-server-events-master-us-east1-d-0crt        1/1       Running   0          13m
kube-system   etcd-server-master-us-east1-b-7d52               1/1       Running   0          12m
kube-system   etcd-server-master-us-east1-c-k5pw               1/1       Running   0          12m
kube-system   etcd-server-master-us-east1-d-0crt               1/1       Running   0          13m
kube-system   kube-apiserver-master-us-east1-b-7d52            1/1       Running   0          12m
kube-system   kube-apiserver-master-us-east1-c-k5pw            1/1       Running   0          12m
kube-system   kube-apiserver-master-us-east1-d-0crt            1/1       Running   0          12m
kube-system   kube-controller-manager-master-us-east1-b-7d52   1/1       Running   0          12m
kube-system   kube-controller-manager-master-us-east1-c-k5pw   1/1       Running   0          12m
kube-system   kube-controller-manager-master-us-east1-d-0crt   1/1       Running   0          13m
kube-system   kube-dns-6b4f4b544c-sk8sf                        3/3       Running   0          10m
kube-system   kube-dns-6b4f4b544c-xdflk                        3/3       Running   0          13m
kube-system   kube-dns-autoscaler-6b658bd4d5-rpsl2             1/1       Running   0          13m
kube-system   kube-proxy-master-us-east1-b-7d52                1/1       Running   0          12m
kube-system   kube-proxy-master-us-east1-c-k5pw                1/1       Running   0          12m
kube-system   kube-proxy-master-us-east1-d-0crt                1/1       Running   0          12m
kube-system   kube-proxy-nodes-3qjf                            1/1       Running   0          9m
kube-system   kube-proxy-nodes-fj6r                            1/1       Running   0          10m
kube-system   kube-proxy-nodes-l3f6                            1/1       Running   0          10m
kube-system   kube-scheduler-master-us-east1-b-7d52            1/1       Running   0          12m
kube-system   kube-scheduler-master-us-east1-c-k5pw            1/1       Running   0          12m
kube-system   kube-scheduler-master-us-east1-d-0crt            1/1       Running   0          13m
kube-system   weave-net-6pbbl                                  2/2       Running   0          10m
kube-system   weave-net-c6ncv                                  2/2       Running   0          13m
kube-system   weave-net-pw4x7                                  2/2       Running   0          13m
kube-system   weave-net-qgbv4                                  2/2       Running   0          11m
kube-system   weave-net-rf7pm                                  2/2       Running   0          13m
kube-system   weave-net-s6bln                                  2/2       Running   1          11m

I didn't have to "hack" the node status as it was all fine

$ curl http://localhost:8080/api/v1/nodes/nodes-fj6r/status
{
  "kind": "Node",
  "apiVersion": "v1",
  "metadata": {
    "name": "nodes-fj6r",
    "selfLink": "/api/v1/nodes/nodes-fj6r/status",
    "uid": "7708602d-344f-11e9-81e4-42010a8e001d",
    "resourceVersion": "2094",
    "creationTimestamp": "2019-02-19T14:05:51Z",
    "labels": {
      "beta.kubernetes.io/arch": "amd64",
      "beta.kubernetes.io/instance-type": "n1-standard-2",
      "beta.kubernetes.io/os": "linux",
      "failure-domain.beta.kubernetes.io/region": "us-east1",
      "failure-domain.beta.kubernetes.io/zone": "us-east1-b",
      "kops.k8s.io/instancegroup": "nodes",
      "kubernetes.io/hostname": "nodes-fj6r",
      "kubernetes.io/role": "node",
      "node-role.kubernetes.io/node": ""
    },
    "annotations": {
      "node.alpha.kubernetes.io/ttl": "0",
      "volumes.kubernetes.io/controller-managed-attach-detach": "true"
    }
  },
  "spec": {
    "podCIDR": "100.96.4.0/24",
    "providerID": "gce://kops-chx/us-east1-b/nodes-fj6r"
  },
  "status": {
    "capacity": {
      "cpu": "2",
      "ephemeral-storage": "130046416Ki",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "7659276Ki",
      "pods": "110"
    },
    "allocatable": {
      "cpu": "2",
      "ephemeral-storage": "119850776788",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "7556876Ki",
      "pods": "110"
    },
    "conditions": [
      {
        "type": "NetworkUnavailable",
        "status": "False",
        "lastHeartbeatTime": "2019-02-19T14:05:57Z",
        "lastTransitionTime": "2019-02-19T14:05:57Z",
        "reason": "WeaveIsUp",
        "message": "Weave pod has set this"
      },
      {
        "type": "OutOfDisk",
        "status": "False",
        "lastHeartbeatTime": "2019-02-19T14:17:33Z",
        "lastTransitionTime": "2019-02-19T14:05:51Z",
        "reason": "KubeletHasSufficientDisk",
        "message": "kubelet has sufficient disk space available"
      },
      {
        "type": "MemoryPressure",
        "status": "False",
        "lastHeartbeatTime": "2019-02-19T14:17:33Z",
        "lastTransitionTime": "2019-02-19T14:05:51Z",
        "reason": "KubeletHasSufficientMemory",
        "message": "kubelet has sufficient memory available"
      },
      {
        "type": "DiskPressure",
        "status": "False",
        "lastHeartbeatTime": "2019-02-19T14:17:33Z",
        "lastTransitionTime": "2019-02-19T14:05:51Z",
        "reason": "KubeletHasNoDiskPressure",
        "message": "kubelet has no disk pressure"
      },
      {
        "type": "PIDPressure",
        "status": "False",
        "lastHeartbeatTime": "2019-02-19T14:17:33Z",
        "lastTransitionTime": "2019-02-19T14:05:51Z",
        "reason": "KubeletHasSufficientPID",
        "message": "kubelet has sufficient PID available"
      },
      {
        "type": "Ready",
        "status": "True",
        "lastHeartbeatTime": "2019-02-19T14:17:33Z",
        "lastTransitionTime": "2019-02-19T14:06:11Z",
        "reason": "KubeletReady",
        "message": "kubelet is posting ready status. AppArmor enabled"
      }
    ],
    "addresses": [
      {
        "type": "InternalIP",
        "address": "10.142.0.30"
      },
      {
        "type": "ExternalIP",
        "address": "34.73.1.224"
      },
      {
        "type": "Hostname",
        "address": "nodes-fj6r"
      }
    ],
    "daemonEndpoints": {
      "kubeletEndpoint": {
        "Port": 10250
      }
    },
    "nodeInfo": {
      "machineID": "29fcd7edc451f2f25a62066bc395b5e8",
      "systemUUID": "29FCD7ED-C451-F2F2-5A62-066BC395B5E8",
      "bootID": "2dd7efc4-9d95-46a5-a02f-7762afdafe9c",
      "kernelVersion": "4.4.0-62-generic",
      "osImage": "Ubuntu 16.04.1 LTS",
      "containerRuntimeVersion": "docker://17.3.2",
      "kubeletVersion": "v1.11.6",
      "kubeProxyVersion": "v1.11.6",
      "operatingSystem": "linux",
      "architecture": "amd64"
    },
    "images": [
      {
        "names": [
          "protokube:1.11.0"
        ],
        "sizeBytes": 282689309
      },
      {
        "names": [
          "weaveworks/weave-kube@sha256:f1b6edd296cf0b7e806b1a1a1f121c1e8095852a4129edd08401fe2e7aab652d",
          "weaveworks/weave-kube:2.5.0"
        ],
        "sizeBytes": 148083959
      },
      {
        "names": [
          "k8s.gcr.io/kube-proxy@sha256:de320f2611b72465371292c87d892e64b01bf5e27b211b9e8969a239d0f2523a",
          "k8s.gcr.io/kube-proxy:v1.11.6"
        ],
        "sizeBytes": 98120519
      },
      {
        "names": [
          "weaveworks/weave-npc@sha256:5bc9e4241eb0e972d3766864b2aca085660638b9d596d4fe761096db46a8c60b",
          "weaveworks/weave-npc:2.5.0"
        ],
        "sizeBytes": 49506380
      },
      {
        "names": [
          "k8s.gcr.io/pause-amd64@sha256:163ac025575b775d1c0f9bf0bdd0f086883171eb475b5068e7defa4ca9e76516",
          "k8s.gcr.io/pause-amd64:3.0"
        ],
        "sizeBytes": 746888
      }
    ]
  }
}

The only issue is that the public DNS never gets updated from the temp IP; I had to go and change it manually. However, this is a separate issue that seems to already be tracked.
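For reference, the "temp IP" kops puts in DNS is the placeholder address 203.0.113.123 (the same address the earlier "dial tcp 203.0.113.123:443: i/o timeout" errors were hitting); the dns-controller pod is what is supposed to replace it once it can run. A quick way to check, with the domain as a placeholder:

dig +short api.k8s.example.com                        # still 203.0.113.123 => the record was never updated
kubectl -n kube-system logs deploy/dns-controller     # look for errors writing to the DNS zone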

Thanks @bboreham! Looks like the weave plugin works!

christianh814 commented 5 years ago

I did have to edit the firewall rules as @Dirbaio described, as well.
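For anyone who can't find the original workaround: as far as I can tell, the underlying problem for Calico (in its default ipip mode) on GCE is that the firewall rules only allow tcp/udp/icmp between instances, so the IP-in-IP traffic (IP protocol 4) that carries pod-to-pod packets is dropped. A hypothetical sketch of the kind of rule needed, with the rule name, network, and source range as placeholders:

# allow IP-in-IP (protocol number 4) between cluster instances
gcloud compute firewall-rules create k8s-example-com-ipip \
    --network k8s-example-com \
    --allow 4 \
    --source-ranges 10.128.0.0/9    # use your cluster's node/instance CIDR here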