kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

Kubernetes v1.6.x: CNI + Self-hosted Calico is broken #494

Closed mumoshu closed 7 years ago

mumoshu commented 7 years ago

Due to the dead-lock between the installation of the CNI config for Calico into /etc/kubernetes/cni/net.d and node readiness, explained in https://github.com/kubernetes/kubernetes/issues/43815, Calico and other pods are stuck Pending:

kubectl get po -n kube-system
NAME                                                                    READY     STATUS    RESTARTS   AGE
calico-policy-controller-2425378810-7nw53                               0/1       Pending   0          1h
heapster-v1.3.0-170904332-xtdf9                                         0/2       Pending   0          1h
kube-apiserver-ip-10-0-0-140.ap-northeast-1.compute.internal            1/1       Running   0          1h
kube-apiserver-ip-10-0-0-41.ap-northeast-1.compute.internal             1/1       Running   0          1h
kube-controller-manager-ip-10-0-0-140.ap-northeast-1.compute.internal   1/1       Running   0          1h
kube-controller-manager-ip-10-0-0-41.ap-northeast-1.compute.internal    1/1       Running   0          1h
kube-dns-1358247298-l64sl                                               0/4       Pending   0          1h
kube-dns-autoscaler-2700806437-8nkv9                                    0/1       Pending   0          1h
kube-proxy-ip-10-0-0-140.ap-northeast-1.compute.internal                1/1       Running   0          1h
kube-proxy-ip-10-0-0-41.ap-northeast-1.compute.internal                 1/1       Running   0          1h
kube-scheduler-ip-10-0-0-140.ap-northeast-1.compute.internal            1/1       Running   0          1h
kube-scheduler-ip-10-0-0-41.ap-northeast-1.compute.internal             1/1       Running   0          1h
kubernetes-dashboard-v1.5.1-xnwmh                                       0/1       Pending   0          1h

due to each node being NotReady while the CNI implementation (Calico) is not ready:

$ journalctl -u kubelet.service
*snip*
cni.go:157] Unable to update cni config: No networks found in /etc/kubernetes/cni/net.d

and results in the dead-lock.

Not sure how we could fix it yet.
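The chicken-and-egg can be sketched as follows. This is not kubelet code, just a minimal illustration of the condition: the node stays NotReady until some CNI config exists in the conf dir, yet the pod that writes that config can only run on a Ready node. `CNI_CONF_DIR` is a stand-in for /etc/kubernetes/cni/net.d.

```shell
# Stand-in directory; defaults to an empty temp dir for illustration.
CNI_CONF_DIR="${CNI_CONF_DIR:-$(mktemp -d)}"

if ls "$CNI_CONF_DIR"/*.conf >/dev/null 2>&1; then
  status="Ready"
else
  # Mirrors the kubelet log line: "No networks found in <dir>"
  status="NotReady"
fi
echo "node status: $status"
```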

mumoshu commented 7 years ago

So, how can self-hosted Calico be properly configured after https://github.com/kubernetes/kubernetes/pull/43474?

mumoshu commented 7 years ago

AFAIK, the install-cni container in the calico-node pod had been populating /opt/bin/cni on the host, which is then mounted into the kubelet pod to provide the Calico CNI plugin. However, the hyperkube image as of today seems to already contain the Calico CNI plugin:

# exec into one of the hyperkube containers
$ docker exec -it 1021179ecff8 sh
# ls /opt/cni/bin
bridge  calico  cnitool  dhcp  flannel  host-local  ipvlan  loopback  macvlan  noop  ptp  tuning

So, would it work if we just stopped mounting the host's /opt/cni/bin into kubelet containers, which is done in https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/cloud-config-controller#L146-L148?

Edit: It doesn't work, because install-cni also populates /etc/kubernetes/cni/net.d/ with a CNI config file.

mumoshu commented 7 years ago

install-cni is working on every node. Now kube-dns, heapster, and kube-dashboard are not working:

https://gist.github.com/mumoshu/2b873613f2b7ffa191ada89df9bc5523

Edit: calico-policy-controller seems to never succeed in watching NetworkPolicy k8s objects:

$ kubesys logs -f calico-policy-controller-2947401832-c9p2q
*snip*
2017-04-05 00:34:34,590 6 INFO Restarting watch on resource: NetworkPolicy
2017-04-05 00:34:35,588 6 INFO Syncing 'Pod' objects
2017-04-05 00:34:35,588 6 INFO Syncing 'Namespace' objects
2017-04-05 00:34:35,592 6 INFO Syncing 'NetworkPolicy' objects
2017-04-05 00:34:35,606 6 ERROR Unhandled exception killed Namespace manager
Traceback (most recent call last):
  File "<string>", line 311, in _manage_resource
  File "<string>", line 424, in _sync_resources
  File "site-packages/requests/models.py", line 808, in json
  File "site-packages/simplejson/__init__.py", line 516, in loads
  File "site-packages/simplejson/decoder.py", line 374, in decode
  File "site-packages/simplejson/decoder.py", line 404, in raw_decode
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2017-04-05 00:34:35,606 6 INFO Restarting watch on resource: Namespace
2017-04-05 00:34:35,610 6 ERROR Unhandled exception killed Pod manager
Traceback (most recent call last):
  File "<string>", line 311, in _manage_resource
  File "<string>", line 424, in _sync_resources
  File "site-packages/requests/models.py", line 808, in json
  File "site-packages/simplejson/__init__.py", line 516, in loads
  File "site-packages/simplejson/decoder.py", line 374, in decode
  File "site-packages/simplejson/decoder.py", line 404, in raw_decode
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2017-04-05 00:34:35,610 6 INFO Restarting watch on resource: Pod
2017-04-05 00:34:35,610 6 ERROR Unhandled exception killed NetworkPolicy manager
Traceback (most recent call last):
  File "<string>", line 311, in _manage_resource
  File "<string>", line 424, in _sync_resources
  File "site-packages/requests/models.py", line 808, in json
  File "site-packages/simplejson/__init__.py", line 516, in loads
  File "site-packages/simplejson/decoder.py", line 374, in decode
  File "site-packages/simplejson/decoder.py", line 404, in raw_decode
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2017-04-05 00:34:35,611 6 INFO Restarting watch on resource: NetworkPolicy
mumoshu commented 7 years ago

After upgrading calico to 2.1.1 with #496, calico-policy-controller emits different errors:

2017-04-05 01:31:16,453 8 INFO Restarting watch on resource: NetworkPolicy
2017-04-05 01:31:17,424 8 INFO Syncing 'Pod' objects
2017-04-05 01:31:17,438 8 INFO Syncing 'Namespace' objects
2017-04-05 01:31:17,454 8 INFO Syncing 'NetworkPolicy' objects
2017-04-05 01:31:22,430 8 WARNING Connection attempt timed out: Pod ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /api/v1/pods (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f584b4eaa90>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 01:31:22,430 8 INFO Restarting watch on resource: Pod
2017-04-05 01:31:22,442 8 WARNING Connection attempt timed out: Namespace ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /api/v1/namespaces (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f584b4cbfd0>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 01:31:22,443 8 INFO Restarting watch on resource: Namespace
2017-04-05 01:31:22,456 8 WARNING Connection attempt timed out: NetworkPolicy ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/extensions/v1beta1/networkpolicies (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f584b46e0d0>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 01:31:22,457 8 INFO Restarting watch on resource: NetworkPolicy
2017-04-05 01:31:23,430 8 INFO Syncing 'Pod' objects
2017-04-05 01:31:23,444 8 INFO Syncing 'Namespace' objects
2017-04-05 01:31:23,458 8 INFO Syncing 'NetworkPolicy' objects
mumoshu commented 7 years ago

Sounds very similar to https://github.com/projectcalico/k8s-policy/issues/42

mumoshu commented 7 years ago

~I believe that to resolve kubernetes.default from calico-policy-controller, kube-dns must be running. However, kube-dns requires calico-policy-controller to be running - a dead-lock.~

The calico-policy-controller container does have an /etc/hosts entry to resolve kubernetes.default without kube-dns. curl kubernetes.default seems to just hang, though:

core@ip-10-0-0-99 ~ $ docker ps | grep poli
e2dcbaeeaf2f        quay.io/calico/kube-policy-controller@sha256:6ccc8caa6bf8ba94f5191532f5e279a307160cc9eb863e90b0200403bf8024d6   "/dist/controller"       22 minutes ago      Up 22 minutes                           k8s_calico-policy-controller_calico-policy-controller-2947401832-lb933_kube-system_00ec08d6-199e-11e7-8582-06c6148c2095_0
35c416a4c751        gcr.io/google_containers/pause-amd64:3.0                                                                        "/pause"                 24 minutes ago      Up 24 minutes                           k8s_POD_calico-policy-controller-2947401832-lb933_kube-system_00ec08d6-199e-11e7-8582-06c6148c2095_0
core@ip-10-0-0-99 ~ $ docker exec -it e2dcbaeeaf2f sh
/ # curl -k https://kubernetes.default:443
sh: curl: not found
/ # apk add --update --no-cache curl
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz
(1/4) Installing ca-certificates (20161130-r0)
(2/4) Installing libssh2 (1.7.0-r0)
(3/4) Installing libcurl (7.52.1-r1)
(4/4) Installing curl (7.52.1-r1)
Executing busybox-1.24.2-r13.trigger
Executing ca-certificates-20161130-r0.trigger
OK: 6 MiB in 15 packages
/ # curl -k https://kubernetes.default:443
^C
/ # cat /etc/hosts
# /etc/hosts: Local Host Database
#
# This file describes a number of aliases-to-address mappings for the for
# local hosts that share this file.
#
# The format of lines in this file is:
#
# IP_ADDRESS    canonical_hostname  [aliases...]
#
#The fields can be separated by any number of spaces or tabs.
#
# In the presence of the domain name service or NIS, this file may not be
# consulted at all; see /etc/host.conf for the resolution order.
#

# IPv4 and IPv6 localhost aliases
127.0.0.1   localhost
::1     localhost

#
# Imaginary network.
#10.0.0.2               myname
#10.0.0.3               myfriend
#
# According to RFC 1918, you can use the following IP networks for private
# nets which will never be connected to the Internet:
#
#       10.0.0.0        -   10.255.255.255
#       172.16.0.0      -   172.31.255.255
#       192.168.0.0     -   192.168.255.255
#
# In case you want to be able to connect directly to the Internet (i.e. not
# behind a NAT, ADSL router, etc...), you need real official assigned
# numbers.  Do not try to invent your own network numbers but instead get one
# from your network provider (if any) or from your regional registry (ARIN,
# APNIC, LACNIC, RIPE NCC, or AfriNIC.)
#
10.3.0.1    kubernetes.default

An iptables rule for the k8s service IP 10.3.0.1 does exist, but the service is not reachable from the container:

core@ip-10-0-0-99 ~ $ sudo iptables-save | grep 10.3.0.1/32
-A KUBE-SERVICES -d 10.3.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
core@ip-10-0-0-99 ~ $ docker exec -it e2dcbaeeaf2f sh
/ # curl -k https://10.3.0.1
^C

However, only one of the two underlying kube-apiserver pod IPs is reachable:

core@ip-10-0-0-99 ~ $ docker exec -it e2dcbaeeaf2f sh
/ # curl -k https://10.0.0.99
Unauthorized
/ # curl -k https://10.0.0.186
^C
/ # curl -k https://10.0.0.99
Unauthorized
/ # curl -k https://10.0.0.186
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.0.0.186
^C
mumoshu commented 7 years ago

In short, it seems calico-policy-controller is trying to communicate with port 443 on one of the controller nodes. As we currently have no security group rule allowing traffic on port 443 among controller nodes, the connection fails whenever the controller node (and its private IP) chosen randomly by the iptables rules configured by kube-proxy is a node other than the one calico-policy-controller is running on.
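The missing rule can be expressed as a self-referencing security-group ingress. A hypothetical CloudFormation fragment (logical IDs here are illustrative, not kube-aws's actual stack-template names):

```yaml
# Illustrative only: allow controller-to-controller traffic on :443 so
# connections to the kubernetes service VIP can land on either apiserver.
IngressControllerToControllerHTTPS:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: { "Ref": "SecurityGroupController" }
    SourceSecurityGroupId: { "Ref": "SecurityGroupController" }
    IpProtocol: tcp
    FromPort: 443
    ToPort: 443
```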

mumoshu commented 7 years ago

Adding an inbound rule to the controller SG allowing port 443 among controller nodes fixed calico-policy-controller. You'll see that the connectivity issue is gone starting from the middle of the log below:

2017-04-05 02:38:08,507 5 INFO Restarting watch on resource: NetworkPolicy
2017-04-05 02:38:09,494 5 INFO Syncing 'Namespace' objects
2017-04-05 02:38:09,507 5 INFO Syncing 'Pod' objects
2017-04-05 02:38:09,509 5 INFO Syncing 'NetworkPolicy' objects
2017-04-05 02:38:14,500 5 WARNING Connection attempt timed out: Namespace ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /api/v1/namespaces (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fdb7d8e4890>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 02:38:14,500 5 INFO Restarting watch on resource: Namespace
2017-04-05 02:38:14,513 5 WARNING Connection attempt timed out: Pod ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /api/v1/pods (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fdb7d8655d0>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 02:38:14,513 5 INFO Restarting watch on resource: Pod
2017-04-05 02:38:14,513 5 WARNING Connection attempt timed out: NetworkPolicy ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/extensions/v1beta1/networkpolicies (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fdb7d8698d0>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 02:38:14,513 5 INFO Restarting watch on resource: NetworkPolicy
2017-04-05 02:38:15,501 5 INFO Syncing 'Namespace' objects
2017-04-05 02:38:15,514 5 INFO 3 existing Namespace(s) - add to queue
2017-04-05 02:38:15,514 5 INFO Syncing 'Pod' objects
2017-04-05 02:38:15,515 5 INFO Syncing 'NetworkPolicy' objects
2017-04-05 02:38:15,519 5 INFO Done getting Namespace(s) - new resourceVersion: 121892
2017-04-05 02:38:15,522 5 INFO Handled MODIFIED for Namespace: (None, u'default')
2017-04-05 02:38:15,530 5 INFO 0 existing NetworkPolicy(s) - add to queue
2017-04-05 02:38:15,530 5 INFO Done getting NetworkPolicy(s) - new resourceVersion: 121892
2017-04-05 02:38:15,544 5 INFO Handled MODIFIED for Namespace: (None, u'kube-public')
2017-04-05 02:38:15,546 5 INFO 31 existing Pod(s) - add to queue
2017-04-05 02:38:15,548 5 INFO Done getting Pod(s) - new resourceVersion: 121892
2017-04-05 02:38:15,554 5 INFO Handled MODIFIED for Namespace: (None, u'kube-system')
2017-04-05 02:38:15,555 5 INFO Labels for kube-system.calico-cni-ip-10-0-0-104.ap-northeast-1.compute.internal have been updated
2017-04-05 02:38:15,555 5 INFO No endpoint for kube-system.calico-cni-ip-10-0-0-104.ap-northeast-1.compute.internal in cache, loading
2017-04-05 02:38:15,558 5 INFO Loaded endpoint and label caches
mumoshu commented 7 years ago

Just noticed that https://github.com/kubernetes-incubator/kube-aws/issues/494#issuecomment-291687137 still persists:

2017-04-05 03:19:09,351 5 INFO Syncing 'Namespace' objects
2017-04-05 03:19:09,360 5 ERROR Unhandled exception killed Namespace manager
Traceback (most recent call last):
  File "<string>", line 311, in _manage_resource
  File "<string>", line 424, in _sync_resources
  File "site-packages/requests/models.py", line 808, in json
  File "site-packages/simplejson/__init__.py", line 516, in loads
  File "site-packages/simplejson/decoder.py", line 374, in decode
  File "site-packages/simplejson/decoder.py", line 404, in raw_decode
JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Edit: This problem was due to the fact that I had removed the /opt/bin/cni mount from the kubelet, which resulted in the kubelet using an older version of calico-cni included in /opt/cni/bin of the kubelet container, rather than the one from the latest calico-cni container.

mumoshu commented 7 years ago

install-cni, calico-node, and calico-policy-controller seem to be working. However, kube-dns, heapster, and kube-dashboard are still not:

$ kubesys get po
NAME                                                                    READY     STATUS              RESTARTS   AGE
calico-cni-ip-10-0-0-147.ap-northeast-1.compute.internal                1/1       Running             0          58m
calico-cni-ip-10-0-0-185.ap-northeast-1.compute.internal                1/1       Running             0          1h
calico-cni-ip-10-0-0-187.ap-northeast-1.compute.internal                1/1       Running             0          58m
calico-cni-ip-10-0-0-205.ap-northeast-1.compute.internal                1/1       Running             0          1h
calico-cni-ip-10-0-0-221.ap-northeast-1.compute.internal                1/1       Running             0          1h
calico-cni-ip-10-0-0-246.ap-northeast-1.compute.internal                1/1       Running             0          1h
calico-cni-ip-10-0-0-67.ap-northeast-1.compute.internal                 1/1       Running             0          1h
calico-node-dml3z                                                       1/1       Running             0          1h
calico-node-g0ghv                                                       1/1       Running             0          1h
calico-node-ndg8f                                                       1/1       Running             0          1h
calico-node-sw8l0                                                       1/1       Running             0          1h
calico-node-tfbkj                                                       1/1       Running             0          1h
calico-node-vk2x9                                                       1/1       Running             0          1h
calico-node-x7fq8                                                       1/1       Running             0          1h
calico-policy-controller-2947401832-r05x2                               1/1       Running             0          1h
heapster-v1.3.0-268032834-g0sgg                                         0/2       ContainerCreating   0          1h
kube-apiserver-ip-10-0-0-205.ap-northeast-1.compute.internal            1/1       Running             0          1h
kube-apiserver-ip-10-0-0-67.ap-northeast-1.compute.internal             1/1       Running             0          1h
kube-controller-manager-ip-10-0-0-205.ap-northeast-1.compute.internal   1/1       Running             0          1h
kube-controller-manager-ip-10-0-0-67.ap-northeast-1.compute.internal    1/1       Running             0          1h
kube-dns-3816048056-f5gf8                                               0/4       ContainerCreating   0          1h
kube-dns-3816048056-qvtfk                                               0/4       ContainerCreating   0          1h
kube-dns-782804071-hpd5p                                                0/4       ContainerCreating   0          1h
kube-dns-782804071-k2r0p                                                0/4       ContainerCreating   0          1h
kube-dns-autoscaler-1464605019-8tqzw                                    0/1       ContainerCreating   0          1h
kube-proxy-ip-10-0-0-147.ap-northeast-1.compute.internal                1/1       Running             0          58m
kube-proxy-ip-10-0-0-185.ap-northeast-1.compute.internal                1/1       Running             0          1h
kube-proxy-ip-10-0-0-187.ap-northeast-1.compute.internal                1/1       Running             0          58m
kube-proxy-ip-10-0-0-205.ap-northeast-1.compute.internal                1/1       Running             0          1h
kube-proxy-ip-10-0-0-221.ap-northeast-1.compute.internal                1/1       Running             0          1h
kube-proxy-ip-10-0-0-246.ap-northeast-1.compute.internal                1/1       Running             0          1h
kube-proxy-ip-10-0-0-67.ap-northeast-1.compute.internal                 1/1       Running             0          1h
kube-scheduler-ip-10-0-0-205.ap-northeast-1.compute.internal            1/1       Running             0          1h
kube-scheduler-ip-10-0-0-67.ap-northeast-1.compute.internal             1/1       Running             0          1h
kubernetes-dashboard-v1.5.1-x8rsr                                       0/1       ContainerCreating   0          1h
mumoshu commented 7 years ago

Hi @heschlie, any chance you have successfully deployed self-hosted Calico into a k8s 1.6.x cluster? I'm hitting the various issues linked from/written here and still have no luck 😢

mumoshu commented 7 years ago

Anyway, recreating all the kube-dns pods by scaling the number of replicas down to 0 and back up to 3 doesn't resolve the problem:

Events:
  FirstSeen LastSeen    Count   From                            SubObjectPath   Type        Reason      Message
  --------- --------    -----   ----                            -------------   --------    ------      -------
  21s       21s     1   default-scheduler                           Normal      Scheduled   Successfully assigned kube-dns-3816048056-9d1l0 to ip-10-0-0-185.ap-northeast-1.compute.internal
  20s       19s     2   kubelet, ip-10-0-0-185.ap-northeast-1.compute.internal          Warning     FailedSync  Error syncing pod, skipping: failed to "CreatePodSandbox" for "kube-dns-3816048056-9d1l0_kube-system(336d8295-19be-11e7-ae4e-06cb38ebccfb)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-dns-3816048056-9d1l0_kube-system(336d8295-19be-11e7-ae4e-06cb38ebccfb)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"kube-dns-3816048056-9d1l0_kube-system\" network: the server has asked for the client to provide credentials (get pods kube-dns-3816048056-9d1l0)"

  20s   6s  3   kubelet, ip-10-0-0-185.ap-northeast-1.compute.internal      Normal  SandboxChanged  Pod sandbox changed, it will be killed and re-created.
  18s   6s  2   kubelet, ip-10-0-0-185.ap-northeast-1.compute.internal      Warning FailedSync  Error syncing pod, skipping: failed to "KillPodSandbox" for "336d8295-19be-11e7-ae4e-06cb38ebccfb" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kube-dns-3816048056-9d1l0_kube-system\" network: open /var/lib/cni/flannel/e950e66dc0f1546791151fd43646d2043339f32026bc270e92918562b555497f: no such file or directory"
Apr 05 05:20:51 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:20:51.722453    2412 remote_runtime.go:109] StopPodSandbox "f94eecee3c8f4621650715b3d7011e74a887c664114a222d17820b0b78a9d80e" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "kubernetes-dashboard-v1.5.1-x8rsr_kube-system" network: open /var/lib/cni/flannel/f94eecee3c8f4621650715b3d7011e74a887c664114a222d17820b0b78a9d80e: no such file or directory
Apr 05 05:20:51 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:20:51.722484    2412 kuberuntime_manager.go:784] Failed to stop sandbox {"docker" "f94eecee3c8f4621650715b3d7011e74a887c664114a222d17820b0b78a9d80e"}
Apr 05 05:20:51 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:20:51.722539    2412 kuberuntime_manager.go:573] killPodWithSyncResult failed: failed to "KillPodSandbox" for "eeb274d9-19b1-11e7-ae4e-06cb38ebccfb" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kubernetes-dashboard-v1.5.1-x8rsr_kube-system\" network: open /var/lib/cni/flannel/f94eecee3c8f4621650715b3d7011e74a887c664114a222d17820b0b78a9d80e: no such file or directory"
Apr 05 05:20:51 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:20:51.722578    2412 pod_workers.go:182] Error syncing pod eeb274d9-19b1-11e7-ae4e-06cb38ebccfb ("kubernetes-dashboard-v1.5.1-x8rsr_kube-system(eeb274d9-19b1-11e7-ae4e-06cb38ebccfb)"), skipping: failed to "KillPodSandbox" for "eeb274d9-19b1-11e7-ae4e-06cb38ebccfb" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kubernetes-dashboard-v1.5.1-x8rsr_kube-system\" network: open /var/lib/cni/flannel/f94eecee3c8f4621650715b3d7011e74a887c664114a222d17820b0b78a9d80e: no such file or directory"
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: I0405 05:21:00.505954    2412 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/eeb65e13-19b1-11e7-ae4e-06cb38ebccfb-default-token-sj32n" (spec.Name: "default-token-sj32n") pod "eeb65e13-19b1-11e7-ae4e-06cb38ebccfb" (UID: "eeb65e13-19b1-11e7-ae4e-06cb38ebccfb").
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: I0405 05:21:00.704173    2412 kuberuntime_manager.go:458] Container {Name:autoscaler Image:gcr.io/google_containers/cluster-proportional-autoscaler-amd64:1.0.0 Command:[/cluster-proportional-autoscaler --namespace=kube-system --configmap=kube-dns-autoscaler --mode=linear --target=Deployment/kube-dns --default-params={"linear":{"coresPerReplica":256,"nodesPerReplica":16,"min":2}} --logtostderr=true --v=2] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[memory:{i:{value:10485760 scale:0} d:{Dec:<nil>} s:10Mi Format:BinarySI} cpu:{i:{value:20 scale:-3} d:{Dec:<nil>} s:20m Format:DecimalSI}]} VolumeMounts:[{Name:default-token-sj32n ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath:}] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: W0405 05:21:00.714516    2412 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "kube-dns-autoscaler-1464605019-8tqzw_kube-system": Unexpected command output nsenter: cannot open : No such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]:  with error: exit status 1
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.720408    2412 cni.go:275] Error deleting network: open /var/lib/cni/flannel/5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c: no such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal dockerd[2304]: time="2017-04-05T05:21:00.721425639Z" level=error msg="Handler for POST /v1.24/containers/5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c/stop returned error: Container 5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c is already stopped"
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.722234    2412 remote_runtime.go:109] StopPodSandbox "5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "kube-dns-autoscaler-1464605019-8tqzw_kube-system" network: open /var/lib/cni/flannel/5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c: no such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.722759    2412 kuberuntime_manager.go:784] Failed to stop sandbox {"docker" "5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c"}
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: W0405 05:21:00.731622    2412 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "kube-dns-autoscaler-1464605019-8tqzw_kube-system": Unexpected command output nsenter: cannot open : No such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]:  with error: exit status 1
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.736955    2412 cni.go:275] Error deleting network: open /var/lib/cni/flannel/9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41: no such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal dockerd[2304]: time="2017-04-05T05:21:00.737908430Z" level=error msg="Handler for POST /v1.24/containers/9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41/stop returned error: Container 9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41 is already stopped"
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.738563    2412 remote_runtime.go:109] StopPodSandbox "9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "kube-dns-autoscaler-1464605019-8tqzw_kube-system" network: open /var/lib/cni/flannel/9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41: no such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.739003    2412 kuberuntime_manager.go:784] Failed to stop sandbox {"docker" "9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41"}
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.739561    2412 kuberuntime_manager.go:573] killPodWithSyncResult failed: failed to "KillPodSandbox" for "eeb65e13-19b1-11e7-ae4e-06cb38ebccfb" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kube-dns-autoscaler-1464605019-8tqzw_kube-system\" network: open /var/lib/cni/flannel/9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41: no such file or directory"
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.740143    2412 pod_workers.go:182] Error syncing pod eeb65e13-19b1-11e7-ae4e-06cb38ebccfb ("kube-dns-autoscaler-1464605019-8tqzw_kube-system(eeb65e13-19b1-11e7-ae4e-06cb38ebccfb)"), skipping: failed to "KillPodSandbox" for "eeb65e13-19b1-11e7-ae4e-06cb38ebccfb" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kube-dns-autoscaler-1464605019-8tqzw_kube-system\" network: open /var/lib/cni/flannel/9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41: no such file or directory"
core@ip-10-0-0-185 ~ $ sudo ls -lah /var/lib/cni/flannel
total 16K
drwx------. 2 root root 4.0K Apr  5 05:10 .
drwxr-xr-x. 4 root root 4.0K Apr  5 03:51 ..
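The teardown failures above can be sketched as follows (an illustration, not kubelet code): the flannel CNI meta-plugin saves its delegate config under /var/lib/cni/flannel/&lt;sandbox-id&gt; on ADD and reads it back on DEL, so once that directory is empty (as shown above), every sandbox teardown fails with ENOENT. `STATE_DIR` and `SANDBOX_ID` are illustrative stand-ins.

```shell
# Stand-ins; the real state dir is /var/lib/cni/flannel.
STATE_DIR="${STATE_DIR:-$(mktemp -d)}"
SANDBOX_ID="e950e66dc0f15467"

if [ -f "$STATE_DIR/$SANDBOX_ID" ]; then
  result="teardown ok: delegate config found"
else
  # Mirrors the kubelet error: "open <file>: no such file or directory"
  result="teardown failed: open $STATE_DIR/$SANDBOX_ID: no such file or directory"
fi
echo "$result"
```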
mumoshu commented 7 years ago

The k8s_auth_token in the Calico CNI config is empty:

ip-10-0-0-185 net.d # ls -lah /etc/kubernetes/cni/net.d/
total 40K
drwxr-xr-x. 3 root root 4.0K Apr  5 03:51 .
drwxr-xr-x. 3 root root 4.0K Apr  5 03:51 ..
-rw-r--r--. 1 root root  766 Apr  5 03:51 10-calico.conf
-rw-r--r--. 1 root root  275 Apr  5 03:51 calico-kubeconfig
drwxr-xr-x. 2 root root 4.0K Apr  5 03:51 calico-tls
ip-10-0-0-185 net.d # cat /etc/kubernetes/cni/net.d/10-calico.conf
{
  "name": "calico",
  "type": "flannel",
  "delegate": {
    "type": "calico",
    "etcd_endpoints": "https://ec2-52-199-161-70.ap-northeast-1.compute.amazonaws.com:2379,https://ec2-54-65-166-197.ap-northeast-1.compute.amazonaws.com:2379,https://ec2-52-198-185-242.ap-northeast-1.compute.amazonaws.com:2379",
    "etcd_key_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-key",
    "etcd_cert_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-cert",
    "etcd_ca_cert_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-ca",
    "log_level": "info",
    "policy": {
      "type": "k8s",
      "k8s_api_root": "https://10.3.0.1:443",
      "k8s_auth_token": ""
    },
    "kubernetes": {
      "kubeconfig": "/etc/kubernetes/cni/net.d/calico-kubeconfig"
    }
  }
}
mumoshu commented 7 years ago

Recreating the calico-cni static pod to re-run install-cni doesn't fix it:

ip-10-0-0-185 manifests # docker logs ef3de620ce42
Installing any TLS assets from /calico-secrets
Wrote Calico CNI binaries to /host/opt/cni/bin/
CNI plugin version: v1.6.1
cat: can't open '/var/run/secrets/kubernetes.io/serviceaccount/token': No such file or directory
Wrote CNI config: {
  "name": "calico",
  "type": "flannel",
  "delegate": {
    "type": "calico",
    "etcd_endpoints": "https://ec2-52-199-161-70.ap-northeast-1.compute.amazonaws.com:2379,https://ec2-54-65-166-197.ap-northeast-1.compute.amazonaws.com:2379,https://ec2-52-198-185-242.ap-northeast-1.compute.amazonaws.com:2379",
    "etcd_key_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-key",
    "etcd_cert_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-cert",
    "etcd_ca_cert_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-ca",
    "log_level": "info",
    "policy": {
      "type": "k8s",
      "k8s_api_root": "https://10.3.0.1:443",
      "k8s_auth_token": ""
    },
    "kubernetes": {
      "kubeconfig": "/etc/kubernetes/cni/net.d/calico-kubeconfig"
    }
  }
}
Done configuring CNI.  Sleep=true

Now, I noticed "cat: can't open '/var/run/secrets/kubernetes.io/serviceaccount/token': No such file or directory" in the log.

So, is a static pod unable to get a service account token via the ServiceAccount admission controller? Is that perhaps due to a race between the static pod and the controller-manager? Then we'd have to stick with the former DaemonSet deployment of calico-node + calico-cni (install-cni), but that results in the dead-lock I've tried to avoid in the first place.
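The failure mode can be sketched from the log line above: install-cni cats the service-account token and templates it into the CNI config, so when the token mount is absent (as in a static pod), the substitution silently yields an empty string. `TOKEN_FILE` is a stand-in for /var/run/secrets/kubernetes.io/serviceaccount/token.

```shell
# Default points at a nonexistent path to simulate a static pod, which has
# no service-account token mount.
TOKEN_FILE="${TOKEN_FILE:-/nonexistent/serviceaccount/token}"

# cat fails, so the token ends up empty -- exactly the symptom in 10-calico.conf.
SERVICEACCOUNT_TOKEN=$(cat "$TOKEN_FILE" 2>/dev/null)
rendered="\"k8s_auth_token\": \"${SERVICEACCOUNT_TOKEN}\""
echo "$rendered"
```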

mumoshu commented 7 years ago

Trying to figure out if I could revert part(s) of the current self-hosted Calico integration to the non-self-hosted version from before https://github.com/kubernetes-incubator/kube-aws/pull/124

gianrubio commented 7 years ago

@mumoshu I suggest you ask in the Calico Slack channel. There's a channel specific to Kubernetes there ;)

mumoshu commented 7 years ago

@gianrubio Thanks for the guidance! Sent a question there, at https://calicousers.slack.com/archives/C0BCA117T/p1491378938091121

mumoshu commented 7 years ago

E2E is passing in #492 with calico enabled 😄

Ran 151 of 588 Specs in 3665.346 seconds
SUCCESS! -- 151 Passed | 0 Failed | 0 Pending | 437 Skipped PASS