So, how can self-hosted Calico be properly configured after https://github.com/kubernetes/kubernetes/pull/43474?
AFAIK, the `install-cni` container in the `calico-node` pod had been populating `/opt/cni/bin` on the host, which is then mounted into the kubelet container to provide the calico CNI plugin.
However, the hyperkube image as of today seems to already contain the calico CNI plugin. Exec-ing into one of the hyperkube containers:
$ docker exec -it 1021179ecff8 sh
# ls /opt/cni/bin
bridge calico cnitool dhcp flannel host-local ipvlan loopback macvlan noop ptp tuning
So, would it work if we just stop mounting the host's `/opt/cni/bin` into kubelet containers, which is done in https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/cloud-config-controller#L146-L148?
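For reference, the mount in question follows the CoreOS kubelet-wrapper convention and looks roughly like this (a sketch of the relevant flags, not the verbatim kube-aws template):

# Hypothetical excerpt from a kubelet.service drop-in:
# the host's /opt/cni/bin is exposed inside the kubelet's rkt container
Environment="RKT_RUN_ARGS=--volume cni-bin,kind=host,source=/opt/cni/bin \
  --mount volume=cni-bin,target=/opt/cni/bin"

Dropping these flags would leave the kubelet with only the CNI binaries baked into the hyperkube image.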
Edit: It doesn't work, because `install-cni` also populates `/etc/kubernetes/cni/net.d/` with a config file for CNI.
`install-cni` is working on every node. Now, kube-dns, heapster, and kube-dashboard are not working:
https://gist.github.com/mumoshu/2b873613f2b7ffa191ada89df9bc5523
Edit: `calico-policy-controller` never seems to succeed in watching NetworkPolicy k8s objects:
$ kubesys logs -f calico-policy-controller-2947401832-c9p2q
*snip*
2017-04-05 00:34:34,590 6 INFO Restarting watch on resource: NetworkPolicy
2017-04-05 00:34:35,588 6 INFO Syncing 'Pod' objects
2017-04-05 00:34:35,588 6 INFO Syncing 'Namespace' objects
2017-04-05 00:34:35,592 6 INFO Syncing 'NetworkPolicy' objects
2017-04-05 00:34:35,606 6 ERROR Unhandled exception killed Namespace manager
Traceback (most recent call last):
File "<string>", line 311, in _manage_resource
File "<string>", line 424, in _sync_resources
File "site-packages/requests/models.py", line 808, in json
File "site-packages/simplejson/__init__.py", line 516, in loads
File "site-packages/simplejson/decoder.py", line 374, in decode
File "site-packages/simplejson/decoder.py", line 404, in raw_decode
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2017-04-05 00:34:35,606 6 INFO Restarting watch on resource: Namespace
2017-04-05 00:34:35,610 6 ERROR Unhandled exception killed Pod manager
Traceback (most recent call last):
File "<string>", line 311, in _manage_resource
File "<string>", line 424, in _sync_resources
File "site-packages/requests/models.py", line 808, in json
File "site-packages/simplejson/__init__.py", line 516, in loads
File "site-packages/simplejson/decoder.py", line 374, in decode
File "site-packages/simplejson/decoder.py", line 404, in raw_decode
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2017-04-05 00:34:35,610 6 INFO Restarting watch on resource: Pod
2017-04-05 00:34:35,610 6 ERROR Unhandled exception killed NetworkPolicy manager
Traceback (most recent call last):
File "<string>", line 311, in _manage_resource
File "<string>", line 424, in _sync_resources
File "site-packages/requests/models.py", line 808, in json
File "site-packages/simplejson/__init__.py", line 516, in loads
File "site-packages/simplejson/decoder.py", line 374, in decode
File "site-packages/simplejson/decoder.py", line 404, in raw_decode
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2017-04-05 00:34:35,611 6 INFO Restarting watch on resource: NetworkPolicy
After upgrading calico to 2.1.1 with #496, calico-policy-controller emits different errors:
2017-04-05 01:31:16,453 8 INFO Restarting watch on resource: NetworkPolicy
2017-04-05 01:31:17,424 8 INFO Syncing 'Pod' objects
2017-04-05 01:31:17,438 8 INFO Syncing 'Namespace' objects
2017-04-05 01:31:17,454 8 INFO Syncing 'NetworkPolicy' objects
2017-04-05 01:31:22,430 8 WARNING Connection attempt timed out: Pod ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /api/v1/pods (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f584b4eaa90>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 01:31:22,430 8 INFO Restarting watch on resource: Pod
2017-04-05 01:31:22,442 8 WARNING Connection attempt timed out: Namespace ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /api/v1/namespaces (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f584b4cbfd0>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 01:31:22,443 8 INFO Restarting watch on resource: Namespace
2017-04-05 01:31:22,456 8 WARNING Connection attempt timed out: NetworkPolicy ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/extensions/v1beta1/networkpolicies (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f584b46e0d0>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 01:31:22,457 8 INFO Restarting watch on resource: NetworkPolicy
2017-04-05 01:31:23,430 8 INFO Syncing 'Pod' objects
2017-04-05 01:31:23,444 8 INFO Syncing 'Namespace' objects
2017-04-05 01:31:23,458 8 INFO Syncing 'NetworkPolicy' objects
Sounds very similar to https://github.com/projectcalico/k8s-policy/issues/42
~~I believe that to resolve `kubernetes.default` from calico-policy-controller, kube-dns must be running. However, kube-dns requires calico-policy-controller to be running - a dead-lock.~~
The calico-policy-controller container does have an `/etc/hosts` entry to resolve `kubernetes.default` without kube-dns; `curl kubernetes.default` seems to just hang, though:
core@ip-10-0-0-99 ~ $ docker ps | grep poli
e2dcbaeeaf2f quay.io/calico/kube-policy-controller@sha256:6ccc8caa6bf8ba94f5191532f5e279a307160cc9eb863e90b0200403bf8024d6 "/dist/controller" 22 minutes ago Up 22 minutes k8s_calico-policy-controller_calico-policy-controller-2947401832-lb933_kube-system_00ec08d6-199e-11e7-8582-06c6148c2095_0
35c416a4c751 gcr.io/google_containers/pause-amd64:3.0 "/pause" 24 minutes ago Up 24 minutes k8s_POD_calico-policy-controller-2947401832-lb933_kube-system_00ec08d6-199e-11e7-8582-06c6148c2095_0
core@ip-10-0-0-99 ~ $ docker exec -it e2dcbaeeaf2f sh
/ # curl -k https://kubernetes.default:443
sh: curl: not found
/ # apk add --update --no-cache curl
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz
(1/4) Installing ca-certificates (20161130-r0)
(2/4) Installing libssh2 (1.7.0-r0)
(3/4) Installing libcurl (7.52.1-r1)
(4/4) Installing curl (7.52.1-r1)
Executing busybox-1.24.2-r13.trigger
Executing ca-certificates-20161130-r0.trigger
OK: 6 MiB in 15 packages
/ # curl -k https://kubernetes.default:443
^C
/ # cat /etc/hosts
# /etc/hosts: Local Host Database
#
# This file describes a number of aliases-to-address mappings for the for
# local hosts that share this file.
#
# The format of lines in this file is:
#
# IP_ADDRESS canonical_hostname [aliases...]
#
#The fields can be separated by any number of spaces or tabs.
#
# In the presence of the domain name service or NIS, this file may not be
# consulted at all; see /etc/host.conf for the resolution order.
#
# IPv4 and IPv6 localhost aliases
127.0.0.1 localhost
::1 localhost
#
# Imaginary network.
#10.0.0.2 myname
#10.0.0.3 myfriend
#
# According to RFC 1918, you can use the following IP networks for private
# nets which will never be connected to the Internet:
#
# 10.0.0.0 - 10.255.255.255
# 172.16.0.0 - 172.31.255.255
# 192.168.0.0 - 192.168.255.255
#
# In case you want to be able to connect directly to the Internet (i.e. not
# behind a NAT, ADSL router, etc...), you need real official assigned
# numbers. Do not try to invent your own network numbers but instead get one
# from your network provider (if any) or from your regional registry (ARIN,
# APNIC, LACNIC, RIPE NCC, or AfriNIC.)
#
10.3.0.1 kubernetes.default
An iptables rule for the k8s service IP `10.3.0.1` does exist, but the IP is not reachable from the container:
core@ip-10-0-0-99 ~ $ sudo iptables-save | grep 10.3.0.1/32
-A KUBE-SERVICES -d 10.3.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
core@ip-10-0-0-99 ~ $ docker exec -it e2dcbaeeaf2f sh
/ # curl -k https://10.3.0.1
^C
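The apiserver endpoints behind that service IP can be found by following the chain further: kube-proxy emits one KUBE-SEP-* chain per service endpoint, each holding a DNAT rule (a hedged check; the KUBE-SVC chain name is taken from the rule above):

core@ip-10-0-0-99 ~ $ sudo iptables-save -t nat | grep KUBE-SVC-NPX46M4PTMTKRN6Y
core@ip-10-0-0-99 ~ $ sudo iptables-save -t nat | grep 'KUBE-SEP-.*DNAT'
# each KUBE-SEP-* chain carries a DNAT rule pointing at one
# kube-apiserver endpoint, i.e. a controller node's private IP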
However, only one of the 2 underlying kube-apiserver pod IPs is reachable:
core@ip-10-0-0-99 ~ $ docker exec -it e2dcbaeeaf2f sh
/ # curl -k https://10.0.0.99
Unauthorized
/ # curl -k https://10.0.0.186
^C
/ # curl -k https://10.0.0.99
Unauthorized
/ # curl -k https://10.0.0.186
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.3.0.1
^C
/ # curl -k https://10.0.0.186
^C
In short, it seems like calico-policy-controller is trying to communicate with port 443 on one of the controller nodes. As we don't currently have a security group rule allowing communication on port 443 among controller nodes, the connection fails whenever the controller node (and its private IP) chosen randomly by the iptables rules configured by kube-proxy is a node other than the one calico-policy-controller is running on.
Adding an inbound rule to the controller SG allowing 443 among controller nodes fixed calico-policy-controller (see the CLI sketch after the log). You'll see that the connectivity issue is gone starting from the middle of the log below:
2017-04-05 02:38:08,507 5 INFO Restarting watch on resource: NetworkPolicy
2017-04-05 02:38:09,494 5 INFO Syncing 'Namespace' objects
2017-04-05 02:38:09,507 5 INFO Syncing 'Pod' objects
2017-04-05 02:38:09,509 5 INFO Syncing 'NetworkPolicy' objects
2017-04-05 02:38:14,500 5 WARNING Connection attempt timed out: Namespace ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /api/v1/namespaces (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fdb7d8e4890>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 02:38:14,500 5 INFO Restarting watch on resource: Namespace
2017-04-05 02:38:14,513 5 WARNING Connection attempt timed out: Pod ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /api/v1/pods (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fdb7d8655d0>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 02:38:14,513 5 INFO Restarting watch on resource: Pod
2017-04-05 02:38:14,513 5 WARNING Connection attempt timed out: NetworkPolicy ...HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/extensions/v1beta1/networkpolicies (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fdb7d8698d0>, 'Connection to kubernetes.default timed out. (connect timeout=5)'))
2017-04-05 02:38:14,513 5 INFO Restarting watch on resource: NetworkPolicy
2017-04-05 02:38:15,501 5 INFO Syncing 'Namespace' objects
2017-04-05 02:38:15,514 5 INFO 3 existing Namespace(s) - add to queue
2017-04-05 02:38:15,514 5 INFO Syncing 'Pod' objects
2017-04-05 02:38:15,515 5 INFO Syncing 'NetworkPolicy' objects
2017-04-05 02:38:15,519 5 INFO Done getting Namespace(s) - new resourceVersion: 121892
2017-04-05 02:38:15,522 5 INFO Handled MODIFIED for Namespace: (None, u'default')
2017-04-05 02:38:15,530 5 INFO 0 existing NetworkPolicy(s) - add to queue
2017-04-05 02:38:15,530 5 INFO Done getting NetworkPolicy(s) - new resourceVersion: 121892
2017-04-05 02:38:15,544 5 INFO Handled MODIFIED for Namespace: (None, u'kube-public')
2017-04-05 02:38:15,546 5 INFO 31 existing Pod(s) - add to queue
2017-04-05 02:38:15,548 5 INFO Done getting Pod(s) - new resourceVersion: 121892
2017-04-05 02:38:15,554 5 INFO Handled MODIFIED for Namespace: (None, u'kube-system')
2017-04-05 02:38:15,555 5 INFO Labels for kube-system.calico-cni-ip-10-0-0-104.ap-northeast-1.compute.internal have been updated
2017-04-05 02:38:15,555 5 INFO No endpoint for kube-system.calico-cni-ip-10-0-0-104.ap-northeast-1.compute.internal in cache, loading
2017-04-05 02:38:15,558 5 INFO Loaded endpoint and label caches
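For reference, the added rule is an intra-SG allow on 443. With the AWS CLI it would look roughly like this (a sketch; sg-xxxxxxxx is a placeholder for the actual controller SG id):

# allow controller nodes to reach port 443 on each other by
# referencing the controller SG as its own traffic source
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp \
  --port 443 \
  --source-group sg-xxxxxxxx

In kube-aws itself this would rather be expressed as an AWS::EC2::SecurityGroupIngress resource in the generated CloudFormation template.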
Just noticed that https://github.com/kubernetes-incubator/kube-aws/issues/494#issuecomment-291687137 still persists:
2017-04-05 03:19:09,351 5 INFO Syncing 'Namespace' objects
2017-04-05 03:19:09,360 5 ERROR Unhandled exception killed Namespace manager
Traceback (most recent call last):
File "<string>", line 311, in _manage_resource
File "<string>", line 424, in _sync_resources
File "site-packages/requests/models.py", line 808, in json
File "site-packages/simplejson/__init__.py", line 516, in loads
File "site-packages/simplejson/decoder.py", line 374, in decode
File "site-packages/simplejson/decoder.py", line 404, in raw_decode
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Edit: This problem was due to the fact that I removed the `/opt/cni/bin` mount from kubelet, which resulted in kubelet using an older version of calico-cni included in the `/opt/cni/bin` of the kubelet container, rather than the one from the latest calico-cni container.
install-cni, calico-node, and calico-policy-controller seem to be working. However, kube-dns, heapster, and kube-dashboard are still not.
$ kubesys get po
NAME READY STATUS RESTARTS AGE
calico-cni-ip-10-0-0-147.ap-northeast-1.compute.internal 1/1 Running 0 58m
calico-cni-ip-10-0-0-185.ap-northeast-1.compute.internal 1/1 Running 0 1h
calico-cni-ip-10-0-0-187.ap-northeast-1.compute.internal 1/1 Running 0 58m
calico-cni-ip-10-0-0-205.ap-northeast-1.compute.internal 1/1 Running 0 1h
calico-cni-ip-10-0-0-221.ap-northeast-1.compute.internal 1/1 Running 0 1h
calico-cni-ip-10-0-0-246.ap-northeast-1.compute.internal 1/1 Running 0 1h
calico-cni-ip-10-0-0-67.ap-northeast-1.compute.internal 1/1 Running 0 1h
calico-node-dml3z 1/1 Running 0 1h
calico-node-g0ghv 1/1 Running 0 1h
calico-node-ndg8f 1/1 Running 0 1h
calico-node-sw8l0 1/1 Running 0 1h
calico-node-tfbkj 1/1 Running 0 1h
calico-node-vk2x9 1/1 Running 0 1h
calico-node-x7fq8 1/1 Running 0 1h
calico-policy-controller-2947401832-r05x2 1/1 Running 0 1h
heapster-v1.3.0-268032834-g0sgg 0/2 ContainerCreating 0 1h
kube-apiserver-ip-10-0-0-205.ap-northeast-1.compute.internal 1/1 Running 0 1h
kube-apiserver-ip-10-0-0-67.ap-northeast-1.compute.internal 1/1 Running 0 1h
kube-controller-manager-ip-10-0-0-205.ap-northeast-1.compute.internal 1/1 Running 0 1h
kube-controller-manager-ip-10-0-0-67.ap-northeast-1.compute.internal 1/1 Running 0 1h
kube-dns-3816048056-f5gf8 0/4 ContainerCreating 0 1h
kube-dns-3816048056-qvtfk 0/4 ContainerCreating 0 1h
kube-dns-782804071-hpd5p 0/4 ContainerCreating 0 1h
kube-dns-782804071-k2r0p 0/4 ContainerCreating 0 1h
kube-dns-autoscaler-1464605019-8tqzw 0/1 ContainerCreating 0 1h
kube-proxy-ip-10-0-0-147.ap-northeast-1.compute.internal 1/1 Running 0 58m
kube-proxy-ip-10-0-0-185.ap-northeast-1.compute.internal 1/1 Running 0 1h
kube-proxy-ip-10-0-0-187.ap-northeast-1.compute.internal 1/1 Running 0 58m
kube-proxy-ip-10-0-0-205.ap-northeast-1.compute.internal 1/1 Running 0 1h
kube-proxy-ip-10-0-0-221.ap-northeast-1.compute.internal 1/1 Running 0 1h
kube-proxy-ip-10-0-0-246.ap-northeast-1.compute.internal 1/1 Running 0 1h
kube-proxy-ip-10-0-0-67.ap-northeast-1.compute.internal 1/1 Running 0 1h
kube-scheduler-ip-10-0-0-205.ap-northeast-1.compute.internal 1/1 Running 0 1h
kube-scheduler-ip-10-0-0-67.ap-northeast-1.compute.internal 1/1 Running 0 1h
kubernetes-dashboard-v1.5.1-x8rsr 0/1 ContainerCreating 0 1h
Hi @heschlie, any chance you could successfully deploy self-hosted calico into a k8s 1.6.x cluster? I'm hitting various issues linked from/written here and still have no luck 😢
Anyway, recreating all the kube-dns pods by scaling the number of replicas down to 0 and back up to 3 doesn't resolve the problem.
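The recreation step was roughly the following (a sketch; `kubesys` is assumed to be an alias for `kubectl --namespace kube-system`, as used elsewhere in this thread):

# scale the deployment down so all kube-dns pods are deleted
kubesys scale deployment kube-dns --replicas=0
# wait until the old pods are gone, then scale back up
kubesys scale deployment kube-dns --replicas=3

The recreated pods fail with the same events: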
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
21s 21s 1 default-scheduler Normal Scheduled Successfully assigned kube-dns-3816048056-9d1l0 to ip-10-0-0-185.ap-northeast-1.compute.internal
20s 19s 2 kubelet, ip-10-0-0-185.ap-northeast-1.compute.internal Warning FailedSync Error syncing pod, skipping: failed to "CreatePodSandbox" for "kube-dns-3816048056-9d1l0_kube-system(336d8295-19be-11e7-ae4e-06cb38ebccfb)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-dns-3816048056-9d1l0_kube-system(336d8295-19be-11e7-ae4e-06cb38ebccfb)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"kube-dns-3816048056-9d1l0_kube-system\" network: the server has asked for the client to provide credentials (get pods kube-dns-3816048056-9d1l0)"
20s 6s 3 kubelet, ip-10-0-0-185.ap-northeast-1.compute.internal Normal SandboxChanged Pod sandbox changed, it will be killed and re-created.
18s 6s 2 kubelet, ip-10-0-0-185.ap-northeast-1.compute.internal Warning FailedSync Error syncing pod, skipping: failed to "KillPodSandbox" for "336d8295-19be-11e7-ae4e-06cb38ebccfb" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kube-dns-3816048056-9d1l0_kube-system\" network: open /var/lib/cni/flannel/e950e66dc0f1546791151fd43646d2043339f32026bc270e92918562b555497f: no such file or directory"
Apr 05 05:20:51 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:20:51.722453 2412 remote_runtime.go:109] StopPodSandbox "f94eecee3c8f4621650715b3d7011e74a887c664114a222d17820b0b78a9d80e" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "kubernetes-dashboard-v1.5.1-x8rsr_kube-system" network: open /var/lib/cni/flannel/f94eecee3c8f4621650715b3d7011e74a887c664114a222d17820b0b78a9d80e: no such file or directory
Apr 05 05:20:51 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:20:51.722484 2412 kuberuntime_manager.go:784] Failed to stop sandbox {"docker" "f94eecee3c8f4621650715b3d7011e74a887c664114a222d17820b0b78a9d80e"}
Apr 05 05:20:51 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:20:51.722539 2412 kuberuntime_manager.go:573] killPodWithSyncResult failed: failed to "KillPodSandbox" for "eeb274d9-19b1-11e7-ae4e-06cb38ebccfb" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kubernetes-dashboard-v1.5.1-x8rsr_kube-system\" network: open /var/lib/cni/flannel/f94eecee3c8f4621650715b3d7011e74a887c664114a222d17820b0b78a9d80e: no such file or directory"
Apr 05 05:20:51 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:20:51.722578 2412 pod_workers.go:182] Error syncing pod eeb274d9-19b1-11e7-ae4e-06cb38ebccfb ("kubernetes-dashboard-v1.5.1-x8rsr_kube-system(eeb274d9-19b1-11e7-ae4e-06cb38ebccfb)"), skipping: failed to "KillPodSandbox" for "eeb274d9-19b1-11e7-ae4e-06cb38ebccfb" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kubernetes-dashboard-v1.5.1-x8rsr_kube-system\" network: open /var/lib/cni/flannel/f94eecee3c8f4621650715b3d7011e74a887c664114a222d17820b0b78a9d80e: no such file or directory"
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: I0405 05:21:00.505954 2412 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/eeb65e13-19b1-11e7-ae4e-06cb38ebccfb-default-token-sj32n" (spec.Name: "default-token-sj32n") pod "eeb65e13-19b1-11e7-ae4e-06cb38ebccfb" (UID: "eeb65e13-19b1-11e7-ae4e-06cb38ebccfb").
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: I0405 05:21:00.704173 2412 kuberuntime_manager.go:458] Container {Name:autoscaler Image:gcr.io/google_containers/cluster-proportional-autoscaler-amd64:1.0.0 Command:[/cluster-proportional-autoscaler --namespace=kube-system --configmap=kube-dns-autoscaler --mode=linear --target=Deployment/kube-dns --default-params={"linear":{"coresPerReplica":256,"nodesPerReplica":16,"min":2}} --logtostderr=true --v=2] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[memory:{i:{value:10485760 scale:0} d:{Dec:<nil>} s:10Mi Format:BinarySI} cpu:{i:{value:20 scale:-3} d:{Dec:<nil>} s:20m Format:DecimalSI}]} VolumeMounts:[{Name:default-token-sj32n ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath:}] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: W0405 05:21:00.714516 2412 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "kube-dns-autoscaler-1464605019-8tqzw_kube-system": Unexpected command output nsenter: cannot open : No such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: with error: exit status 1
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.720408 2412 cni.go:275] Error deleting network: open /var/lib/cni/flannel/5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c: no such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal dockerd[2304]: time="2017-04-05T05:21:00.721425639Z" level=error msg="Handler for POST /v1.24/containers/5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c/stop returned error: Container 5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c is already stopped"
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.722234 2412 remote_runtime.go:109] StopPodSandbox "5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "kube-dns-autoscaler-1464605019-8tqzw_kube-system" network: open /var/lib/cni/flannel/5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c: no such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.722759 2412 kuberuntime_manager.go:784] Failed to stop sandbox {"docker" "5d4aa3beea5de868a4b10231bfeb7a33f05349b4d3844c5e37d5c027132b2a2c"}
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: W0405 05:21:00.731622 2412 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "kube-dns-autoscaler-1464605019-8tqzw_kube-system": Unexpected command output nsenter: cannot open : No such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: with error: exit status 1
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.736955 2412 cni.go:275] Error deleting network: open /var/lib/cni/flannel/9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41: no such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal dockerd[2304]: time="2017-04-05T05:21:00.737908430Z" level=error msg="Handler for POST /v1.24/containers/9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41/stop returned error: Container 9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41 is already stopped"
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.738563 2412 remote_runtime.go:109] StopPodSandbox "9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "kube-dns-autoscaler-1464605019-8tqzw_kube-system" network: open /var/lib/cni/flannel/9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41: no such file or directory
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.739003 2412 kuberuntime_manager.go:784] Failed to stop sandbox {"docker" "9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41"}
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.739561 2412 kuberuntime_manager.go:573] killPodWithSyncResult failed: failed to "KillPodSandbox" for "eeb65e13-19b1-11e7-ae4e-06cb38ebccfb" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kube-dns-autoscaler-1464605019-8tqzw_kube-system\" network: open /var/lib/cni/flannel/9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41: no such file or directory"
Apr 05 05:21:00 ip-10-0-0-185.ap-northeast-1.compute.internal kubelet-wrapper[2412]: E0405 05:21:00.740143 2412 pod_workers.go:182] Error syncing pod eeb65e13-19b1-11e7-ae4e-06cb38ebccfb ("kube-dns-autoscaler-1464605019-8tqzw_kube-system(eeb65e13-19b1-11e7-ae4e-06cb38ebccfb)"), skipping: failed to "KillPodSandbox" for "eeb65e13-19b1-11e7-ae4e-06cb38ebccfb" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kube-dns-autoscaler-1464605019-8tqzw_kube-system\" network: open /var/lib/cni/flannel/9ec5bce64ac2da7b20507a34a95e12706c8ccdfe17e5f95417b442fe71dc3b41: no such file or directory"
core@ip-10-0-0-185 ~ $ sudo ls -lah /var/lib/cni/flannel
total 16K
drwx------. 2 root root 4.0K Apr 5 05:10 .
drwxr-xr-x. 4 root root 4.0K Apr 5 03:51 ..
`k8s_auth_token` in the calico CNI config is empty:
ip-10-0-0-185 net.d # ls -lah /etc/kubernetes/cni/net.d/
total 40K
drwxr-xr-x. 3 root root 4.0K Apr 5 03:51 .
drwxr-xr-x. 3 root root 4.0K Apr 5 03:51 ..
-rw-r--r--. 1 root root 766 Apr 5 03:51 10-calico.conf
-rw-r--r--. 1 root root 275 Apr 5 03:51 calico-kubeconfig
drwxr-xr-x. 2 root root 4.0K Apr 5 03:51 calico-tls
ip-10-0-0-185 net.d # cat /etc/kubernetes/cni/net.d/10-calico.conf
{
"name": "calico",
"type": "flannel",
"delegate": {
"type": "calico",
"etcd_endpoints": "https://ec2-52-199-161-70.ap-northeast-1.compute.amazonaws.com:2379,https://ec2-54-65-166-197.ap-northeast-1.compute.amazonaws.com:2379,https://ec2-52-198-185-242.ap-northeast-1.compute.amazonaws.com:2379",
"etcd_key_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-key",
"etcd_cert_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-cert",
"etcd_ca_cert_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-ca",
"log_level": "info",
"policy": {
"type": "k8s",
"k8s_api_root": "https://10.3.0.1:443",
"k8s_auth_token": ""
},
"kubernetes": {
"kubeconfig": "/etc/kubernetes/cni/net.d/calico-kubeconfig"
}
}
}
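The token is supposed to come from the pod's service account: the calico install-cni script reads the mounted token and substitutes it into the config template, roughly like this (a sketch of that step, not the verbatim script; the placeholder name and host path are assumptions based on this cluster's layout):

# read the token kubernetes mounts into pods that have a service account...
SERVICEACCOUNT_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
# ...and substitute it into the k8s_auth_token field of the CNI config
sed -i "s|__SERVICEACCOUNT_TOKEN__|${SERVICEACCOUNT_TOKEN}|g" \
  /host/etc/kubernetes/cni/net.d/10-calico.conf

An empty `k8s_auth_token` therefore means the token file was never mounted into the install-cni container.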
Recreating the static pod `calico-cni` to run `install-cni` doesn't fix it:
ip-10-0-0-185 manifests # docker logs ef3de620ce42
Installing any TLS assets from /calico-secrets
Wrote Calico CNI binaries to /host/opt/cni/bin/
CNI plugin version: v1.6.1
cat: can't open '/var/run/secrets/kubernetes.io/serviceaccount/token': No such file or directory
Wrote CNI config: {
"name": "calico",
"type": "flannel",
"delegate": {
"type": "calico",
"etcd_endpoints": "https://ec2-52-199-161-70.ap-northeast-1.compute.amazonaws.com:2379,https://ec2-54-65-166-197.ap-northeast-1.compute.amazonaws.com:2379,https://ec2-52-198-185-242.ap-northeast-1.compute.amazonaws.com:2379",
"etcd_key_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-key",
"etcd_cert_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-cert",
"etcd_ca_cert_file": "/etc/kubernetes/cni/net.d/calico-tls/etcd-ca",
"log_level": "info",
"policy": {
"type": "k8s",
"k8s_api_root": "https://10.3.0.1:443",
"k8s_auth_token": ""
},
"kubernetes": {
"kubeconfig": "/etc/kubernetes/cni/net.d/calico-kubeconfig"
}
}
}
Done configuring CNI. Sleep=true
Now, I noticed "cat: can't open '/var/run/secrets/kubernetes.io/serviceaccount/token': No such file or directory" in the log.
So, is a static pod unable to get a service account token via the ServiceAccount admission controller? Is it perhaps due to a race between the static pod and the controller-manager? (More likely the former: static pods are managed directly by the kubelet and never pass through the apiserver's admission chain, so no token secret is ever injected.) If so, we have to stick with the former daemonset deployment of calico-node + calico-cni (install-cni), but that results in the dead-lock I've tried to avoid in the first place.
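A quick way to confirm is to check for the service-account mount inside the running install-cni container (hedged; <container-id> is a placeholder taken from docker ps):

core@ip-10-0-0-185 ~ $ docker exec -it <container-id> sh
/ # ls /var/run/secrets/kubernetes.io/serviceaccount/
# for a static pod this directory is missing or empty, which matches
# the "can't open .../token" error above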
Trying to figure out if I could revert part(s) of the current self-hosted calico integration to the non-self-hosted version from before https://github.com/kubernetes-incubator/kube-aws/pull/124.
@mumoshu I suggest you ask in the calico slack channel. There's a specific channel for kubernetes there ;)
@gianrubio Thanks for the guidance! Sent a question there, at https://calicousers.slack.com/archives/C0BCA117T/p1491378938091121
E2E is passing in #492 with calico enabled 😄
Ran 151 of 588 Specs in 3665.346 seconds
SUCCESS! -- 151 Passed | 0 Failed | 0 Pending | 437 Skipped PASS
Due to the dead-lock between the installation of the CNI config for calico into `/etc/kubernetes/cni/net.d` and the node readiness check explained in https://github.com/kubernetes/kubernetes/issues/43815, Calico and the other pods stay pending: the node is marked unready while the CNI implementation (= calico) is not ready, and the calico pods that would install it are in turn not scheduled onto an unready node, which closes the dead-lock.
Not sure how we could fix it yet.