projectcalico / k8s-exec-plugin

[Deprecated] Kubernetes Exec Plugin
Apache License 2.0

Upgrade to Kubernetes 1.1 with calico-docker 0.10.0 (issue #92)

Closed. Smana closed this issue 8 years ago.

Smana commented 9 years ago

Hello,

Following issue #34, I'm trying to upgrade my cluster. As suggested by Casey, I've configured my node as follows:

calicoctl pool show
+----------------+---------+
|   IPv4 CIDR    | Options |
+----------------+---------+
| 10.233.64.0/18 |         |
+----------------+---------+
ip a sho dev cbr0
4: cbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet 172.16.42.2/24 scope global cbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::8464:dbff:fec6:ac32/64 scope link 
       valid_lft forever preferred_lft forever
ps aux | grep docker
root      8056  2.6  1.8 1348252 38420 ?       Ssl  11:14   1:05 /usr/bin/docker daemon -H fd:// --bridge=cbr0 --iptables=false --ip-masq=false
Nov 12 12:01:58 kbnode1 kubelet[17740]: I1112 12:01:58.715419   17740 exec.go:149] Status 'exec' network plugin output: 2015-11-12 12:01:58,687 18467 INFO Begin Calico network plugin execution
Nov 12 12:01:58 kbnode1 kubelet[17740]: 2015-11-12 12:01:58,687 18467 INFO Plugin Args: ['/usr/libexec/kubernetes/kubelet-plugins/net/exec/calico/calico', 'status', 'kube-system', 'kubedns-louck', '4babfcc54b8f17787007a7cce3f00b9e764824b802b8f33e91e232c0
Nov 12 12:01:58 kbnode1 kubelet[17740]: 2015-11-12 12:01:58,688 18467 INFO Executing Calico pod-status hook
Nov 12 12:01:58 kbnode1 kubelet[17740]: 2015-11-12 12:01:58,694 18467 INFO Retrieved IP Address: 10.233.65.0
Nov 12 12:01:58 kbnode1 kubelet[17740]: {"ip": "10.233.65.0", "kind": "PodNetworkStatus", "apiVersion": "v1beta1"}
Nov 12 12:01:58 kbnode1 kubelet[17740]: 2015-11-12 12:01:58,694 18467 INFO Calico network plugin execution complete
Nov 12 12:01:58 kbnode1 kubelet[17740]: , <nil>
Nov 12 12:01:58 kbnode1 kubelet[17740]: E1112 12:01:58.715512   17740 manager.go:377] NetworkPlugin calico failed on the status hook for pod 'kubedns-louck' - invalid character '-' after top-level value
calicoctl endpoint show --d | grep kubedns-louc
| kbnode1  |      docker     | 4babfcc54b8f17787007a7cce3f00b9e764824b802b8f33e91e232c0d8a75322 | 17ed20da893511e5aafe5254008dc07e | 10.233.65.0/32 | 46:12:50:0d:be:ec | kube-system_kubedns-louck_4babfcc54b8f | active |
kubectl describe po kubedns-louck --namespace=kube-system | grep ^IP
IP:             172.16.42.6
6m      25s     9   {kubelet 10.115.99.11}  spec.containers{skydns}         Unhealthy   Liveness probe failed: Get http://172.16.42.6:8080/healthz: dial tcp 172.16.42.6:8080: no route to host

Is there some configuration parameter I've missed?

kubelet is running with the following command:

/usr/local/bin/kubelet --logtostderr=true --v=5 --api_servers=https://10.115.99.10:443 --address=0.0.0.0 --hostname_override=10.115.99.11 --allow_privileged=true --cluster_dns=10.233.0.10 --cluster_domain=cluster.local --kubeconfig=/etc/kubernetes/kubelet.kubeconfig --config=/etc/kubernetes/manifests --network_plugin=calico
kubectl version
Client Version: version.Info{Major:"1", Minor:"1", GitVersion:"v1.1.1", GitCommit:"92635e23dfafb2ddc828c8ac6c03c7a7205a84d8", GitTreeState:"clean"}
Server Version: version.Info{Major:"1", Minor:"1", GitVersion:"v1.1.1", GitCommit:"92635e23dfafb2ddc828c8ac6c03c7a7205a84d8", GitTreeState:"clean"}
calicoctl version
0.10.0

Thank you

caseydavenport commented 9 years ago

Hi @Smana - I believe you're hitting this issue: https://github.com/projectcalico/calico-kubernetes/issues/79

This log tipped me off:

Nov 12 12:01:58 kbnode1 kubelet[17740]: E1112 12:01:58.715512 17740 manager.go:377] NetworkPlugin calico failed on the status hook for pod 'kubedns-louck' - invalid character '-' after top-level value

This should be fixed as of calico-kubernetes version v0.4.0 - you can upgrade the plugin with this command:

calicoctl node --ip=<IP> --kubernetes --kube-plugin-version=v0.5.0
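Once that has run, you can confirm the plugin binary was refreshed with something like this (path taken from your kubelet log above):

ls -l /usr/libexec/kubernetes/kubelet-plugins/net/exec/calico/calico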
Smana commented 9 years ago

Hi Casey, and thank you for your help. I've tested with the --kube-plugin-version option and it did fix the IP assignment issue, but now networking doesn't work on any of my pods. Let's take the pod named "busybox":

| kbnode1  |      docker     | c7bdae2f537ad10260f47fda623f42900172d98dfec941536a2e5ad55734795d | 465ccb5e898b11e586a65254008dc07e |  10.233.64.4/32 | 9a:67:e9:c6:76:b5 |     default_busybox-2_c7bdae2f537a     | active |
kubectl exec busybox-2 -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
107: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 9a:67:e9:c6:76:b5 brd ff:ff:ff:ff:ff:ff
    inet 10.233.64.4/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::9867:e9ff:fec6:76b5/64 scope link 
       valid_lft forever preferred_lft forever
kubectl describe po busybox-2 | grep ^IP
IP:             10.233.64.4

As you can see, the IP address is properly assigned and known by Kubernetes. But when I check the routing table on the node, I can't see a route for it:

default via 10.115.255.253 dev eth0 
10.0.0.0/8 via 10.115.255.254 dev eth0 
10.115.0.0/16 dev eth0  proto kernel  scope link  src 10.115.99.11 
blackhole 10.233.64.0/26  proto bird 
10.233.64.64/26 via 10.115.99.12 dev eth0  proto bird 
172.16.42.0/24 dev cbr0  proto kernel  scope link  src 172.16.42.2 
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.42.1 

I think it is related to the bridge configuration, but I haven't figured it out:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 10000
    link/ether 52:54:00:8d:c0:7e brd ff:ff:ff:ff:ff:ff
    inet 10.115.99.11/16 brd 10.115.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe8d:c07e/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:9f:21:69:9d brd ff:ff:ff:ff:ff:ff
    inet 172.17.42.1/16 scope global docker0
       valid_lft forever preferred_lft forever
4: cbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet 172.16.42.2/24 scope global cbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::8464:dbff:fec6:ac32/64 scope link 
       valid_lft forever preferred_lft forever
96: cali23fa3ace898: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
...

The Docker daemon is started with the cbr0 bridge:

ps aux | grep cbr0
root     19302  4.0  1.5 1464500 31416 ?       Ssl  22:17   0:33 /usr/bin/docker daemon -H fd:// --bridge=cbr0 --iptables=false --ip-masq=false

I don't know if it is relevant, but I have the following logs:

Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,001 31421 [edf054db8115] INFO Plugin Args: ['/usr/libexec/kubernetes/kubelet-plugins/net/exec/calico/calico', 'setup', 'default', 'busybox', 'edf054db8115610147bbbf7ca7e2d6c21ffe1fb8c8f047332491f3c623403a2a']
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,002 31421 [edf054db8115] INFO Using LOG_LEVEL=INFO
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,002 31421 [edf054db8115] INFO Using ETCD_AUTHORITY=10.115.99.10:4001
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,002 31421 [edf054db8115] INFO Using KUBE_API_ROOT=http://10.115.99.10:8080/api/v1/
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,002 31421 [edf054db8115] INFO Using CALICO_IPAM=true
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,003 31421 [edf054db8115] INFO Using DEFAULT_POLICY=allow
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,003 31421 [edf054db8115] INFO Executing Calico pod-creation hook
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,009 31421 [edf054db8115] INFO Configuring docker container edf054db8115610147bbbf7ca7e2d6c21ffe1fb8c8f047332491f3c623403a2a
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,024 31421 [edf054db8115] INFO Deleting docker interface eth0
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,028 31421 [edf054db8115] INFO Container edf054db8115610147bbbf7ca7e2d6c21ffe1fb8c8f047332491f3c623403a2a running with PID 31413
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,028 31421 [edf054db8115] INFO Linking to container's netns
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,128 31421 [edf054db8115] INFO Configuring Calico network interface
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,135 31421 [edf054db8115] INFO Using Calico IPAM
Nov 12 22:40:35 kbnode1 kubelet[18146]: Traceback (most recent call last):
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 850, in emit
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 723, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 467, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: KeyError: 'identity'
Nov 12 22:40:35 kbnode1 kubelet[18146]: Logged from file pycalico.ipam, line 403
Nov 12 22:40:35 kbnode1 kubelet[18146]: Traceback (most recent call last):
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 850, in emit
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 723, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 467, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: KeyError: 'identity'
Nov 12 22:40:35 kbnode1 kubelet[18146]: Logged from file pycalico.ipam, line 403
Nov 12 22:40:35 kbnode1 kubelet[18146]: Traceback (most recent call last):
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 850, in emit
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 723, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 467, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: KeyError: 'identity'
Nov 12 22:40:35 kbnode1 kubelet[18146]: Logged from file pycalico.ipam, line 407
Nov 12 22:40:35 kbnode1 kubelet[18146]: Traceback (most recent call last):
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 850, in emit
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 723, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 467, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: KeyError: 'identity'
Nov 12 22:40:35 kbnode1 kubelet[18146]: Logged from file pycalico.ipam, line 407
Nov 12 22:40:35 kbnode1 kubelet[18146]: Traceback (most recent call last):
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 850, in emit
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 723, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 467, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: KeyError: 'identity'
Nov 12 22:40:35 kbnode1 kubelet[18146]: Logged from file pycalico.ipam, line 411
Nov 12 22:40:35 kbnode1 kubelet[18146]: Traceback (most recent call last):
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 850, in emit
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 723, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: File "/code/build/calico_kubernetes/out00-PYZ.pyz/logging", line 467, in format
Nov 12 22:40:35 kbnode1 kubelet[18146]: KeyError: 'identity'
Nov 12 22:40:35 kbnode1 kubelet[18146]: Logged from file pycalico.ipam, line 411
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,207 31421 [edf054db8115] INFO Creating endpoint with IPs [IPAddress('10.233.64.4')]
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,211 31421 [edf054db8115] INFO Creating the veth with namespace pid 31413 on interface name eth0
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,630 31421 [edf054db8115] INFO Setting mac address 46:de:cd:6a:51:8d to endpoint cali637bbce2898
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,642 31421 [edf054db8115] INFO Using IP Address 10.115.99.11
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,642 31421 [edf054db8115] INFO Adding IP 10.115.99.11 to interface cali637bbce2898
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,647 31421 [edf054db8115] INFO Finished configuring network interface
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,648 31421 [edf054db8115] INFO Getting API Resource: pods from KUBE_API_ROOT: http://10.115.99.10:8080/api/v1/
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,676 31421 [edf054db8115] INFO Configuring Pod Profile: default_busybox_edf054db8115
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,682 31421 [edf054db8115] WARNING No annotations found in pod {u'status': {u'phase': u'Pending', u'startTime': u'2015-11-12T22:40:33Z'}, u'spec': {u'dnsPolicy': u'ClusterFirst', u'serviceAccountName': u'default', u'serviceAccount': u'default', u'terminationGracePeriodSeconds': 30, u'restartPolicy': u'Always', u'volumes': [{u'secret': {u'secretName': u'default-token-sei98'}, u'name': u'default-token-sei98'}], u'containers': [{u'terminationMessagePath': u'/dev/termination-log', u'name': u'busybox', u'image': u'busybox', u'volumeMounts': [{u'readOnly': True, u'mountPath': u'/var/run/secrets/kubernetes.io/serviceaccount', u'name': u'default-token-sei98'}], u'command': [u'sleep', u'3600'], u'imagePullPolicy': u'IfNotPresent', u'resources': {}}], u'nodeName': u'10.115.99.11'}, u'metadata': {u'name': u'busybox', u'namespace': u'default', u'resourceVersion': u'51798', u'creationTimestamp': u'2015-11-12T22:40:33Z', u'selfLink': u'/api/v1/namespaces/default/pods/busybox', u'uid': u'629f036e-898e-11e5-bd72-525400a0dafe'}}
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,730 31421 [edf054db8115] INFO Applying tags
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,737 31421 [edf054db8115] INFO Adding tag namespace_default
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,737 31421 [edf054db8115] WARNING No labels found in pod {u'status': {u'phase': u'Pending', u'startTime': u'2015-11-12T22:40:33Z'}, u'spec': {u'dnsPolicy': u'ClusterFirst', u'serviceAccountName': u'default', u'serviceAccount': u'default', u'terminationGracePeriodSeconds': 30, u'restartPolicy': u'Always', u'volumes': [{u'secret': {u'secretName': u'default-token-sei98'}, u'name': u'default-token-sei98'}], u'containers': [{u'terminationMessagePath': u'/dev/termination-log', u'name': u'busybox', u'image': u'busybox', u'volumeMounts': [{u'readOnly': True, u'mountPath': u'/var/run/secrets/kubernetes.io/serviceaccount', u'name': u'default-token-sei98'}], u'command': [u'sleep', u'3600'], u'imagePullPolicy': u'IfNotPresent', u'resources': {}}], u'nodeName': u'10.115.99.11'}, u'metadata': {u'name': u'busybox', u'namespace': u'default', u'resourceVersion': u'51798', u'creationTimestamp': u'2015-11-12T22:40:33Z', u'selfLink': u'/api/v1/namespaces/default/pods/busybox', u'uid': u'629f036e-898e-11e5-bd72-525400a0dafe'}}
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,738 31421 [edf054db8115] WARNING No labels found in pod {u'status': {u'phase': u'Pending', u'startTime': u'2015-11-12T22:40:33Z'}, u'spec': {u'dnsPolicy': u'ClusterFirst', u'serviceAccountName': u'default', u'serviceAccount': u'default', u'terminationGracePeriodSeconds': 30, u'restartPolicy': u'Always', u'volumes': [{u'secret': {u'secretName': u'default-token-sei98'}, u'name': u'default-token-sei98'}], u'containers': [{u'terminationMessagePath': u'/dev/termination-log', u'name': u'busybox', u'image': u'busybox', u'volumeMounts': [{u'readOnly': True, u'mountPath': u'/var/run/secrets/kubernetes.io/serviceaccount', u'name': u'default-token-sei98'}], u'command': [u'sleep', u'3600'], u'imagePullPolicy': u'IfNotPresent', u'resources': {}}], u'nodeName': u'10.115.99.11'}, u'metadata': {u'name': u'busybox', u'namespace': u'default', u'resourceVersion': u'51798', u'creationTimestamp': u'2015-11-12T22:40:33Z', u'selfLink': u'/api/v1/namespaces/default/pods/busybox', u'uid': u'629f036e-898e-11e5-bd72-525400a0dafe'}}
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,757 31421 [edf054db8115] INFO Finished applying tags.
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,758 31421 [edf054db8115] INFO Setting profile default_busybox_edf054db8115 on endpoint 637bbce2898e11e58f005254008dc07e
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,782 31421 [edf054db8115] INFO Finished configuring profile.
Nov 12 22:40:35 kbnode1 kubelet[18146]: 2015-11-12 22:40:35,783 31421 [edf054db8115] INFO Calico network plugin execution complete
Nov 12 22:40:35 kbnode1 kubelet[18146]: , <nil>
Nov 12 22:40:35 kbnode1 kubelet[18146]: I1112 22:40:35.844708   18146 hairpin.go:97] Enabling hairpin on interface cali637bbce2898
Nov 12 22:40:35 kbnode1 kubelet[18146]: W1112 22:40:35.844871   18146 manager.go:1875] Hairpin setup failed for pod "busybox_default": open /sys/devices/virtual/net/cali637bbce2898/brport/hairpin_mode: no such file or directory
Nov 12 22:40:36 kbnode1 kubelet[18146]: I1112 22:40:36.213766   18146 exec.go:149] Status 'exec' network plugin output: {"ip": "10.233.64.4", "kind": "PodNetworkStatus", "apiVersion": "v1beta1"}
Nov 12 22:40:36 kbnode1 kubelet[18146]: , <nil>

Your help is welcome again :) Thanks

caseydavenport commented 9 years ago

@Smana - Interesting.

I wouldn't expect the bridge config to come into play here, since Calico isn't actually using the cbr0 or docker0 bridges, and the IP subnets assigned to them don't overlap with the Calico subnet.

It does look like you are missing the local route for that container - I'd expect a route that looks something like this -

10.233.64.4/32 dev cali465ccb5e

That route should be installed by the Calico agent felix - could you check the felix logs found in /var/log/calico/felix/ and paste the output here? I'm interested to see if there are any errors in that log.

It might also be useful to see the Kubernetes plugin logs - these can be found in the /var/log/calico/kubernetes/ directory.
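For example, something like this on the node should show both the per-pod routes/interfaces and any recent errors (just a rough sketch; the grep pattern is a guess at what's interesting, not an exact log format):

ip route | grep cali
ip -o link show | grep cali
grep -riE 'error|traceback' /var/log/calico/felix/ | tail -n 50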

Smana commented 9 years ago

I have the following error in /var/log/calico/felix:

2015-11-12 22:17:03,776 [ERROR][5037/19] calico.felix.actor 320: _finish_msg_batch failed.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/calico/felix/actor.py", line 307, in _step
    self._finish_msg_batch(batch, results)
  File "/usr/lib/python2.7/site-packages/calico/felix/endpoint.py", line 329, in _finish_msg_batch
    self._configure_interface()
  File "/usr/lib/python2.7/site-packages/calico/felix/endpoint.py", line 526, in _configure_interface
    if not devices.interface_exists(self._iface_name):
  File "/usr/lib/python2.7/site-packages/calico/felix/devices.py", line 91, in interface_exists
    futils.check_call(["ip", "link", "list", interface])
  File "/usr/lib/python2.7/site-packages/calico/felix/futils.py", line 340, in check_call
    args, retcode, stdout, stderr, input=input_str)
FailedSystemCall: Failed system call (retcode : 1, args : ('ip', 'link', 'list', u'calieab3d8f4893'))
  stdout  : 
  stderr  : ip: can't find device 'calieab3d8f4893'

  input  : None

TrackedAsyncResult ('<disabled>', '<disabled>', 'LocalEndpoint(eab3d8f4893711e5b49e5254008dc07e(IPv4))', 'on_endpoint_update') was leaked with exception FailedSystemCall('ip', 'link', 'list', u'calieab3d8f4893').  Dying.

There are a huge number of errors of this kind in the logs.

Nothing relevant in /var/log/calico/kubernetes:

2015-11-13 08:35:17,026 13294 [68fb31373f44] INFO Begin Calico network plugin execution
2015-11-13 08:35:17,026 13294 [68fb31373f44] INFO Plugin Args: ['/usr/libexec/kubernetes/kubelet-plugins/net/exec/calico/calico', 'setup', 'default', 'busybox-2', '68fb31373f445d894dc0dc05097950403660421b5622781b729a9714fb246b75']
2015-11-13 08:35:17,026 13294 [68fb31373f44] INFO Using LOG_LEVEL=INFO
2015-11-13 08:35:17,027 13294 [68fb31373f44] INFO Using ETCD_AUTHORITY=10.115.99.10:4001
2015-11-13 08:35:17,027 13294 [68fb31373f44] INFO Using KUBE_API_ROOT=http://10.115.99.10:8080/api/v1/
2015-11-13 08:35:17,027 13294 [68fb31373f44] INFO Using CALICO_IPAM=true
2015-11-13 08:35:17,027 13294 [68fb31373f44] INFO Using DEFAULT_POLICY=allow
2015-11-13 08:35:17,028 13294 [68fb31373f44] INFO Executing Calico pod-creation hook
2015-11-13 08:35:17,032 13294 [68fb31373f44] INFO Configuring docker container 68fb31373f445d894dc0dc05097950403660421b5622781b729a9714fb246b75
2015-11-13 08:35:17,048 13294 [68fb31373f44] INFO Deleting docker interface eth0
2015-11-13 08:35:17,053 13294 [68fb31373f44] INFO Container 68fb31373f445d894dc0dc05097950403660421b5622781b729a9714fb246b75 running with PID 13232
2015-11-13 08:35:17,054 13294 [68fb31373f44] INFO Linking to container's netns
2015-11-13 08:35:17,177 13294 [68fb31373f44] INFO Configuring Calico network interface
2015-11-13 08:35:17,185 13294 [68fb31373f44] INFO Using Calico IPAM
2015-11-13 08:35:17,367 13294 [68fb31373f44] INFO Creating endpoint with IPs [IPAddress('10.233.64.6')]
2015-11-13 08:35:17,372 13294 [68fb31373f44] INFO Creating the veth with namespace pid 13232 on interface name eth0
2015-11-13 08:35:17,773 13294 [68fb31373f44] INFO Setting mac address 76:01:32:81:ee:7f to endpoint cali77b557ce89e
2015-11-13 08:35:17,783 13294 [68fb31373f44] INFO Using IP Address 10.115.99.11
2015-11-13 08:35:17,783 13294 [68fb31373f44] INFO Adding IP 10.115.99.11 to interface cali77b557ce89e
2015-11-13 08:35:17,787 13294 [68fb31373f44] INFO Finished configuring network interface
2015-11-13 08:35:17,788 13294 [68fb31373f44] INFO Getting API Resource: pods from KUBE_API_ROOT: http://10.115.99.10:8080/api/v1/
2015-11-13 08:35:17,818 13294 [68fb31373f44] INFO Configuring Pod Profile: default_busybox-2_68fb31373f44
2015-11-13 08:35:17,825 13294 [68fb31373f44] WARNING No annotations found in pod {u'status': {u'phase': u'Pending', u'startTime': u'2015-11-13T08:35:15Z'}, u'spec': {u'dnsPolicy': u'ClusterFirst', u'serviceAccountName': u'default', u'serviceAccount': u'default', u'terminationGracePeriodSeconds': 30, u'restartPolicy': u'Always', u'volumes': [{u'secret': {u'secretName': u'default-token-sei98'}, u'name': u'default-token-sei98'}], u'containers': [{u'terminationMessagePath': u'/dev/termination-log', u'name': u'busybox', u'image': u'busybox', u'volumeMounts': [{u'readOnly': True, u'mountPath': u'/var/run/secrets/kubernetes.io/serviceaccount', u'name': u'default-token-sei98'}], u'command': [u'sleep', u'3600'], u'imagePullPolicy': u'IfNotPresent', u'resources': {}}], u'nodeName': u'10.115.99.11'}, u'metadata': {u'name': u'busybox-2', u'namespace': u'default', u'resourceVersion': u'63476', u'creationTimestamp': u'2015-11-13T08:35:15Z', u'selfLink': u'/api/v1/namespaces/default/pods/busybox-2', u'uid': u'769ebc93-89e1-11e5-bd72-525400a0dafe'}}
2015-11-13 08:35:17,913 13294 [68fb31373f44] INFO Applying tags
2015-11-13 08:35:17,920 13294 [68fb31373f44] INFO Adding tag namespace_default
2015-11-13 08:35:17,920 13294 [68fb31373f44] WARNING No labels found in pod {u'status': {u'phase': u'Pending', u'startTime': u'2015-11-13T08:35:15Z'}, u'spec': {u'dnsPolicy': u'ClusterFirst', u'serviceAccountName': u'default', u'serviceAccount': u'default', u'terminationGracePeriodSeconds': 30, u'restartPolicy': u'Always', u'volumes': [{u'secret': {u'secretName': u'default-token-sei98'}, u'name': u'default-token-sei98'}], u'containers': [{u'terminationMessagePath': u'/dev/termination-log', u'name': u'busybox', u'image': u'busybox', u'volumeMounts': [{u'readOnly': True, u'mountPath': u'/var/run/secrets/kubernetes.io/serviceaccount', u'name': u'default-token-sei98'}], u'command': [u'sleep', u'3600'], u'imagePullPolicy': u'IfNotPresent', u'resources': {}}], u'nodeName': u'10.115.99.11'}, u'metadata': {u'name': u'busybox-2', u'namespace': u'default', u'resourceVersion': u'63476', u'creationTimestamp': u'2015-11-13T08:35:15Z', u'selfLink': u'/api/v1/namespaces/default/pods/busybox-2', u'uid': u'769ebc93-89e1-11e5-bd72-525400a0dafe'}}
2015-11-13 08:35:17,921 13294 [68fb31373f44] WARNING No labels found in pod {u'status': {u'phase': u'Pending', u'startTime': u'2015-11-13T08:35:15Z'}, u'spec': {u'dnsPolicy': u'ClusterFirst', u'serviceAccountName': u'default', u'serviceAccount': u'default', u'terminationGracePeriodSeconds': 30, u'restartPolicy': u'Always', u'volumes': [{u'secret': {u'secretName': u'default-token-sei98'}, u'name': u'default-token-sei98'}], u'containers': [{u'terminationMessagePath': u'/dev/termination-log', u'name': u'busybox', u'image': u'busybox', u'volumeMounts': [{u'readOnly': True, u'mountPath': u'/var/run/secrets/kubernetes.io/serviceaccount', u'name': u'default-token-sei98'}], u'command': [u'sleep', u'3600'], u'imagePullPolicy': u'IfNotPresent', u'resources': {}}], u'nodeName': u'10.115.99.11'}, u'metadata': {u'name': u'busybox-2', u'namespace': u'default', u'resourceVersion': u'63476', u'creationTimestamp': u'2015-11-13T08:35:15Z', u'selfLink': u'/api/v1/namespaces/default/pods/busybox-2', u'uid': u'769ebc93-89e1-11e5-bd72-525400a0dafe'}}
2015-11-13 08:35:17,955 13294 [68fb31373f44] INFO Finished applying tags.
2015-11-13 08:35:17,956 13294 [68fb31373f44] INFO Setting profile default_busybox-2_68fb31373f44 on endpoint 77b557ce89e111e5a8e15254008dc07e
2015-11-13 08:35:17,981 13294 [68fb31373f44] INFO Finished configuring profile.
2015-11-13 08:35:17,982 13294 [68fb31373f44] INFO Calico network plugin execution complete
Smana commented 9 years ago

In the meantime, I've installed a new cluster from scratch and it works better. My pods can ping each other, but I can't reach the nodes' subnet.

For instance, my DNS container needs to reach the master IP in order to forward DNS requests to my master:

kubectl logs kubedns-tfcje -c skydns | tail -n 2
2015/11/13 10:52:57 skydns: failure to forward request "read udp 10.115.99.1:53: i/o timeout"
kubectl exec busybox -- ping -c 1 10.115.99.1
PING 10.115.99.1 (10.115.99.1): 56 data bytes

--- 10.115.99.1 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
error: error executing remote command: Error executing command in container: Error executing in Docker Container: 1

Is there something missing that would allow pods to reach the node IPs?

In any case, it would still be worth solving the issue behind the felix errors above.

caseydavenport commented 9 years ago

@Smana

I think the skydns log is benign (unless you have a DNS service running at 10.115.99.1).

Let's see if we can figure out why the ping isn't working. Are you running on bare metal? Or are you in a cloud deployment (GCE, AWS, etc.)?

If you're in a cloud deployment, make sure that you've enabled ipip as discussed here: https://github.com/projectcalico/calico-docker/blob/master/docs/FAQ.md#can-i-run-calico-in-a-public-cloud-environment
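If it turns out you need it, enabling IP-in-IP (plus outbound NAT) on your pool should look roughly like this; I'm using your pool CIDR here, so double-check the flags against your calicoctl version:

calicoctl pool add 10.233.64.0/18 --ipip --nat-outgoing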

You'll likely want to perform this step as well: https://github.com/projectcalico/calico-docker/blob/master/docs/kubernetes/AWSIntegration.md#node-connectivity-workaround

If you're running on bare metal, or you've performed the above steps and connectivity is still not working, it would be helpful to see the full IP routing table on both your minion and your master. Could you paste that here?

Smana commented 9 years ago

Actually, the DNS service is provided by a dnsmasq daemon located on the master server. My cluster is running on bare-metal servers, and here are the complete routing tables.

Master

default via 10.115.255.253 dev eth0 
10.0.0.0/8 via 10.115.255.254 dev eth0 
10.115.0.0/16 dev eth0  proto kernel  scope link  src 10.115.99.1 
172.16.42.0/24 dev cbr0  proto kernel  scope link  src 172.16.42.1 
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.42.1 
192.168.0.0/17 via 10.115.255.254 dev eth0 

Node

default via 10.115.255.253 dev eth0 
10.0.0.0/8 via 10.115.255.254 dev eth0 
10.115.0.0/16 dev eth0  proto kernel  scope link  src 10.115.99.2 
10.233.64.0 dev caliad614bcc89e  scope link 
blackhole 10.233.64.0/26  proto bird 
10.233.64.1 dev cali57af05cc89e  scope link 
10.233.64.64/26 via 10.115.99.3 dev eth0  proto bird 
172.16.42.0/24 dev cbr0  proto kernel  scope link  src 172.16.42.2 
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.42.1 
192.168.0.0/17 via 10.115.255.254 dev eth0

Pod

default via 10.115.99.2 dev eth0 
10.115.99.2 dev eth0 

As I said, pods can ping each other:

kbmaster ~ 0 # kubectl exec busybox -- ping -c 1 10.233.64.64
PING 10.233.64.64 (10.233.64.64): 56 data bytes
64 bytes from 10.233.64.64: seq=0 ttl=62 time=0.689 ms

--- 10.233.64.64 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.689/0.689/0.689 ms

I really don't understand why I can't reach the 10.115.0.0/16 network from a pod. Thank you again for your help.

caseydavenport commented 9 years ago

It looks to me like the master doesn't have any routes to the pod IP pool (10.233.0.0/16), so your traffic may be making it to the master, but cannot route back (unless the gateway 10.115.255.253 is pod-aware).

Usually I'd expect these routes to be learned over BGP - how have you configured your BGP topology? Are you running the calico/node container on your master?
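If calico/node isn't running on the master, starting it should look roughly like the command you used on the nodes (placeholders here; adjust ETCD_AUTHORITY and the IP to your setup):

sudo ETCD_AUTHORITY=<etcd-host>:4001 calicoctl node --ip=<master-ip>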

Smana commented 9 years ago

My mistake, that was obvious. I had removed calico/node on the master, so there's no route back. Thank you Casey, and sorry :p

Well, now everything works as expected when I deploy a cluster from scratch. Unfortunately, I'm still facing my reboot issue: when I restart the server, the pods are started but networking doesn't work properly.

After the reboot, the pods have an IP address:

docker exec 3b2f767566a5 ip a sh dev eth0
19: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether aa:1f:8b:bd:98:39 brd ff:ff:ff:ff:ff:ff
    inet 10.233.64.72/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a81f:8bff:febd:9839/64 scope link 
       valid_lft forever preferred_lft forever

docker exec 3b2f767566a5 ip r
default via 10.115.99.11 dev eth0 
10.115.99.11 dev eth0 

But on the host, the route to the veth interface is not created. Below is the routing table of node 1:

default via 10.115.255.253 dev eth0 
10.0.0.0/8 via 10.115.255.254 dev eth0 
10.115.0.0/16 dev eth0  proto kernel  scope link  src 10.115.99.11 
10.233.64.0/26 via 10.115.99.12 dev eth0  proto bird 
blackhole 10.233.64.64/26  proto bird 
172.16.42.0/24 dev cbr0  proto kernel  scope link  src 172.16.42.2 
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.42.1 
192.168.0.0/17 via 10.115.255.254 dev eth0 

The felix logs show the following:

...
TrackedAsyncResult ('<disabled>', '<disabled>', 'LocalEndpoint(f29baba48d0c11e592645254006f763d(IPv6))', 'on_endpoint_update') was leaked with exception FailedSystemCall('ip', 'link', 'list', u'calif29baba48d0').  Dying.
2015-11-17 09:44:30,462 [WARNING][1801/20] calico.felix.devices 279: Cannot check flags for interface calif29baba48d0 (/sys/class/net/calif29baba48d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,463 [WARNING][1801/21] calico.felix.devices 279: Cannot check flags for interface calif2b400468d0 (/sys/class/net/calif2b400468d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,464 [WARNING][1801/22] calico.felix.devices 279: Cannot check flags for interface cali0f641e548d0 (/sys/class/net/cali0f641e548d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,465 [WARNING][1801/23] calico.felix.devices 279: Cannot check flags for interface cali93cfb4a88d0 (/sys/class/net/cali93cfb4a88d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,466 [WARNING][1801/24] calico.felix.devices 279: Cannot check flags for interface cali6e7479fc8d0 (/sys/class/net/cali6e7479fc8d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,467 [WARNING][1801/25] calico.felix.devices 279: Cannot check flags for interface cali6e63feb08d0 (/sys/class/net/cali6e63feb08d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,468 [WARNING][1801/26] calico.felix.devices 279: Cannot check flags for interface cali0f46d6f08d0 (/sys/class/net/cali0f46d6f08d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,473 [WARNING][1801/30] calico.felix.devices 279: Cannot check flags for interface calif29baba48d0 (/sys/class/net/calif29baba48d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,474 [WARNING][1801/31] calico.felix.devices 279: Cannot check flags for interface calif2b400468d0 (/sys/class/net/calif2b400468d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,475 [WARNING][1801/32] calico.felix.devices 279: Cannot check flags for interface cali0f641e548d0 (/sys/class/net/cali0f641e548d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,476 [WARNING][1801/33] calico.felix.devices 279: Cannot check flags for interface cali93cfb4a88d0 (/sys/class/net/cali93cfb4a88d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,477 [WARNING][1801/34] calico.felix.devices 279: Cannot check flags for interface cali6e7479fc8d0 (/sys/class/net/cali6e7479fc8d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,478 [WARNING][1801/35] calico.felix.devices 279: Cannot check flags for interface cali6e63feb08d0 (/sys/class/net/cali6e63feb08d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,479 [WARNING][1801/36] calico.felix.devices 279: Cannot check flags for interface cali0f46d6f08d0 (/sys/class/net/cali0f46d6f08d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:30,581 [ERROR][1801/30] calico.felix.actor 320: _finish_msg_batch failed.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/calico/felix/actor.py", line 307, in _step
    self._finish_msg_batch(batch, results)
  File "/usr/lib/python2.7/site-packages/calico/felix/endpoint.py", line 329, in _finish_msg_batch
    self._configure_interface()
  File "/usr/lib/python2.7/site-packages/calico/felix/endpoint.py", line 526, in _configure_interface
    if not devices.interface_exists(self._iface_name):
  File "/usr/lib/python2.7/site-packages/calico/felix/devices.py", line 91, in interface_exists
    futils.check_call(["ip", "link", "list", interface])
  File "/usr/lib/python2.7/site-packages/calico/felix/futils.py", line 340, in check_call
    args, retcode, stdout, stderr, input=input_str)
FailedSystemCall: Failed system call (retcode : 1, args : ('ip', 'link', 'list', u'calif29baba48d0'))
  stdout  : 
  stderr  : ip: can't find device 'calif29baba48d0'

  input  : None

TrackedAsyncResult ('<disabled>', '<disabled>', 'LocalEndpoint(f29baba48d0c11e592645254006f763d(IPv6))', 'on_endpoint_update') was leaked with exception FailedSystemCall('ip', 'link', 'list', u'calif29baba48d0').  Dying.
2015-11-17 09:44:32,129 [WARNING][1846/20] calico.felix.devices 279: Cannot check flags for interface calif29baba48d0 (/sys/class/net/calif29baba48d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,130 [WARNING][1846/21] calico.felix.devices 279: Cannot check flags for interface calif2b400468d0 (/sys/class/net/calif2b400468d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,132 [WARNING][1846/22] calico.felix.devices 279: Cannot check flags for interface cali0f641e548d0 (/sys/class/net/cali0f641e548d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,134 [WARNING][1846/23] calico.felix.devices 279: Cannot check flags for interface cali93cfb4a88d0 (/sys/class/net/cali93cfb4a88d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,135 [WARNING][1846/24] calico.felix.devices 279: Cannot check flags for interface cali6e7479fc8d0 (/sys/class/net/cali6e7479fc8d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,137 [WARNING][1846/25] calico.felix.devices 279: Cannot check flags for interface cali6e63feb08d0 (/sys/class/net/cali6e63feb08d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,138 [WARNING][1846/26] calico.felix.devices 279: Cannot check flags for interface cali0f46d6f08d0 (/sys/class/net/cali0f46d6f08d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,143 [WARNING][1846/30] calico.felix.devices 279: Cannot check flags for interface calif29baba48d0 (/sys/class/net/calif29baba48d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,148 [WARNING][1846/31] calico.felix.devices 279: Cannot check flags for interface calif2b400468d0 (/sys/class/net/calif2b400468d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,149 [WARNING][1846/32] calico.felix.devices 279: Cannot check flags for interface cali0f641e548d0 (/sys/class/net/cali0f641e548d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,151 [WARNING][1846/33] calico.felix.devices 279: Cannot check flags for interface cali93cfb4a88d0 (/sys/class/net/cali93cfb4a88d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,152 [WARNING][1846/34] calico.felix.devices 279: Cannot check flags for interface cali6e7479fc8d0 (/sys/class/net/cali6e7479fc8d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,154 [WARNING][1846/35] calico.felix.devices 279: Cannot check flags for interface cali6e63feb08d0 (/sys/class/net/cali6e63feb08d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,155 [WARNING][1846/36] calico.felix.devices 279: Cannot check flags for interface cali0f46d6f08d0 (/sys/class/net/cali0f46d6f08d0/flags) - assume down: IOError(2, 'No such file or directory').
2015-11-17 09:44:32,318 [ERROR][1846/20] calico.felix.actor 320: _finish_msg_batch failed.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/calico/felix/actor.py", line 307, in _step
    self._finish_msg_batch(batch, results)
  File "/usr/lib/python2.7/site-packages/calico/felix/endpoint.py", line 329, in _finish_msg_batch
    self._configure_interface()
  File "/usr/lib/python2.7/site-packages/calico/felix/endpoint.py", line 526, in _configure_interface
    if not devices.interface_exists(self._iface_name):
  File "/usr/lib/python2.7/site-packages/calico/felix/devices.py", line 91, in interface_exists
    futils.check_call(["ip", "link", "list", interface])
  File "/usr/lib/python2.7/site-packages/calico/felix/futils.py", line 340, in check_call
    args, retcode, stdout, stderr, input=input_str)
FailedSystemCall: Failed system call (retcode : 1, args : ('ip', 'link', 'list', u'calif29baba48d0'))
  stdout  : 
  stderr  : ip: can't find device 'calif29baba48d0'

  input  : None

TrackedAsyncResult ('<disabled>', '<disabled>', 'LocalEndpoint(f29baba48d0c11e592645254006f763d(IPv4))', 'on_endpoint_update') was leaked with exception FailedSystemCall('ip', 'link', 'list', u'calif29baba48d0').  Dying.
...

And then I can't create pods on this node anymore, because the cali interface is not created even for newly created pods :(

I've generated a diags file if needed: https://transfer.sh/UvGBx/diags-171115-095757.tar.gz

caseydavenport commented 9 years ago

@Smana - Thanks for the diags, I'm going to dig into this today.

I made sure to test this case when v1.1 was released and saw it work, so this might be a regression, or it might just be a subtle difference in the way the clusters are configured.

Hopefully we can get this sorted out once and for all :)

caseydavenport commented 9 years ago

From a log read, it appears that we're at least hitting this issue (though it may not be the cause of the reboot problem): https://github.com/projectcalico/libcalico/issues/47

An upgrade to the latest libcalico should fix it. I've done that in #101, so it will go into the next release.

caseydavenport commented 9 years ago

@Smana - Alright, here is what I think.

We're hitting this exception in Felix:

FailedSystemCall: Failed system call (retcode : 1, args : ('ip', 'link', 'list', u'calif29baba48d0'))
  stdout  : 
  stderr  : ip: can't find device 'calif29baba48d0'

This is expected, since we currently leak endpoints on reboot due to this issue: https://github.com/kubernetes/kubernetes/issues/14940

I'd expect Felix to be resilient to a missing interface, but instead it looks like it is crashing and restarting, and thus preventing the other (valid) endpoints from being programmed. It looks like Felix tries to handle this, but doesn't - I've raised this issue against the Felix codebase: https://github.com/projectcalico/calico/issues/899.

I think the reason this worked when I tried it is that I was only trying on a small number of pods, which meant that it was a lot more likely Felix would process all the good endpoints first (rather than the stale ones). Once I tried with a larger number of pods, it stopped working.
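In the meantime, one rough way to spot the stale state (a check, not a fix) is to compare the endpoints Calico still lists for the node against the interfaces that actually exist after the reboot, along the lines of:

calicoctl endpoint show --detailed | grep <hostname>
ip -o link show | grep cali

Any endpoint whose cali interface is missing from the second list is one of the leaked ones.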

Thanks for raising this and being so responsive. We'll try to fix this up for the next release.

Smana commented 9 years ago

I've upgraded to calico-docker version 0.11.0, released yesterday, which includes libcalico v0.5.0. But I have the same issue (no IP routes after reboot).

In the felix logs, I just have the following:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/calico/etcddriver/__main__.py", line 31, in <module>
    from calico.etcddriver import driver
  File "/usr/lib/python2.7/site-packages/calico/etcddriver/driver.py", line 50, in <module>
    from ijson.backends import yajl2 as ijson
  File "/usr/lib/python2.7/site-packages/ijson/backends/yajl2.py", line 14, in <module>
    yajl = backends.find_yajl(2)
  File "/usr/lib/python2.7/site-packages/ijson/backends/__init__.py", line 19, in find_yajl
    raise YAJLImportError('YAJL shared object not found.')
ijson.backends.YAJLImportError: YAJL shared object not found.

I hope that will help

caseydavenport commented 9 years ago

@Smana - Just to keep you up to date, I think we have a fix for the felix issue, but it needs to propagate into a calico-docker release before you can try it out. I will keep you posted.

We're also working on fixing the upstream Kubernetes issue that ultimately causes this situation.

Smana commented 9 years ago

@caseydavenport - Thank you for the update, I'll look forward to the latest calico-docker release.

Smana commented 8 years ago

Hi guys! Do you have an idea when it will be released, please?

caseydavenport commented 8 years ago

@Smana - Sorry for the delay, I've been out for a week and just got back.

Here is the PR that fixes this issue - it hasn't made its way into a Felix release yet; hopefully that will happen soon: https://github.com/projectcalico/calico/pull/902

caseydavenport commented 8 years ago

Hey @Smana - Not sure if you saw the calico-docker v0.13.0 release go by.

This includes the Felix changes needed to get node reboots working for you. I tried it out this morning and all seemed to work. Let me know when you get a chance to test it!

https://github.com/projectcalico/calico-docker/releases/tag/v0.13.0
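If it helps, the upgrade on each node should be roughly: grab the new calicoctl binary from that release page, stop the running calico/node, and start it again with the new binary. A sketch only (the asset path and flags may differ slightly, so adjust to your setup):

wget https://github.com/projectcalico/calico-docker/releases/download/v0.13.0/calicoctl
chmod +x calicoctl && sudo mv calicoctl /usr/local/bin/calicoctl
sudo calicoctl node stop --force
sudo calicoctl node --ip=<node-ip> --kubernetes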

Smana commented 8 years ago

Hello @caseydavenport, thanks for the update. I did some tests this morning and yes, it fixed the issue :)

Just one question: for a given pod, when the node is restarted, it doesn't keep its IP (another one is assigned). Maybe that's normal?

caseydavenport commented 8 years ago

@Smana - Yes, that is normal. Upon reboot Kubernetes is actually creating a whole new set of pods, so new IP addresses are assigned to them.

Smana commented 8 years ago

Thanks for your help; I've closed the issue. The changes are applied to my playbooks: https://github.com/ansibl8s/setup-kubernetes. See you soon.