Hi, what is the output of kubectl get pods? Are all the consul clients running?
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
consul-consul-7g7bd 1/1 Running 0 91s
consul-consul-njfwh 1/1 Running 0 72s
consul-consul-server-0 1/1 Running 0 11m
consul-consul-server-1 1/1 Running 0 11m
consul-consul-server-2 1/1 Running 0 11m
consul-consul-sync-catalog-758c65bbf7-ht4jr 0/1 CrashLoopBackOff 7 11m
consul-consul-zsd4t 1/1 Running 0 54s
Thanks!
consul-consul-zsd4t says:
2020/01/09 21:49:08 [INFO] memberlist: Marking another-consul-test-ho79 as failed, suspect timeout reached (1 peer confirmations)
2020/01/09 21:49:08 [INFO] serf: EventMemberFailed: another-consul-test-ho79 10.137.100.74
2020/01/09 21:49:11 [WARN] memberlist: Was able to connect to another-consul-test-ho71 but other probes failed, network may be misconfigured
and consul-consul-server-1 is logging repeatedly:
2020/01/09 21:57:36 [INFO] memberlist: Marking another-consul-test-ho79 as failed, suspect timeout reached (1 peer confirmations)
2020/01/09 21:57:36 [INFO] serf: EventMemberFailed: another-consul-test-ho79 10.137.100.74
2020/01/09 21:57:36 [INFO] consul: member 'another-consul-test-ho79' failed, marking health critical
2020/01/09 21:57:37 [INFO] memberlist: Suspect another-consul-test-ho71 has failed, no acks received
Any idea?
I pulled master this morning and did another fresh run, again with only sync set to true. The servers and clients no longer log any network issues; they have been quiet and running healthy for 100 minutes.
The sync catalog's full output looks like:
2020-01-13T16:55:53.670Z [INFO] to-consul/source: starting runner for endpoints
2020-01-13T16:55:53.670Z [INFO] to-k8s/sink: starting runner for syncing
2020-01-13T16:55:53.766Z [INFO] to-consul/controller: initial cache sync complete
2020-01-13T16:55:53.768Z [INFO] to-k8s/controller: initial cache sync complete
2020-01-13T16:55:53.769Z [INFO] to-k8s/sink: upsert: key=default/consul-consul-server
2020-01-13T16:55:53.770Z [INFO] to-k8s/sink: upsert: key=default/consul-consul-ui
2020-01-13T16:55:53.770Z [INFO] to-k8s/sink: upsert: key=default/consul-consul-dns
2020-01-13T16:55:53.770Z [INFO] to-k8s/sink: upsert: key=default/kubernetes
2020-01-13T16:55:53.772Z [INFO] to-consul/source.controller/endpoints: initial cache sync complete
2020-01-13T16:55:53.778Z [INFO] to-consul/source: upsert: key=default/kubernetes
2020-01-13T16:55:53.793Z [INFO] to-consul/source: upsert: key=default/consul-consul-server
2020-01-13T16:55:53.805Z [INFO] to-consul/source: upsert: key=default/consul-consul-ui
2020-01-13T16:55:53.814Z [INFO] to-consul/source: upsert: key=default/consul-consul-dns
2020-01-13T16:55:53.815Z [INFO] to-consul/source: upsert endpoint: key=default/consul-consul-server
2020-01-13T16:55:53.815Z [INFO] to-consul/source: upsert endpoint: key=default/consul-consul-ui
2020-01-13T16:55:53.815Z [INFO] to-consul/source: upsert endpoint: key=default/consul-consul-dns
2020-01-13T16:55:53.815Z [INFO] to-consul/source: upsert endpoint: key=default/kubernetes
2020-01-13T16:56:23.664Z [INFO] to-consul/sink: registering services
2020-01-13T16:56:23.666Z [INFO] to-consul/sink: starting service watcher: service-name=kubernetes-default
2020-01-13T16:56:23.670Z [INFO] to-consul/sink: starting service watcher: service-name=consul-consul-server-default
2020-01-13T16:56:23.671Z [INFO] to-consul/sink: starting service watcher: service-name=consul-consul-ui-default
2020-01-13T16:56:23.671Z [INFO] to-consul/sink: starting service watcher: service-name=consul-consul-dns-default
[GET /health/ready] Error getting leader status: Get https://10.137.248.149:8501/v1/status/leader: dial tcp 10.137.248.149:8501: i/o timeout
pod list looks like:
consul-consul-jmfj5 1/1 Running 0 105m 10.244.1.48 another-consul-test-ho7z <none> <none>
consul-consul-kftbw 1/1 Running 0 105m 10.244.0.84 another-consul-test-ho79 <none> <none>
consul-consul-server-0 1/1 Running 0 105m 10.244.1.1 another-consul-test-ho7z <none> <none>
consul-consul-server-1 1/1 Running 0 105m 10.244.0.29 another-consul-test-ho79 <none> <none>
consul-consul-server-2 1/1 Running 0 105m 10.244.0.135 another-consul-test-ho71 <none> <none>
consul-consul-sync-catalog-5b77c8f6ff-7vldg 0/1 Running 4 3m42s 10.244.1.95 another-consul-test-ho7z <none> <none>
consul-consul-vz8dw 1/1 Running 0 105m 10.244.0.230 another-consul-test-ho71 <none> <none>
We'll have to try it on DO ourselves. My guess is that the node IP and hostPort aren't working somehow and so the sync-catalog pod can't talk to the local agent on the nodeIP/hostport.
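For context on how that path is supposed to work: the chart publishes each client agent's HTTP port on its node via a hostPort, and the sync-catalog pod reaches the agent on its own node through the node IP injected via the downward API. A rough sketch of the pattern (illustrative only, not the chart's exact templates):

# In the client DaemonSet: the agent's HTTP port is published on the node itself.
ports:
  - containerPort: 8500
    hostPort: 8500
    name: http

# In the sync-catalog Deployment: the pod learns its node's IP via the downward API
# and talks to the local agent at that node IP on the hostPort.
env:
  - name: HOST_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP

If traffic from a pod to its own node's IP on that hostPort is broken by the CNI, the sync-catalog readiness check would time out exactly like the [GET /health/ready] error above.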
I have more information that may prove useful:
Let's call my three nodes A, B, and C. I spun up an ubuntu pod shell on each of them. My leader consul-server is on node A.
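(For reference, a minimal manifest for such a debug pod, pinned to one node; the node name and image here are placeholders:)

apiVersion: v1
kind: Pod
metadata:
  name: my-shell-a
spec:
  nodeName: another-consul-test-ho7z   # placeholder: schedule onto node A/B/C as needed
  containers:
    - name: shell
      image: ubuntu:18.04
      command: ["sleep", "infinity"]   # keep the pod running so it can be exec'd into

Then kubectl exec -it my-shell-a -- bash gives the shell used for the curl tests below.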
The two shells that do not share a node with the Consul leader can hit it just fine, i.e.:
$ curl http://10.137.228.158:8500/v1/status/leader
"10.244.1.60:8300"
However, the Ubuntu pod living on node A times out on the same curl. Node A can, however, ping that IP:
root@my-shell2-8694f7f459-xxnbw:/# ping 10.137.228.158
PING 10.137.228.158 (10.137.228.158) 56(84) bytes of data.
64 bytes from 10.137.228.158: icmp_seq=1 ttl=63 time=0.098 ms
64 bytes from 10.137.228.158: icmp_seq=2 ttl=63 time=0.161 ms
^C
--- 10.137.228.158 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1016ms
rtt min/avg/max/mdev = 0.098/0.129/0.161/0.033 ms
root@my-shell2-8694f7f459-xxnbw:/# curl http://10.137.228.158:8500/v1/status/leader
curl: (7) Failed to connect to 10.137.228.158 port 8500: Connection timed out
Checking out kube-system, the IP 10.137.228.158 corresponds to the following pods: csi-do-node-s6xfp, kube-proxy-5x6x2, and cilium-hhl4q, all three of which are on node A.
Okay, finally fixed it. It turns out I actually made this change back in May when my organization first set up Consul in DO, lol.
The workaround is to add
dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true
to client-daemonset.yaml.
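A minimal sketch of where those two fields sit in the DaemonSet's pod spec (generic Kubernetes PodSpec placement, not the chart's exact template):

spec:
  template:
    spec:
      hostNetwork: true                   # client pods share the node's network namespace
      dnsPolicy: ClusterFirstWithHostNet  # keep resolving cluster DNS while on the host network
      containers:
        - name: consul
          # ... rest of the client container spec unchanged ...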
This does feel like a workaround, however; I think it would be valuable to investigate why it cannot work in DO normally.
I've managed to reproduce this. I think it's due to this issue in cilium: https://github.com/cilium/cilium/issues/9784.
I think supporting hostNetwork via a Helm value might be a good solution, as this seems to also occur on Alibaba (although I don't know if hostNetwork works there as a fix).
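A hypothetical shape for such a value (key names are only illustrative, not the chart's actual API):

client:
  hostNetwork: true                      # run client agents on the host network
  dnsPolicy: ClusterFirstWithHostNet     # needed alongside hostNetwork for cluster DNS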
The fix just got added to Cilium's roadmap for their 1.8 release. They don't have 1.7 out yet so it sounds like this will take a while.
Using latest master on a fresh Digital Ocean cluster and only changing syncCatalog to enabled: true results in the following error:
Cluster version is TOR1 - 1.16.2-do.1
All default settings; I created a fresh cluster to test this. I do know that sync catalog works in v0.8.1 of this chart but is now broken in master. Any help is appreciated!
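For reference, the only change from the default values was enabling the sync catalog; assuming the chart's standard syncCatalog key, the values delta is just:

syncCatalog:
  enabled: true   # everything else left at chart defaults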