hashicorp / consul-helm

Helm chart to install Consul and other associated components.

Sync-Catalog i/o timeout #327

Closed: tpetrychyn closed this issue 4 years ago

tpetrychyn commented 4 years ago

Using the latest master on a fresh Digital Ocean cluster, with the only change being syncCatalog set to enabled: true, results in the following error:

2020-01-09T21:37:55.787Z [INFO ] to-consul/sink: starting service watcher: service-name=kubernetes-default
[GET /health/ready] Error getting leader status: Get http://10.137.120.213:8500/v1/status/leader: dial tcp 10.137.120.213:8500: i/o timeout

The cluster is TOR1, version 1.16.2-do.1, with all default settings; I created a fresh cluster to test this.
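For reference, the only override passed to the chart is the one described above; a minimal values.yaml sketch of that change (using the chart's documented syncCatalog key):

    syncCatalog:
      enabled: true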

I do know that sync-catalog works in v0.8.1 of this chart but is now broken on master. Any help is appreciated!

lkysow commented 4 years ago

Hi, what is the output of kubectl get pods? Are all the Consul clients running?

tpetrychyn commented 4 years ago
$ kubectl get pods
NAME                                          READY   STATUS             RESTARTS   AGE
consul-consul-7g7bd                           1/1     Running            0          91s
consul-consul-njfwh                           1/1     Running            0          72s
consul-consul-server-0                        1/1     Running            0          11m
consul-consul-server-1                        1/1     Running            0          11m
consul-consul-server-2                        1/1     Running            0          11m
consul-consul-sync-catalog-758c65bbf7-ht4jr   0/1     CrashLoopBackOff   7          11m
consul-consul-zsd4t                           1/1     Running            0          54s

Thanks!

consul-consul-zsd4t says:

    2020/01/09 21:49:08 [INFO] memberlist: Marking another-consul-test-ho79 as failed, suspect timeout reached (1 peer confirmations)
    2020/01/09 21:49:08 [INFO] serf: EventMemberFailed: another-consul-test-ho79 10.137.100.74
    2020/01/09 21:49:11 [WARN] memberlist: Was able to connect to another-consul-test-ho71 but other probes failed, network may be misconfigured

and consul-consul-server-1 is logging repeatedly:

    2020/01/09 21:57:36 [INFO] memberlist: Marking another-consul-test-ho79 as failed, suspect timeout reached (1 peer confirmations)
    2020/01/09 21:57:36 [INFO] serf: EventMemberFailed: another-consul-test-ho79 10.137.100.74
    2020/01/09 21:57:36 [INFO] consul: member 'another-consul-test-ho79' failed, marking health critical
    2020/01/09 21:57:37 [INFO] memberlist: Suspect another-consul-test-ho71 has failed, no acks received

tpetrychyn commented 4 years ago

Any ideas? I pulled master this morning and did another fresh run, again with only sync set to true. The servers and clients no longer log any network issues; they have been quiet and running healthy for 100 minutes.

The sync-catalog's full output looks like:

2020-01-13T16:55:53.670Z [INFO]  to-consul/source: starting runner for endpoints
2020-01-13T16:55:53.670Z [INFO]  to-k8s/sink: starting runner for syncing
2020-01-13T16:55:53.766Z [INFO]  to-consul/controller: initial cache sync complete
2020-01-13T16:55:53.768Z [INFO]  to-k8s/controller: initial cache sync complete
2020-01-13T16:55:53.769Z [INFO]  to-k8s/sink: upsert: key=default/consul-consul-server
2020-01-13T16:55:53.770Z [INFO]  to-k8s/sink: upsert: key=default/consul-consul-ui
2020-01-13T16:55:53.770Z [INFO]  to-k8s/sink: upsert: key=default/consul-consul-dns
2020-01-13T16:55:53.770Z [INFO]  to-k8s/sink: upsert: key=default/kubernetes
2020-01-13T16:55:53.772Z [INFO]  to-consul/source.controller/endpoints: initial cache sync complete
2020-01-13T16:55:53.778Z [INFO]  to-consul/source: upsert: key=default/kubernetes
2020-01-13T16:55:53.793Z [INFO]  to-consul/source: upsert: key=default/consul-consul-server
2020-01-13T16:55:53.805Z [INFO]  to-consul/source: upsert: key=default/consul-consul-ui
2020-01-13T16:55:53.814Z [INFO]  to-consul/source: upsert: key=default/consul-consul-dns
2020-01-13T16:55:53.815Z [INFO]  to-consul/source: upsert endpoint: key=default/consul-consul-server
2020-01-13T16:55:53.815Z [INFO]  to-consul/source: upsert endpoint: key=default/consul-consul-ui
2020-01-13T16:55:53.815Z [INFO]  to-consul/source: upsert endpoint: key=default/consul-consul-dns
2020-01-13T16:55:53.815Z [INFO]  to-consul/source: upsert endpoint: key=default/kubernetes
2020-01-13T16:56:23.664Z [INFO]  to-consul/sink: registering services
2020-01-13T16:56:23.666Z [INFO]  to-consul/sink: starting service watcher: service-name=kubernetes-default
2020-01-13T16:56:23.670Z [INFO]  to-consul/sink: starting service watcher: service-name=consul-consul-server-default
2020-01-13T16:56:23.671Z [INFO]  to-consul/sink: starting service watcher: service-name=consul-consul-ui-default
2020-01-13T16:56:23.671Z [INFO]  to-consul/sink: starting service watcher: service-name=consul-consul-dns-default
[GET /health/ready] Error getting leader status: Get https://10.137.248.149:8501/v1/status/leader: dial tcp 10.137.248.149:8501: i/o timeout

The pod list looks like:

consul-consul-jmfj5                           1/1     Running   0          105m    10.244.1.48    another-consul-test-ho7z   <none>           <none>
consul-consul-kftbw                           1/1     Running   0          105m    10.244.0.84    another-consul-test-ho79   <none>           <none>
consul-consul-server-0                        1/1     Running   0          105m    10.244.1.1     another-consul-test-ho7z   <none>           <none>
consul-consul-server-1                        1/1     Running   0          105m    10.244.0.29    another-consul-test-ho79   <none>           <none>
consul-consul-server-2                        1/1     Running   0          105m    10.244.0.135   another-consul-test-ho71   <none>           <none>
consul-consul-sync-catalog-5b77c8f6ff-7vldg   0/1     Running   4          3m42s   10.244.1.95    another-consul-test-ho7z   <none>           <none>
consul-consul-vz8dw                           1/1     Running   0          105m    10.244.0.230   another-consul-test-ho71   <none>           <none>

lkysow commented 4 years ago

We'll have to try it on DO ourselves. My guess is that the node IP and hostPort aren't working somehow, so the sync-catalog pod can't talk to the local client agent via the node IP and host port.
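For context on that guess: the chart exposes the client agent's HTTP port on each node via a hostPort, and sync-catalog is pointed at its node's IP through the downward API. Roughly, as a simplified sketch from memory rather than the exact chart templates:

    # client-daemonset.yaml (simplified): expose the agent's HTTP API on the node
    ports:
      - containerPort: 8500
        hostPort: 8500
        name: http

    # sync-catalog-deployment.yaml (simplified): discover this pod's node IP at runtime
    env:
      - name: HOST_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
    # sync-catalog then uses http://$(HOST_IP):8500 as its Consul HTTP address

If that node-IP/hostPort path is broken on DO, you'd see exactly the dial tcp ... i/o timeout above.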

tpetrychyn commented 4 years ago

I have more information that may prove useful:

Let's call my three nodes A, B, and C. I spun up an Ubuntu pod shell on each of them. My leader Consul server is on node A.

The two shells that do not share a node with the Consul leader can hit it just fine, e.g.:

$ curl http://10.137.228.158:8500/v1/status/leader
"10.244.1.60:8300"

However, the Ubuntu shell living on node A times out on the above curl. Node A can, however, ping that IP:

root@my-shell2-8694f7f459-xxnbw:/# ping 10.137.228.158
PING 10.137.228.158 (10.137.228.158) 56(84) bytes of data.
64 bytes from 10.137.228.158: icmp_seq=1 ttl=63 time=0.098 ms
64 bytes from 10.137.228.158: icmp_seq=2 ttl=63 time=0.161 ms
^C
--- 10.137.228.158 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1016ms
rtt min/avg/max/mdev = 0.098/0.129/0.161/0.033 ms
root@my-shell2-8694f7f459-xxnbw:/# curl http://10.137.228.158:8500/v1/status/leader
curl: (7) Failed to connect to 10.137.228.158 port 8500: Connection timed out

Checking kube-system, the IP 10.137.228.158 corresponds to the following pods: csi-do-node-s6xfp, kube-proxy-5x6x2, and cilium-hhl4q, all three of which are on node A.
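For reference, one way to make that check (a hypothetical example, not necessarily the exact command used) is to grep the wide pod listing for the IP:

    kubectl get pods -n kube-system -o wide | grep 10.137.228.158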

tpetrychyn commented 4 years ago

Okay, finally fixed it... it turns out I actually made this same change back in May when my organization first set up Consul in DO, lol.

The workaround is to add

      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true

to client-daemonset.yaml
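Those fields go at the pod template spec level of the daemonset; roughly, as a minimal sketch of the placement rather than the full template:

    # templates/client-daemonset.yaml (excerpt)
    spec:
      template:
        spec:
          dnsPolicy: ClusterFirstWithHostNet
          hostNetwork: true
          containers:
            - name: consul
              # ... rest of the client container unchanged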

tpetrychyn commented 4 years ago

This does feel like a workaround, though; I think it would be valuable to investigate why it doesn't work normally in DO.

lkysow commented 4 years ago

I've managed to reproduce this. I think it's due to this issue in Cilium: https://github.com/cilium/cilium/issues/9784.

I think supporting hostNetwork via a Helm value might be a good solution, as this seems to also occur on Alibaba Cloud (although I don't know if hostNetwork works there as a fix).

lkysow commented 4 years ago

The fix just got added to Cilium's roadmap for their 1.8 release. They don't have 1.7 out yet, so it sounds like this will take a while.

ishustava commented 4 years ago

Hey @tpetrychyn,

We now support setting hostNetwork for the Consul clients (as of release 0.22.0). It also sounds like a longer-term fix will come from Cilium eventually. I'm going to close this, but please let us know if you are still seeing problems.
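For anyone finding this later, enabling it is a values change along these lines (a sketch; check the chart's values.yaml for the exact keys):

    client:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet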