hashicorp / consul-k8s

First-class support for Consul Service Mesh on Kubernetes
https://www.consul.io/docs/k8s
Mozilla Public License 2.0

Consul client external to kubernetes connection issue with server on kubernetes #1026

Closed · liad5h closed this 2 years ago

liad5h commented 2 years ago

Question

I have a Consul server (1.9.2) running on AWS EKS with 3 pods and no clients on EKS. EKS version: v1.18.20-eks-c9f1ce. Helm chart version: 0.39.0. I am trying to connect a client running on AWS EC2 (Docker) to the server. Ports 8300-8302 (TCP) and 8301-8302 (UDP) are open on both the server and the client.

In consul members I see the client status is alive:

Node                    Address           Status  Type    Build  Protocol  DC            Segment
consul-consul-server-0  x.x.x.1:8301      alive   server  1.9.2  2         eu-central-1  <all>
consul-consul-server-1  x.x.x.2:8301      alive   server  1.9.2  2         eu-central-1  <all>
consul-consul-server-2  x.x.x.3:8301      alive   server  1.9.2  2         eu-central-1  <all>
1276ef89ca78            172.18.0.4:8301   alive   client  1.8.4  2         eu-central-1  <default>

But the client still flaps up and down in the Consul UI, and its logs show the following warnings non-stop:

    2022-02-08T22:48:10.648Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-0 but other probes failed, network may be misconfigured
    2022-02-08T22:48:11.648Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-1 but other probes failed, network may be misconfigured
    2022-02-08T22:48:12.753Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-2 but other probes failed, network may be misconfigured
    2022-02-08T22:48:13.753Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-2 but other probes failed, network may be misconfigured
    2022-02-08T22:48:14.753Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-0 but other probes failed, network may be misconfigured
    2022-02-08T22:48:15.753Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-1 but other probes failed, network may be misconfigured
    2022-02-08T22:48:16.753Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-2 but other probes failed, network may be misconfigured

I tried all of the following configurations for the Consul server; none of them worked:



Helm Configuration

client:
  enabled: false

global:
  datacenter: "${region}"
  gossipEncryption:
    secretName: "consul-gossip-encryption-key"
    secretKey: "key"
  tls:
    enabled: false
  acls:
    manageSystemACLs: true
  metrics:
    enabled: true
    enableAgentMetrics: true
server:
  enabled: true
ui:
  service:
    port:
      http: 8500
    type: "LoadBalancer"
    annotations: |
      "external-dns.alpha.kubernetes.io/hostname": "some-hostname"
      "service.beta.kubernetes.io/aws-load-balancer-backend-protocol": "http"
      "service.beta.kubernetes.io/aws-load-balancer-scheme": "internal"
      "service.beta.kubernetes.io/aws-load-balancer-ip-address-type": "ipv4"
      "service.beta.kubernetes.io/load-balancer-source-ranges": "some-ranges"
      "service.beta.kubernetes.io/aws-load-balancer-internal": "true"

Logs

Logs from the client:

/ # consul monitor -log-level=trace
2022-02-09T09:54:16.457Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-1 but other probes failed, network may be misconfigured
2022-02-09T09:54:16.957Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: consul-consul-server-2 (timeout reached)
2022-02-09T09:54:17.457Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-2 but other probes failed, network may be misconfigured
2022-02-09T09:54:17.958Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: consul-consul-server-2 (timeout reached)
2022-02-09T09:54:18.457Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-2 but other probes failed, network may be misconfigured
2022-02-09T09:54:18.958Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: consul-consul-server-0 (timeout reached)
2022-02-09T09:54:19.457Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-0 but other probes failed, network may be misconfigured
2022-02-09T09:54:19.958Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: consul-consul-server-1 (timeout reached)
2022-02-09T09:54:20.458Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-1 but other probes failed, network may be misconfigured
2022-02-09T09:54:20.958Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: consul-consul-server-1 (timeout reached)
2022-02-09T09:54:21.458Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-1 but other probes failed, network may be misconfigured
2022-02-09T09:54:21.958Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: consul-consul-server-2 (timeout reached)
2022-02-09T09:54:22.458Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-2 but other probes failed, network may be misconfigured
2022-02-09T09:54:22.958Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: consul-consul-server-0 (timeout reached)
2022-02-09T09:54:23.458Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to consul-consul-server-0 but other probes failed, network may be misconfigured
2022-02-09T09:54:23.959Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: consul-consul-server-0 (timeout reached)

Logs from the server:

2022-02-09T09:55:41.818Z [DEBUG] agent.server.memberlist.lan: memberlist: Failed ping: 1276ef89ca78 (timeout reached)
2022-02-09T09:55:42.318Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect 1276ef89ca78 has failed, no acks received
2022-02-09T09:55:43.318Z [INFO]  agent.server.memberlist.lan: memberlist: Marking 1276ef89ca78 as failed, suspect timeout reached (2 peer confirmations)
2022-02-09T09:55:43.318Z [INFO]  agent.server.serf.lan: serf: EventMemberFailed: 1276ef89ca78 172.18.0.4
2022-02-09T09:55:43.318Z [INFO]  agent.server: member failed, marking health critical: member=1276ef89ca78

2022-02-09T09:55:58.984Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=<ip of EC2 instance>:40288
2022-02-09T09:55:59.780Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:44248 latency=36.742µs
2022-02-09T09:55:59.985Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=<ip of EC2 instance>:40292
2022-02-09T09:56:02.784Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:44252 latency=51.368µs
2022-02-09T09:56:03.985Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=<ip of EC2 instance>:40300
2022-02-09T09:56:05.796Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:44284 latency=44.198µs
2022-02-09T09:56:07.603Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=<ip of EC2 instance>:40308
2022-02-09T09:56:07.684Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: 1276ef89ca78 172.18.0.4
2022-02-09T09:56:07.684Z [INFO]  agent.server: member joined, marking health alive: member=1276ef89ca78
2022-02-09T09:56:07.986Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=<ip of EC2 instance>:40310
2022-02-09T09:56:08.776Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:44290 latency=45.991µs
2022-02-09T09:56:09.818Z [DEBUG] agent.server.memberlist.lan: memberlist: Failed ping: 1276ef89ca78 (timeout reached)
2022-02-09T09:56:10.318Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect 1276ef89ca78 has failed, no acks received
2022-02-09T09:56:10.987Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=<ip of EC2 instance>:40316
2022-02-09T09:56:11.804Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:44298 latency=38.998µs
2022-02-09T09:56:11.987Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=<ip of EC2 instance>:40350
2022-02-09T09:56:12.818Z [DEBUG] agent.server.memberlist.lan: memberlist: Failed ping: 1276ef89ca78 (timeout reached)
2022-02-09T09:56:13.318Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect 1276ef89ca78 has failed, no acks received
2022-02-09T09:56:13.910Z [DEBUG] agent.server.memberlist.wan: memberlist: Initiating push/pull sync with: consul-consul-server-1.eu-central-1 x.x.x.2:8302
2022-02-09T09:56:14.440Z [INFO]  agent.server.serf.lan: serf: EventMemberFailed: 1276ef89ca78 172.18.0.4
2022-02-09T09:56:14.440Z [INFO]  agent.server: member failed, marking health critical: member=1276ef89ca78

Output of consul members on the server, several checks a few seconds apart:

/ $ consul members
Node                    Address           Status  Type    Build  Protocol  DC            Segment
consul-consul-server-0  x.x.x.1:8301      alive   server  1.9.2  2         eu-central-1  <all>
consul-consul-server-1  x.x.x.2:8301      alive   server  1.9.2  2         eu-central-1  <all>
consul-consul-server-2  x.x.x.3:8301      alive   server  1.9.2  2         eu-central-1  <all>
1276ef89ca78            172.18.0.4:8301   failed  client  1.8.4  2         eu-central-1  <default>

/ $ consul members
Node                    Address           Status  Type    Build  Protocol  DC            Segment
consul-consul-server-0  x.x.x.1:8301      alive   server  1.9.2  2         eu-central-1  <all>
consul-consul-server-1  x.x.x.2:8301      alive   server  1.9.2  2         eu-central-1  <all>
consul-consul-server-2  x.x.x.3:8301      alive   server  1.9.2  2         eu-central-1  <all>
1276ef89ca78            172.18.0.4:8301   alive   client  1.8.4  2         eu-central-1  <default>

/ $ consul members
Node                    Address           Status  Type    Build  Protocol  DC            Segment
consul-consul-server-0  x.x.x.1:8301      alive   server  1.9.2  2         eu-central-1  <all>
consul-consul-server-1  x.x.x.2:8301      alive   server  1.9.2  2         eu-central-1  <all>
consul-consul-server-2  x.x.x.3:8301      alive   server  1.9.2  2         eu-central-1  <all>
1276ef89ca78            172.18.0.4:8301   failed  client  1.8.4  2         eu-central-1  <default>

Current understanding and Expected behavior

Since the ports are open and the client sometimes connects to the servers, I expect it to stay connected.

liad5h commented 2 years ago

Also adding the Consul client config file:

{
    "server": false,
    "bind_addr": "0.0.0.0",
    "client_addr": "0.0.0.0",
    "datacenter": "eu-central-1",
    "data_dir": "/var/lib/consul",
    "log_level": "INFO",
    "retry_join": ["provider=k8s kubeconfig=/var/lib/consul/.kube.config namespace=consul label_selector=\"app=consul,component=server\""],
    "verify_incoming": false,
    "acl": {
        "tokens": {
            "agent": "<token>"
        },
        "enabled": true,
        "down_policy": "extend-cache",
        "enable_token_persistence": true
    },
    "encrypt": "<encrypt>",
    "encrypt_verify_incoming": true,
    "encrypt_verify_outgoing": true,
    "primary_datacenter": "eu-central-1"
}
t-eckert commented 2 years ago

Hi @liad5h, thank you so much for reaching out. It seems from what you have provided that you are doing everything right. We have seen flapping like that before when a user has not allowed access to their cluster via both TCP and UDP, but you clearly have.

I'm thinking this may have to do with an infrastructure problem where your ports may not be properly open for pod IPs.

I assume you have looked at this documentation already given the completeness of your question, but there is a docs page on running Consul clients outside of Kubernetes. There may be a detail in there that will help.

I'm sorry I can't see anything wrong with what you are sharing here.

liad5h commented 2 years ago

Hey @t-eckert

What is the best way to verify whether one of my ports is closed? I used telnet for TCP and nc for UDP.

I will try to follow this guide again and see if I missed anything.

Maybe you can point me to the port that, if blocked, would cause such an issue?
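For what it's worth, the TCP side of such a check can also be scripted with a plain socket connect; note that this says nothing definitive about UDP, where a timeout is indistinguishable from a filtered port, which is exactly why gossip can flap while telnet succeeds. The hostname below is a hypothetical placeholder for a Consul server address:

```python
import socket

def check_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unresolvable
        return False

if __name__ == "__main__":
    # Serf LAN (8301) and WAN (8302) gossip ports from the issue above.
    # "consul.example.internal" is a placeholder, not a real endpoint.
    for port in (8301, 8302):
        state = "open" if check_tcp("consul.example.internal", port) else "closed/filtered"
        print(f"tcp/{port}: {state}")
```

Remember that a passing TCP check here still leaves the UDP probe path unverified, which matches the "Was able to connect ... but other probes failed" warnings in the logs.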

liad5h commented 2 years ago

I found out that only client agents running in Docker on my EC2 instances were failing, I guess because the address they register (the Docker bridge IP, 172.18.0.4 in the logs above) is not routable from the servers.

When I set the advertise_addr property in the Consul config file to the routable IP address of the instance, the issue was resolved.
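For reference, the fix amounts to adding advertise_addr alongside bind_addr in the client config above; the IP shown here is a hypothetical placeholder for the EC2 instance's routable private address:

```json
{
    "bind_addr": "0.0.0.0",
    "advertise_addr": "10.0.1.23"
}
```

With bind_addr left at 0.0.0.0 and no advertise_addr, the agent gossips the first private address it finds, which inside a Docker container is the bridge IP (172.18.0.4 above); the servers can open TCP streams back via NAT but their UDP probes to that address go nowhere, producing exactly the flapping seen in the logs.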