hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul on Kubernetes Deployment: Was able to connect to Consul_Server_1 over TCP but UDP probes failed, network may be misconfigured #16601

Open oxycash opened 1 year ago

oxycash commented 1 year ago

Overview of the Issue

Unable to connect Consul agents running on K8s to external Consul servers that run directly on VMs. We are not using the official Helm chart as of now.

Reproduction Steps

Install Consul in server mode on 3 VMs with the following server config:

{
    "addresses": {
        "dns": "127.0.0.1",
        "grpc": "127.0.0.1",
        "http": "127.0.0.1",
        "https": "127.0.0.1"
    },
    "advertise_addr": "{{ GetInterfaceIP \"ens192\" }}",
    "advertise_addr_wan": "{{ GetInterfaceIP \"ens192\" }}",
    "bind_addr": "{{ GetInterfaceIP \"ens192\" }}",
    "bootstrap": false,
    "bootstrap_expect": 3,
    "client_addr": "0.0.0.0",
    "data_dir": "/var/lib/consul",
    "datacenter": "dc1",
    "disable_update_check": true,
    "domain": "consul",
    "enable_local_script_checks": true,
    "enable_script_checks": true,
    "enable_syslog": true,
    "encrypt": "Some string",
    "encrypt_verify_incoming": true,
    "encrypt_verify_outgoing": true,
    "log_level": "INFO",
    "performance": {
        "leave_drain_time": "5s",
        "raft_multiplier": 1,
        "rpc_hold_timeout": "7s"
    },
    "ports": {
        "dns": 8600,
        "grpc": 8502,
        "http": 8500,
        "https": -1,
        "serf_lan": 8301,
        "serf_wan": 8302,
        "server": 8300
    },
    "raft_protocol": 3,
    "retry_interval": "30s",
    "retry_interval_wan": "30s",
    "retry_join": [
        "Consul_Server_1",
        "Consul_Server_2",
        "Consul_Server_3"
    ],
    "retry_max": 0,
    "retry_max_wan": 0,
    "server": true,
    "syslog_facility": "local0",
    "translate_wan_addrs": false,
    "ui_config": {
        "enabled": false
    }
}

Client config:

{
    "addresses": {
        "dns": "127.0.0.1",
        "grpc": "127.0.0.1",
        "http": "127.0.0.1",
        "https": "127.0.0.1"
    },
    "advertise_addr": "{{ GetInterfaceIP \"eth0\" }}",
    "advertise_addr_wan": "{{ GetInterfaceIP \"eth0\" }}",
    "bind_addr": "{{ GetInterfaceIP \"eth0\" }}",
    "client_addr": "127.0.0.1",
    "data_dir": "/var/lib/consul",
    "datacenter": "dc1",
    "disable_update_check": true,
    "domain": "consul",
    "enable_local_script_checks": true,
    "enable_script_checks": true,
    "enable_syslog": false,
    "encrypt": "some string",
    "encrypt_verify_incoming": true,
    "encrypt_verify_outgoing": true,
    "log_level": "INFO",
    "performance": {
        "leave_drain_time": "5s",
        "raft_multiplier": 1,
        "rpc_hold_timeout": "7s"
    },
    "ports": {
        "dns": 8600,
        "grpc": 8502,
        "http": 8500,
        "https": -1,
        "serf_lan": 8301,
        "serf_wan": 8302,
        "server": 8300
    },
    "raft_protocol": 3,
    "retry_interval": "30s",
    "retry_join": [
         "Consul_Server_1",
        "Consul_Server_2",
        "Consul_Server_3"
    ],
    "retry_max": 0,
    "server": false,
    "syslog_facility": "local0",
    "translate_wan_addrs": false,
    "ui_config": {
        "enabled": false
    }
}

Client Docker Image:

FROM consul:latest

EXPOSE 80 8080 443 5432 6432 8000-8350 8500-8700 53

COPY config.json /etc/consul.d/client/config.json

ENTRYPOINT consul agent -config-dir /etc/consul.d/client

Client K8s Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: consul-deployment
  labels:
    app: consul
spec:
  selector:
    matchLabels:
      app: consul
  replicas: 1
  template:
    metadata:
      labels:
        app: consul
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: DC
                operator: In
                values:
                - derby
      containers:
      - name: consul-container
        image:  consul_test:0.2
        ports:
          - containerPort: 8500
            name: ui-port
          - containerPort: 8400
            name: alt-port
          - containerPort: 53
            name: udp-port
          - containerPort: 8443
            name: https-port
          - containerPort: 8080
            name: http-port
          - containerPort: 8301
            protocol: UDP
            name: serflan
          - containerPort: 8302
            name: serfwan
          - containerPort: 8600
            name: consuldns
          - containerPort: 8300
            name: server
          - containerPort: 8502
            name: grpc
        volumeMounts:
        - name: consul-data
          mountPath: /data
      volumes:
        - name: consul-data
          emptyDir:
            sizeLimit: 5Gi
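
Side note on the manifest above: Serf LAN (port 8301) is used over both TCP and UDP, but only the UDP protocol is declared for it here. containerPort entries are largely informational in Kubernetes, so this alone is not the root cause, but declaring both protocols makes the intent explicit. A rough sketch, with illustrative serflan-tcp/serflan-udp names:

          # Serf LAN gossip uses 8301 over both TCP and UDP
          - containerPort: 8301
            protocol: TCP
            name: serflan-tcp
          - containerPort: 8301
            protocol: UDP
            name: serflan-udp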

Logs

2023-03-10T14:09:29.695Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to Consul_Server_3 over TCP but UDP probes failed, network may be misconfigured
2023-03-10T14:09:30.196Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed UDP ping: Consul_Server_1 (timeout reached)
2023-03-10T14:09:30.696Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to Consul_Server_1 over TCP but UDP probes failed, network may be misconfigured
2023-03-10T14:09:30.798Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with: Consul_Server_1 IP:8301
2023-03-10T14:09:30.799Z [WARN]  agent.client.memberlist.lan: memberlist: Refuting a suspect message (from: consul-deployment-59bf886df7-w88cx)
2023-03-10T14:09:31.197Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed UDP ping: Consul_Server_2 (timeout reached)
2023-03-10T14:09:31.696Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to Consul_Server_2 over TCP but UDP probes failed, network may be misconfigured
2023-03-10T14:09:32.197Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed UDP ping: Consul_Server_1 (timeout reached)
2023-03-10T14:09:32.697Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to Consul_Server_1 over TCP but UDP probes failed, network may be misconfigured
2023-03-10T14:09:33.198Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed UDP ping: Consul_Server_3 (timeout reached)
2023-03-10T14:09:33.698Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to Consul_Server_3 over TCP but UDP probes failed, network may be misconfigured
2023-03-10T14:09:34.199Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed UDP ping: Consul_Server_2 (timeout reached)
2023-03-10T14:09:34.699Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to Consul_Server_2 over TCP but UDP probes failed, network may be misconfigured
2023-03-10T14:09:35.200Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed UDP ping: Consul_Server_3 (timeout reached)

Expected behavior

The Consul client should join the Consul servers without errors.

Environment details

$ consul info   (client)

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 53f65dc3
        version = 1.15.0
        version_metadata =
consul:
        acl = disabled
        known_servers = 3
        server = false
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 54
        max_procs = 8
        os = linux
        version = go1.20.1
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 8
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 21696
        members = 4
        query_queue = 0
        query_time = 4
$ consul info   (server)

agent:
        check_monitors = 2
        check_ttls = 0
        checks = 4
        services = 2
build:
        prerelease =
        revision = 53f65dc3
        version = 1.15.0
        version_metadata =
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = true
        leader_addr = Consul_Server_1:8300
        server = true
raft:
        applied_index = 698354
        commit_index = 698354
        fsm_pending = 0
        last_contact = 0
        last_log_index = 698354
        last_log_term = 169
        last_snapshot_index = 688389
        last_snapshot_term = 168
        latest_configuration = [{Suffrage:Voter ID: } {Suffrage:Voter ID: } {Suffrage:Voter ID: }]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 169
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 186
        max_procs = 8
        os = linux
        version = go1.20.1
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 8
        failed = 1
        health_score = 0
        intent_queue = 0
        left = 1
        member_time = 21698
        members = 5
        query_queue = 0
        query_time = 4
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1964
        members = 3
        query_queue = 0
        query_time = 4
oxycash commented 1 year ago

As per https://developer.hashicorp.com/consul/docs/architecture#lan-gossip-pool, if UDP is not available the agent will fall back to TCP. Does this cause the Consul client status to frequently swing between alive and failed?

Because that is what is happening for us.

soupdiver commented 1 year ago

As per https://developer.hashicorp.com/consul/docs/architecture#lan-gossip-pool, if UDP is not available the agent will fall back to TCP. Does this cause the Consul client status to frequently swing between alive and failed?

Because that is what is happening for us.

I'm not running on k8s, just inside a Docker container, but I have the same issue.

oxycash commented 1 year ago

@soupdiver if using hostNetwork is fine for your requirements, it will work. Otherwise it's going to be a problem. You can also try advertising the node IP instead of the pod IP, which means running only one Consul container per node.
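
For reference, a rough sketch of what that could look like in the pod spec of the Deployment above, combining both suggestions. It assumes the same consul_test:0.2 image; the HOST_IP variable name is just illustrative, and only standard Kubernetes fields (hostNetwork, the downward API fieldRef, $(VAR) command expansion) are used:

    spec:
      # Share the node's network namespace so the servers on the VMs can reach the agent directly
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: consul-container
        image: consul_test:0.2
        env:
        # Node IP from the downward API, advertised instead of the pod IP
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        command: ["consul", "agent", "-config-dir=/etc/consul.d/client", "-advertise=$(HOST_IP)"]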

soupdiver commented 1 year ago

@soupdiver if using hostNetwork is fine for your requirements, it will work. Otherwise it's going to be a problem. You can also try advertising the node IP instead of the pod IP, which means running only one Consul container per node.

Yeah, using host network works, but what is the underlying issue? Even if I expose the serf LAN port over both TCP and UDP the error still shows up.

oxycash commented 1 year ago

As per my deep dive, Docker has limitations in how it handles UDP.

urosgruber commented 8 months ago

Same issue here: three native servers without any Docker or VM in between. I've tested the connections with nc and all is good, but the problem with Consul persists.

mhdan commented 7 months ago

I have the same problem with Consul servers running on k8s and Consul clients outside of k8s with Docker. The problem was related to the Docker limitation with UDP. The only workaround I found was running the Docker clients with the hostNetwork: true option.