hashicorp / consul-helm

Helm chart to install Consul and other associated components.
Mozilla Public License 2.0

0.32.1/1.10.0 WAN Federation consul-server-acl-init job failing to execute resulting in failed helm installation #1039

Closed · shellfu closed this issue 3 years ago

shellfu commented 3 years ago

Overview of the Issue

Installation of the primary datacenter went well and by the book. CHART: consul-0.32.1, CONSUL: 1.10.0

Installation of the secondary datacenter failed: with federation enabled, the consul-server-acl-init job in the secondary datacenter fails to execute. CHART: consul-0.32.1, CONSUL: 1.10.0

Reproduction Steps

1. Installation of Consul in the Primary DC

helm install consul -n consul -f $BELOW_VALUES hashicorp/consul --set global.name=consul

client:
  enabled: true
  grpc: true
connectInject:
  aclBindingRuleSelector: serviceaccount.name!=default
  default: false
  enabled: true
  metrics:
    defaultEnableMerging: true
    defaultEnabled: true
    defaultMergedMetricsPort: 20100
    defaultPrometheusScrapePath: /metrics
    defaultPrometheusScrapePort: 20200
  transparentProxy:
    defaultEnabled: true
    defaultOverwriteProbes: true
controller:
  enabled: true
dns:
  enabled: true
global:
  acls:
    createReplicationToken: true
    manageSystemACLs: true
  datacenter: primary
  enabled: true
  federation:
    createFederationSecret: true
    enabled: true
  gossipEncryption:
    secretKey: key
    secretName: consul-gossip-encryption-key
  image: hashicorp/consul:1.10.0
  imageEnvoy: envoyproxy/envoy-alpine:v1.18.3
  imageK8S: hashicorp/consul-k8s:0.26.0
  logJSON: true
  metrics:
    agentMetricsRetentionTime: 1m
    enableAgentMetrics: false
    enableGatewayMetrics: true
    enabled: true
  name: consul
  tls:
    enableAutoEncrypt: true
    enabled: true
    httpsOnly: true
    serverAdditionalDNSSANs:
    - '*.consul'
    - '*.svc.cluster.local'
    - '*.de.example.net'
    verify: false
meshGateway:
  enabled: true
  service:
    enabled: true
    port: 443
    type: LoadBalancer
  wanAddress:
    port: 443
    source: Service
server:
  bootstrapExpect: 5
  connect: true
  disruptionBudget:
    enabled: true
    maxUnavailable: 2
  enabled: true
  extraConfig: |
    {
      "primary_datacenter": "primary",
      "performance": {
        "raft_multiplier": 3
      },
      "dns_config": {
        "allow_stale": true,
        "cache_max_age": "10s",
        "enable_additional_node_meta_txt": false,
        "node_ttl": "1m",
        "soa": {
          "expire": 86400,
          "min_ttl": 30,
          "refresh": 3600,
          "retry": 600
        },
        "use_cache": true
      },
      "telemetry": {
        "prometheus_retention_time": "30s",
        "dogstatsd_addr": "localhost:8125",
        "disable_hostname": true
      },
      "ui_config": {
        "dashboard_url_templates": {
          "service": "https://grafana-1.monitoring.example.net:3000/d/lDlaj-NGz/service-overview?orgId=1&var-service={{ "{{" }}Service.Name}}&var-namespace={{ "{{" }}Service.Namespace}}&var-dc={{ "{{" }}Datacenter}}"
        }
      }
    }
  replicas: 5
  resources:
    limits:
      cpu: 500m
      memory: 10Gi
    requests:
      cpu: 500m
      memory: 10Gi
  storage: 10Gi
syncCatalog:
  default: true
  enabled: true
  nodePortSyncType: ExternalFirst
  syncClusterIPServices: true
  toConsul: true
  toK8S: true
ui:
  enabled: true
  metrics:
    baseURL: "http://mon-kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090"
    enabled: true
    provider: prometheus
  service:
    enabled: true
    type: NodePort

2. Create the Proxy Default in Primary

kubectl apply -n consul -f $BELOW_YAML

apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  config:
    envoy_prometheus_bind_addr: 0.0.0.0:9102
    protocol: http
  meshGateway: 
    mode: 'local'
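
If I recall correctly, the consul-k8s CRDs expose a SYNCED status column, so you can confirm the controller accepted the resource and wrote it to Consul (treat this as a sketch):

# SYNCED=True indicates the controller has applied the ProxyDefaults in Consul.
kubectl get proxydefaults global -n consul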

3. Export consul-federation from Primary and Import to Secondary

kubectl get secret -n consul consul-federation -oyaml > secret.consul-federation.yaml
kubectl config use-context secondary
kubectl apply -n consul -f secret.consul-federation.yaml
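
Before installing the secondary, it is worth verifying that the imported secret carries the keys the values file below references (caCert, caKey, gossipEncryptionKey, replicationToken, serverConfigJSON):

# describe lists the data key names and sizes without decoding their values.
kubectl describe secret consul-federation -n consul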

4. Installation of Consul in Secondary DC

helm install consul -n consul -f $BELOW_VALUES hashicorp/consul --set global.name=consul

client:
  enabled: true
  grpc: true
connectInject:
  aclBindingRuleSelector: serviceaccount.name!=default
  default: false
  enabled: true
  metrics:
    defaultEnableMerging: true
    defaultEnabled: true
    defaultMergedMetricsPort: 20100
    defaultPrometheusScrapePath: /metrics
    defaultPrometheusScrapePort: 20200
  transparentProxy:
    defaultEnabled: true
    defaultOverwriteProbes: true
controller:
  enabled: true
dns:
  enabled: true
global:
  acls:
    manageSystemACLs: true
    replicationToken:
      secretKey: replicationToken
      secretName: consul-federation
  datacenter: secondary
  enabled: true
  federation:
    enabled: true
  gossipEncryption:
    secretKey: gossipEncryptionKey
    secretName: consul-federation
  image: hashicorp/consul:1.10.0
  imageEnvoy: envoyproxy/envoy-alpine:v1.18.3
  imageK8S: hashicorp/consul-k8s:0.26.0
  logJSON: true
  metrics:
    agentMetricsRetentionTime: 1m
    enableAgentMetrics: false
    enableGatewayMetrics: true
    enabled: true
  name: consul
  tls:
    caCert:
      secretKey: caCert
      secretName: consul-federation
    caKey:
      secretKey: caKey
      secretName: consul-federation
    enableAutoEncrypt: true
    enabled: true
    httpsOnly: true
    serverAdditionalDNSSANs:
    - '*.consul'
    - '*.svc.cluster.local'
    - '*.de.example.net'
    verify: false
meshGateway:
  enabled: true
  service:
    enabled: true
    port: 443
    type: LoadBalancer
  wanAddress:
    port: 443
    source: Service
server:
  bootstrapExpect: 5
  connect: true
  disruptionBudget:
    enabled: true
    maxUnavailable: 2
  enabled: true
  # Here we're including the server config exported from the primary
  # via the federation secret. This config includes the addresses of
  # the primary datacenter's mesh gateways so Consul can begin federation.
  extraVolumes:
    - type: secret
      name: consul-federation
      items:
        - key: serverConfigJSON
          path: config.json
      load: true
  extraConfig: |
    {
      "primary_datacenter": "primary",
      "performance": {
        "raft_multiplier": 3
      },
      "dns_config": {
        "allow_stale": true,
        "cache_max_age": "10s",
        "enable_additional_node_meta_txt": false,
        "node_ttl": "1m",
        "soa": {
          "expire": 86400,
          "min_ttl": 30,
          "refresh": 3600,
          "retry": 600
        },
        "use_cache": true
      },
      "telemetry": {
        "prometheus_retention_time": "30s",
        "dogstatsd_addr": "localhost:8125",
        "disable_hostname": true
      },
      "ui_config": {
        "dashboard_url_templates": {
          "service": "https://grafana-1.monitoring.example.net:3000/d/lDlaj-NGz/service-overview?orgId=1&var-service={{ "{{" }}Service.Name}}&var-namespace={{ "{{" }}Service.Namespace}}&var-dc={{ "{{" }}Datacenter}}"
        }
      }
    }
  replicas: 5
  resources:
    limits:
      cpu: 500m
      memory: 10Gi
    requests:
      cpu: 500m
      memory: 10Gi
  storage: 10Gi
syncCatalog:
  default: true
  enabled: true
  nodePortSyncType: ExternalFirst
  syncClusterIPServices: true
  toConsul: true
  toK8S: true
ui:
  enabled: true
  metrics:
    baseURL: "http://mon-kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090"
    enabled: true
    provider: prometheus
  service:
    enabled: true
    type: NodePort
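
Once the secondary install is applied, the failing job can be followed directly (a sketch; the job name matches the consul-server-acl-init pods shown below):

# Stream the ACL init job logs while the secondary bootstraps.
kubectl logs -n consul job/consul-server-acl-init -f

# Watch which components remain stuck in Init.
kubectl get pods -n consul -w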

Expected behavior

I expect the secondary datacenter to come online, federated with the primary.
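
Once federation is working, a quick sanity check would be (hedged: with httpsOnly enabled, the CLI inside the pod needs to be pointed at the TLS port):

# Servers from both datacenters should appear in the WAN pool.
kubectl exec -n consul consul-server-0 -- env \
  CONSUL_HTTP_ADDR=https://localhost:8501 CONSUL_HTTP_SSL_VERIFY=false \
  consul members -wan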

Environment details

Additional Context


$ kubectl get pods -n consul
NAME                                                        READY   STATUS     RESTARTS   AGE
consul-872wq                                                0/1     Init:0/1   0          34m
consul-connect-injector-webhook-deployment-8cf76674-cks2h   0/1     Init:0/2   0          34m
consul-connect-injector-webhook-deployment-8cf76674-pm6sp   0/1     Init:0/2   0          34m
consul-controller-6d998c67c8-f958f                          0/1     Init:0/2   0          34m
consul-lc78g                                                0/1     Init:0/1   0          34m
consul-mesh-gateway-8544494849-mhcpd                        0/2     Init:1/3   0          34m
consul-mesh-gateway-8544494849-tv46x                        0/2     Init:1/3   0          34m
consul-mvggh                                                0/1     Init:0/1   0          34m
consul-r797z                                                0/1     Init:0/1   0          34m
consul-server-0                                             1/1     Running    0          34m
consul-server-1                                             1/1     Running    0          34m
consul-server-2                                             1/1     Running    0          34m
consul-server-3                                             1/1     Running    0          34m
consul-server-4                                             1/1     Running    0          34m
consul-server-acl-init-2k4z8                                1/1     Running    0          9m28s
consul-server-acl-init-4wrtt                                0/1     Error      0          19m
consul-server-acl-init-9sf76                                0/1     Error      0          29m
consul-sync-catalog-7cdc945454-b96tn                        0/1     Init:0/2   0          34m
consul-vb9zv                                                0/1     Init:0/1   0          34m
consul-vnx4t                                                0/1     Init:0/1   0          34m
consul-webhook-cert-manager-66bc8fb64f-wrmq5                1/1     Running    0          34m

Additionally, here are the logs from the ACL init job, captured after the servers successfully came online.

kubectl logs -n consul consul-server-acl-init-2k4z8

2021-07-15T18:59:35.664Z [INFO]  ACL replication is enabled so skipping Consul server ACL bootstrapping
2021-07-15T18:59:36.069Z [ERROR] Failure: calling /agent/self to get datacenter: err="Unexpected response code: 403 (ACL not found)"
2021-07-15T18:59:36.069Z [INFO]  Retrying in 1s
2021-07-15T18:59:37.070Z [ERROR] Failure: calling /agent/self to get datacenter: err="Unexpected response code: 403 (ACL not found)"
2021-07-15T18:59:37.070Z [INFO]  Retrying in 1s
2021-07-15T18:59:38.072Z [ERROR] Failure: calling /agent/self to get datacenter: err="Unexpected response code: 403 (ACL not found)"
2021-07-15T18:59:38.072Z [INFO]  Retrying in 1s

Since this job never completes successfully, the secondary datacenter cannot come online due to missing secrets such as the client token, mesh gateway token, etc.
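
To see which token secrets have been created so far (the names assume the chart's usual <name>-<component>-acl-token pattern, so treat this as a sketch):

# Missing entries here correspond to the components stuck in Init above.
kubectl get secrets -n consul | grep acl-token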

lkysow commented 3 years ago

Hi, can we get the server logs too, please? For server-acl-init to succeed, ACL replication must be working, and that happens on the servers.
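
If it helps, replication status can be read from a secondary server via the /v1/acl/replication endpoint (a sketch; assumes curl is available in the pod):

# "Enabled": true and "Running": true indicate healthy replication from the primary.
kubectl exec -n consul consul-server-0 -- \
  curl -sk https://localhost:8501/v1/acl/replication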

shellfu commented 3 years ago

Logs from Consul Server 0

==> Starting Consul agent...
           Version: '1.10.0'
           Node ID: '102b54b4-e8c5-e9a0-0fa4-c9e96252ff84'
         Node name: 'consul-server-0'
        Datacenter: 'secondary' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: -1, HTTPS: 8501, gRPC: -1, DNS: 8600)
      Cluster Addr: 10.200.72.45 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: true, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: true

==> Log data will now stream in as it occurs:

2021-07-15T18:36:59.251Z [WARN]  agent: bootstrap_expect > 0: expecting 5 servers
2021-07-15T18:36:59.251Z [WARN]  agent: if auto_encrypt.allow_tls is turned on, either verify_incoming or verify_incoming_rpc should be enabled. It is necessary to turn it off during a migration to TLS, but it should definitely be turned on afterwards.
2021-07-15T18:36:59.350Z [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 5 servers
2021-07-15T18:36:59.350Z [WARN]  agent.auto_config: if auto_encrypt.allow_tls is turned on, either verify_incoming or verify_incoming_rpc should be enabled. It is necessary to turn it off during a migration to TLS, but it should definitely be turned on afterwards.
2021-07-15T18:36:59.376Z [INFO]  agent.server.gateway_locator: will dial the primary datacenter using our local mesh gateways if possible
2021-07-15T18:36:59.417Z [INFO]  agent.server.raft: initial configuration: index=0 servers=[]
2021-07-15T18:36:59.417Z [INFO]  agent.server.raft: entering follower state: follower="Node at 10.200.72.45:8300 [Follower]" leader=
2021-07-15T18:36:59.418Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-server-0.secondary 10.200.72.45
2021-07-15T18:36:59.419Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: consul-server-0 10.200.72.45
2021-07-15T18:36:59.419Z [INFO]  agent.router: Initializing LAN area manager
2021-07-15T18:36:59.419Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {10.200.72.45:8300 0 consul-server-0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.200.72.45:8300: operation was canceled". Reconnecting...
2021-07-15T18:36:59.419Z [INFO]  agent.server: Adding LAN server: server="consul-server-0 (Addr: tcp/10.200.72.45:8300) (DC: secondary)"
2021-07-15T18:36:59.419Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=udp
2021-07-15T18:36:59.419Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-server-0.secondary area=wan
2021-07-15T18:36:59.420Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {10.200.72.45:8300 0 consul-server-0.secondary <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.200.72.45:8300: operation was canceled". Reconnecting...
2021-07-15T18:36:59.420Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=tcp
2021-07-15T18:36:59.420Z [INFO]  agent: Starting server: address=0.0.0.0:8501 network=tcp protocol=https
2021-07-15T18:36:59.420Z [WARN]  agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-07-15T18:36:59.420Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2021-07-15T18:36:59.420Z [INFO]  agent: Joining cluster...: cluster=LAN
2021-07-15T18:36:59.420Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-server-0.consul-server.consul.svc:8301, consul-server-1.consul-server.consul.svc:8301, consul-server-2.consul-server.consul.svc:8301, consul-server-3.consul-server.consul.svc:8301, consul-server-4.consul-server.consul.svc:8301]
2021-07-15T18:36:59.420Z [INFO]  agent: started state syncer
2021-07-15T18:36:59.421Z [INFO]  agent: Consul agent running!
2021-07-15T18:36:59.421Z [INFO]  agent: Refreshing mesh gateways is supported for the following discovery methods: discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2021-07-15T18:36:59.421Z [INFO]  agent: Refreshing mesh gateways...
2021-07-15T18:36:59.421Z [INFO]  agent.server.gateway_locator: updated fallback list of primary mesh gateways: mesh_gateways=[10.61.69.111:443]
2021-07-15T18:36:59.421Z [INFO]  agent: Refreshing mesh gateways completed
2021-07-15T18:36:59.421Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=WAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2021-07-15T18:36:59.421Z [INFO]  agent: Joining cluster...: cluster=WAN
2021-07-15T18:36:59.421Z [INFO]  agent: (WAN) joining: wan_addresses=[*.primary/192.0.2.2]
2021-07-15T18:36:59.421Z [WARN]  agent: (WAN) couldn't join: number_of_nodes=0 error="1 error occurred:
    * Failed to join 192.0.2.2: Remote DC has no server currently reachable

"
2021-07-15T18:36:59.421Z [WARN]  agent: Join cluster failed, will retry: cluster=WAN retry_interval=30s error=<nil>
2021-07-15T18:36:59.466Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-server-0.consul-server.consul.svc:8301: lookup consul-server-0.consul-server.consul.svc on 10.100.200.10:53: no such host
2021-07-15T18:36:59.498Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: consul-server-3 10.200.82.70
2021-07-15T18:36:59.499Z [INFO]  agent.server: Adding LAN server: server="consul-server-3 (Addr: tcp/10.200.82.70:8300) (DC: secondary)"
2021-07-15T18:36:59.499Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: consul-server-4 10.200.91.50
2021-07-15T18:36:59.499Z [INFO]  agent.server: Adding LAN server: server="consul-server-4 (Addr: tcp/10.200.91.50:8300) (DC: secondary)"
2021-07-15T18:36:59.499Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: consul-server-2 10.200.23.52
2021-07-15T18:36:59.500Z [INFO]  agent.server: Adding LAN server: server="consul-server-2 (Addr: tcp/10.200.23.52:8300) (DC: secondary)"
2021-07-15T18:36:59.500Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: consul-server-1 10.200.27.53
2021-07-15T18:36:59.500Z [INFO]  agent.server: Adding LAN server: server="consul-server-1 (Addr: tcp/10.200.27.53:8300) (DC: secondary)"
2021-07-15T18:36:59.501Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-server-1.secondary 10.200.27.53
2021-07-15T18:36:59.502Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-server-2.secondary 10.200.23.52
2021-07-15T18:36:59.502Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-server-1.secondary area=wan
2021-07-15T18:36:59.502Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-server-4.secondary 10.200.91.50
2021-07-15T18:36:59.502Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-server-2.secondary area=wan
2021-07-15T18:36:59.502Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-server-3.secondary 10.200.82.70
2021-07-15T18:36:59.502Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-server-4.secondary area=wan
2021-07-15T18:36:59.503Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-server-3.secondary area=wan
2021-07-15T18:36:59.507Z [INFO]  agent.server: Found expected number of peers, attempting bootstrap: peers=10.200.23.52:8300,10.200.27.53:8300,10.200.72.45:8300,10.200.82.70:8300,10.200.91.50:8300
2021-07-15T18:36:59.597Z [INFO]  agent: (LAN) joined: number_of_nodes=4
2021-07-15T18:36:59.597Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=4
2021-07-15T18:37:00.320Z [WARN]  agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=primary method=ACL.GetPolicy
2021-07-15T18:37:00.321Z [INFO]  agent.server: New leader elected: payload=consul-server-1
2021-07-15T18:37:00.321Z [WARN]  agent: Node info update blocked by ACLs: node=102b54b4-e8c5-e9a0-0fa4-c9e96252ff84 accessorID=legacy-token
2021-07-15T18:37:00.459Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:01.461Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:02.462Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:03.105Z [WARN]  agent: Node info update blocked by ACLs: node=102b54b4-e8c5-e9a0-0fa4-c9e96252ff84 accessorID=legacy-token
2021-07-15T18:37:03.464Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:04.466Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:05.467Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:06.468Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:07.471Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:08.472Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:08.871Z [ERROR] agent: Failed to check for updates: error="Get "https://checkpoint-api.hashicorp.com/v1/check/consul?arch=amd64&os=linux&signature=8bbe665c-08b6-09cb-005f-5930e2ed4b87&version=1.10.0": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-07-15T18:37:09.474Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:10.475Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:11.477Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:12.478Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:12.885Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-server-0.primary 10.200.67.28
2021-07-15T18:37:12.885Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-server-4.primary 10.200.93.40
2021-07-15T18:37:12.885Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-server-0.primary area=wan
2021-07-15T18:37:12.885Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-server-3.primary 10.200.37.38
2021-07-15T18:37:12.886Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-server-4.primary area=wan
2021-07-15T18:37:12.886Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-server-3.primary area=wan
2021-07-15T18:37:12.886Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-server-1.primary 10.200.41.47
2021-07-15T18:37:12.886Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-server-2.primary 10.200.65.46
2021-07-15T18:37:12.886Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-server-1.primary area=wan
2021-07-15T18:37:12.886Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-server-2.primary area=wan
2021-07-15T18:37:13.007Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.65.46:8302: read tcp 10.200.72.45:48552->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:13.480Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:13.509Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.37.38:8302: read tcp 10.200.72.45:48564->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:13.596Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.67.28:8302: read tcp 10.200.72.45:48566->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:14.007Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.93.40:8302: read tcp 10.200.72.45:48572->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:14.481Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:14.507Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.41.47:8302: read tcp 10.200.72.45:48582->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:14.594Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.65.46:8302: read tcp 10.200.72.45:48584->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:15.007Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.41.47:8302: read tcp 10.200.72.45:48586->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:15.096Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.37.38:8302: read tcp 10.200.72.45:48588->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:15.483Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:15.508Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.37.38:8302: read tcp 10.200.72.45:48594->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:16.093Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=legacy-token
2021-07-15T18:37:16.484Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:17.486Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:18.487Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:19.006Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.65.46:8302: read tcp 10.200.72.45:48608->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:19.094Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.93.40:8302: read tcp 10.200.72.45:48610->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:19.183Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.41.47:8302: read tcp 10.200.72.45:48612->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:19.489Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"

After the above, the output repeats.

shellfu commented 3 years ago

Looking at the above logs, this looks suspect:

2021-07-15T18:36:59.421Z [INFO]  agent: Joining cluster...: cluster=WAN
2021-07-15T18:36:59.421Z [INFO]  agent: (WAN) joining: wan_addresses=[*.primary/192.0.2.2]
2021-07-15T18:36:59.421Z [WARN]  agent: (WAN) couldn't join: number_of_nodes=0 error="1 error occurred:
    * Failed to join 192.0.2.2: Remote DC has no server currently reachable

I'm checking connectivity between the two regions. If that job relies on the servers being up and replicating ACLs, then the failure makes sense.

Though I do see the same output in my other two clusters running 1.8.10, hmm...

shellfu commented 3 years ago

Aha, and on the primary I see:

[WARN] agent.server.rpc: RPC request to DC is currently failing as no server can be reached: datacenter=secondary

Maybe remove the BUG tag until some more digging is done. This may be a connectivity issue. I will try this exact install in two new clusters and report back.

shellfu commented 3 years ago

Hmm, nc verified connectivity between the two Kubernetes clusters.

More digging to do to see if I can uncover any more details.

lkysow commented 3 years ago

The sequence is:

  1. Secondary servers connect with the primary mesh gateways and start ACL replication.
  2. This allows server-acl-init in the secondary to complete.
  3. server-acl-init generates the mesh-gateway ACL token.
  4. Mesh gateways start.
  5. Secondary servers tell the primary servers the location of their mesh gateways.
  6. Up until this point the primary has not been able to reach the secondary, so it will be complaining, but that's expected.
  7. Once the primary can talk to the secondary's mesh gateways, the logs will stop erroring.

So at step 1, check if the secondary server pods can connect with the primary's mesh gateways.
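
Using the primary mesh gateway address from the server logs above (10.61.69.111:443), a minimal reachability check from a secondary server pod might look like this (assuming an nc build with -z support is present in the image):

# Exit code 0 means a TCP connection to the primary mesh gateway succeeded.
kubectl exec -n consul consul-server-0 -- nc -vz 10.61.69.111 443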

shellfu commented 3 years ago

I have connected with nc, and verified the route from the secondary to the primary's nodes with traceroute.

Going to try a few more things to obtain more data.

shellfu commented 3 years ago

The sequence above is super helpful. I am certainly facing some sort of network issue between the clusters, as connectivity was able to be established.

The Helm chart install succeeded after I logged into the init job container and established a connection with the mesh gateway via nc; then everything went as I would expect. I am closing this issue, as it is verified to be a network problem and not a bug in the Helm chart.