Closed shellfu closed 3 years ago
Hi, can we get the server logs too please. For acl init to succeed the ACL replication must be working which is done on the servers.
Logs from Consul Server 0
==> Starting Consul agent...
Version: '1.10.0'
Node ID: '102b54b4-e8c5-e9a0-0fa4-c9e96252ff84'
Node name: 'consul-server-0'
Datacenter: 'secondary' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: [0.0.0.0] (HTTP: -1, HTTPS: 8501, gRPC: -1, DNS: 8600)
Cluster Addr: 10.200.72.45 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: true, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: true
==> Log data will now stream in as it occurs:
2021-07-15T18:36:59.251Z [WARN] agent: bootstrap_expect > 0: expecting 5 servers
2021-07-15T18:36:59.251Z [WARN] agent: if auto_encrypt.allow_tls is turned on, either verify_incoming or verify_incoming_rpc should be enabled. It is necessary to turn it off during a migration to TLS, but it should definitely be turned on afterwards.
2021-07-15T18:36:59.350Z [WARN] agent.auto_config: bootstrap_expect > 0: expecting 5 servers
2021-07-15T18:36:59.350Z [WARN] agent.auto_config: if auto_encrypt.allow_tls is turned on, either verify_incoming or verify_incoming_rpc should be enabled. It is necessary to turn it off during a migration to TLS, but it should definitely be turned on afterwards.
2021-07-15T18:36:59.376Z [INFO] agent.server.gateway_locator: will dial the primary datacenter using our local mesh gateways if possible
2021-07-15T18:36:59.417Z [INFO] agent.server.raft: initial configuration: index=0 servers=[]
2021-07-15T18:36:59.417Z [INFO] agent.server.raft: entering follower state: follower="Node at 10.200.72.45:8300 [Follower]" leader=
2021-07-15T18:36:59.418Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: consul-server-0.secondary 10.200.72.45
2021-07-15T18:36:59.419Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: consul-server-0 10.200.72.45
2021-07-15T18:36:59.419Z [INFO] agent.router: Initializing LAN area manager
2021-07-15T18:36:59.419Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.200.72.45:8300 0 consul-server-0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.200.72.45:8300: operation was canceled". Reconnecting...
2021-07-15T18:36:59.419Z [INFO] agent.server: Adding LAN server: server="consul-server-0 (Addr: tcp/10.200.72.45:8300) (DC: secondary)"
2021-07-15T18:36:59.419Z [INFO] agent: Started DNS server: address=0.0.0.0:8600 network=udp
2021-07-15T18:36:59.419Z [INFO] agent.server: Handled event for server in area: event=member-join server=consul-server-0.secondary area=wan
2021-07-15T18:36:59.420Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.200.72.45:8300 0 consul-server-0.secondary <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.200.72.45:8300: operation was canceled". Reconnecting...
2021-07-15T18:36:59.420Z [INFO] agent: Started DNS server: address=0.0.0.0:8600 network=tcp
2021-07-15T18:36:59.420Z [INFO] agent: Starting server: address=0.0.0.0:8501 network=tcp protocol=https
2021-07-15T18:36:59.420Z [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-07-15T18:36:59.420Z [INFO] agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2021-07-15T18:36:59.420Z [INFO] agent: Joining cluster...: cluster=LAN
2021-07-15T18:36:59.420Z [INFO] agent: (LAN) joining: lan_addresses=[consul-server-0.consul-server.consul.svc:8301, consul-server-1.consul-server.consul.svc:8301, consul-server-2.consul-server.consul.svc:8301, consul-server-3.consul-server.consul.svc:8301, consul-server-4.consul-server.consul.svc:8301]
2021-07-15T18:36:59.420Z [INFO] agent: started state syncer
2021-07-15T18:36:59.421Z [INFO] agent: Consul agent running!
2021-07-15T18:36:59.421Z [INFO] agent: Refreshing mesh gateways is supported for the following discovery methods: discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2021-07-15T18:36:59.421Z [INFO] agent: Refreshing mesh gateways...
2021-07-15T18:36:59.421Z [INFO] agent.server.gateway_locator: updated fallback list of primary mesh gateways: mesh_gateways=[10.61.69.111:443]
2021-07-15T18:36:59.421Z [INFO] agent: Refreshing mesh gateways completed
2021-07-15T18:36:59.421Z [INFO] agent: Retry join is supported for the following discovery methods: cluster=WAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2021-07-15T18:36:59.421Z [INFO] agent: Joining cluster...: cluster=WAN
2021-07-15T18:36:59.421Z [INFO] agent: (WAN) joining: wan_addresses=[*.primary/192.0.2.2]
2021-07-15T18:36:59.421Z [WARN] agent: (WAN) couldn't join: number_of_nodes=0 error="1 error occurred:
* Failed to join 192.0.2.2: Remote DC has no server currently reachable
"
2021-07-15T18:36:59.421Z [WARN] agent: Join cluster failed, will retry: cluster=WAN retry_interval=30s error=<nil>
2021-07-15T18:36:59.466Z [WARN] agent.server.memberlist.lan: memberlist: Failed to resolve consul-server-0.consul-server.consul.svc:8301: lookup consul-server-0.consul-server.consul.svc on 10.100.200.10:53: no such host
2021-07-15T18:36:59.498Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: consul-server-3 10.200.82.70
2021-07-15T18:36:59.499Z [INFO] agent.server: Adding LAN server: server="consul-server-3 (Addr: tcp/10.200.82.70:8300) (DC: secondary)"
2021-07-15T18:36:59.499Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: consul-server-4 10.200.91.50
2021-07-15T18:36:59.499Z [INFO] agent.server: Adding LAN server: server="consul-server-4 (Addr: tcp/10.200.91.50:8300) (DC: secondary)"
2021-07-15T18:36:59.499Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: consul-server-2 10.200.23.52
2021-07-15T18:36:59.500Z [INFO] agent.server: Adding LAN server: server="consul-server-2 (Addr: tcp/10.200.23.52:8300) (DC: secondary)"
2021-07-15T18:36:59.500Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: consul-server-1 10.200.27.53
2021-07-15T18:36:59.500Z [INFO] agent.server: Adding LAN server: server="consul-server-1 (Addr: tcp/10.200.27.53:8300) (DC: secondary)"
2021-07-15T18:36:59.501Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: consul-server-1.secondary 10.200.27.53
2021-07-15T18:36:59.502Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: consul-server-2.secondary 10.200.23.52
2021-07-15T18:36:59.502Z [INFO] agent.server: Handled event for server in area: event=member-join server=consul-server-1.secondary area=wan
2021-07-15T18:36:59.502Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: consul-server-4.secondary 10.200.91.50
2021-07-15T18:36:59.502Z [INFO] agent.server: Handled event for server in area: event=member-join server=consul-server-2.secondary area=wan
2021-07-15T18:36:59.502Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: consul-server-3.secondary 10.200.82.70
2021-07-15T18:36:59.502Z [INFO] agent.server: Handled event for server in area: event=member-join server=consul-server-4.secondary area=wan
2021-07-15T18:36:59.503Z [INFO] agent.server: Handled event for server in area: event=member-join server=consul-server-3.secondary area=wan
2021-07-15T18:36:59.507Z [INFO] agent.server: Found expected number of peers, attempting bootstrap: peers=10.200.23.52:8300,10.200.27.53:8300,10.200.72.45:8300,10.200.82.70:8300,10.200.91.50:8300
2021-07-15T18:36:59.597Z [INFO] agent: (LAN) joined: number_of_nodes=4
2021-07-15T18:36:59.597Z [INFO] agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=4
2021-07-15T18:37:00.320Z [WARN] agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=primary method=ACL.GetPolicy
2021-07-15T18:37:00.321Z [INFO] agent.server: New leader elected: payload=consul-server-1
2021-07-15T18:37:00.321Z [WARN] agent: Node info update blocked by ACLs: node=102b54b4-e8c5-e9a0-0fa4-c9e96252ff84 accessorID=legacy-token
2021-07-15T18:37:00.459Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:01.461Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:02.462Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:03.105Z [WARN] agent: Node info update blocked by ACLs: node=102b54b4-e8c5-e9a0-0fa4-c9e96252ff84 accessorID=legacy-token
2021-07-15T18:37:03.464Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:04.466Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:05.467Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:06.468Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:07.471Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:08.472Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:08.871Z [ERROR] agent: Failed to check for updates: error="Get "https://checkpoint-api.hashicorp.com/v1/check/consul?arch=amd64&os=linux&signature=8bbe665c-08b6-09cb-005f-5930e2ed4b87&version=1.10.0": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-07-15T18:37:09.474Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:10.475Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:11.477Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:12.478Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:12.885Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: consul-server-0.primary 10.200.67.28
2021-07-15T18:37:12.885Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: consul-server-4.primary 10.200.93.40
2021-07-15T18:37:12.885Z [INFO] agent.server: Handled event for server in area: event=member-join server=consul-server-0.primary area=wan
2021-07-15T18:37:12.885Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: consul-server-3.primary 10.200.37.38
2021-07-15T18:37:12.886Z [INFO] agent.server: Handled event for server in area: event=member-join server=consul-server-4.primary area=wan
2021-07-15T18:37:12.886Z [INFO] agent.server: Handled event for server in area: event=member-join server=consul-server-3.primary area=wan
2021-07-15T18:37:12.886Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: consul-server-1.primary 10.200.41.47
2021-07-15T18:37:12.886Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: consul-server-2.primary 10.200.65.46
2021-07-15T18:37:12.886Z [INFO] agent.server: Handled event for server in area: event=member-join server=consul-server-1.primary area=wan
2021-07-15T18:37:12.886Z [INFO] agent.server: Handled event for server in area: event=member-join server=consul-server-2.primary area=wan
2021-07-15T18:37:13.007Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.65.46:8302: read tcp 10.200.72.45:48552->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:13.480Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:13.509Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.37.38:8302: read tcp 10.200.72.45:48564->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:13.596Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.67.28:8302: read tcp 10.200.72.45:48566->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:14.007Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.93.40:8302: read tcp 10.200.72.45:48572->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:14.481Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:14.507Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.41.47:8302: read tcp 10.200.72.45:48582->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:14.594Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.65.46:8302: read tcp 10.200.72.45:48584->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:15.007Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.41.47:8302: read tcp 10.200.72.45:48586->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:15.096Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.37.38:8302: read tcp 10.200.72.45:48588->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:15.483Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:15.508Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.37.38:8302: read tcp 10.200.72.45:48594->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:16.093Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=legacy-token
2021-07-15T18:37:16.484Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:17.486Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:18.487Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
2021-07-15T18:37:19.006Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.65.46:8302: read tcp 10.200.72.45:48608->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:19.094Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.93.40:8302: read tcp 10.200.72.45:48610->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:19.183Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.200.41.47:8302: read tcp 10.200.72.45:48612->10.61.69.111:443: read: connection reset by peer
2021-07-15T18:37:19.489Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/self from=10.200.27.52:37474 error="Permission denied"
After the above it repeats.
Looking at the above logs this looks suspect
2021-07-15T18:36:59.421Z [INFO] agent: Joining cluster...: cluster=WAN
2021-07-15T18:36:59.421Z [INFO] agent: (WAN) joining: wan_addresses=[*.primary/192.0.2.2]
2021-07-15T18:36:59.421Z [WARN] agent: (WAN) couldn't join: number_of_nodes=0 error="1 error occurred:
* Failed to join 192.0.2.2: Remote DC has no server currently reachable
Checking connectivity between the two regions. If that job relies on the servers being up and replicating acl then that makes sense as to the failure.
Though I do see the same output in my other two clusters running 1.8.10 hmm...
Aha and on the primary I see.
[WARN] agent.server.rpc; RPC request to DC is currently failing as no server can be reached: datacenter=secondary
Maybe remove the BUG tag until some more digging is done. This may be a connectivity issue. I will try this exact install in two new clusters and report.
hmm nc
verified connectivity between the two Kubernetes clusters.
More digging to do to see if I can uncover any more details.
The sequence is:
So at step 1, check if the secondary server pods can connect with the primary's mesh gateways.
I have connected with nc
, verified the route to the nodes in the primary from the secondary with traceroute
.
Going to try a few more things to obtain more data.
The sequence above is super helpful. I am certainly facing some sort of network issue between the clusters as the connectivity was able to be established.
The helm chart succeeded when I logged into the init job container and established a connection with the mesh gateway via NC then it all went as I would expect. I am closing this issue as it is verified to be a network problem and not a bug with the helm chart.
Overview of the Issue
Installation of Primary datacenter went well and by the book and was installed with the following chart CHART: consul-0.32.1 CONSUL: 1.10.0
Installation of Secondary datacenter failed to execute the
consul-server-acl-init
job in the secondary datacenter with federation enabled. CHART: consul-0.32.1 CONSUL: 1.10.0Reproduction Steps
1. Installation of Consul in the Primary DC
helm install consul -n consul -f $BELOW_VALUES hashicorp/consul --set global.name=consul
2. Create the Proxy Default in Primary
kubectl apply -n consul -f $BELOW_YAML
3. Export
consul-federation
from Primary and Import to Secondarykubectl get secret -n consul consul-federation -oyaml > secret.consul-federation.yaml
kucbectl config use-context secondary
kubectl apply -n consul -f secret.consul-federation.yaml
4. Installation of Consul in Secondary DC
helm install consul -n consul -f $BELOW_VALUES hashicorp/consul --set global.name=consul
Expected behavior
I expect the secondary datacenter to come online but federated with the primary.
Environment details
consul-k8s
version: 0.26.0consul-helm
version: 0.32.1values.yaml
used to deploy the helm chart: above for both primary and secondaryAdditional Context
kubectl get pods -n consul
Additionally, you can view the logs from the acl init job after the servers successfully came online.
kubectl logs -n consul consul-server-acl-init-2k4z8
Since this job never successfully completes the secondary region cannot come online due to missing secrets such as the client token, mesh gateway token etc...