hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Server nodes of one cluster are joined to other cluster after stop/start #15918

Open ksandrmatveyev opened 1 year ago

ksandrmatveyev commented 1 year ago

Overview of the Issue

Hello. We have multiple Consul clusters with ACLs in the same AWS VPC and the same AWS account, but logically split (e.g. A, B, etc.). Cloud auto-join is used (server config):

Server config (A):

```json
{
  "datacenter": "eu-west-1",
  "primary_datacenter": "eu-west-1",
  "retry_join": ["provider=aws tag_key=consul tag_value=taga region=eu-west-1"],
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "enable_token_persistence": true,
    "tokens": {
      "master": "dummy_master_token_dev"
    }
  }
}
```

Server config (B):

```json
{
  "datacenter": "eu-west-1",
  "primary_datacenter": "eu-west-1",
  "retry_join": ["provider=aws tag_key=consul tag_value=tagb region=eu-west-1"],
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "enable_token_persistence": true,
    "tokens": {
      "master": "dummy_master_token_qa"
    }
  }
}
```

We then started registering other nodes and services against their own Consul clusters (A agent nodes to A, B to B, etc.) with the following consul agent configs:

Client config (A):

```json
{
  "node_name": "dummy_name_a_agent",
  "client_addr": "127.0.0.1",
  "advertise_addr": "dummy_private_ip_a",
  "datacenter": "eu-west-1",
  "retry_join": ["provider=aws tag_key=consul tag_value=taga region=eu-west-1"],
  "leave_on_terminate": true,
  "acl": {
    "tokens": {
      "agent": "dummy_agent_token"
    }
  }
}
```

Client config (B):

```json
{
  "node_name": "dummy_name_b_agent",
  "client_addr": "127.0.0.1",
  "advertise_addr": "dummy_private_ip_b",
  "datacenter": "eu-west-1",
  "retry_join": ["provider=aws tag_key=consul tag_value=tagb region=eu-west-1"],
  "leave_on_terminate": true,
  "acl": {
    "tokens": {
      "agent": "dummy_agent_token"
    }
  }
}
```

This was successful. But we found that when we shut the cluster A Consul server nodes down for the night (by stopping their EC2 instances) and start them up again in the morning, these nodes try to join Consul cluster B (even though the master tokens and retry_join targets are different).

This does not happen when the Consul server nodes are provisioned from scratch (nodes join as expected: A server nodes to A, B to B). According to a response from AWS Support, there were no issues with the AWS EC2 API at the time.

We suspect the agent nodes and service registration are somehow involved, but we haven't found any useful information in the logs.

Expected behavior

Consul server and agent nodes must only join the nodes targeted by retry_join.

Reproduction Steps

Steps to reproduce this issue, e.g.:

  1. Create a cluster A with 3 server nodes with cloud auto-join
  2. Create a cluster B with 3 server nodes with cloud auto-join
  3. Join 1+ consul agent nodes to cluster A
  4. Join other 1+ consul agent nodes to cluster B
  5. Stop the server nodes in cluster A (stop the EC2 instances)
  6. Start the server nodes in cluster A again (start the EC2 instances)
  7. Observe log messages on cluster B about cluster A nodes trying to connect:
    [INFO]  agent.server: Handled event for server in area: event=member-join server=dummy-hostname-node-cluster-a.eu-west-1 area=wan
  8. Observe error messages on the nodes of cluster A:
    [ERROR] ACL not found

Consul info for both Client and Server

Server info (A):

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 56171a4e
    version = 1.10.8
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = dummy_a_ip1:8300
    server = true
raft:
    applied_index = 284716259
    commit_index = 284716259
    fsm_pending = 0
    last_contact = 0
    last_log_index = 284716259
    last_log_term = 2
    last_snapshot_index = 284700030
    last_snapshot_term = 2
    latest_configuration = [{Suffrage:Voter ID:896c3f11-7a69-be8e-869a-e212a5971696 Address:dummy_a_ip1:8300} {Suffrage:Voter ID:b13f8198-702c-031e-fa46-5777ef8e0bb9 Address:dummy_a_ip2:8300} {Suffrage:Voter ID:3159cd11-b2a0-40e9-821c-0fbf3e0e6558 Address:dummy_a_ip3:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 2
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 237
    max_procs = 1
    os = linux
    version = go1.16.12
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 6
    member_time = 255598
    members = 33
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 5
    members = 3
    query_queue = 0
    query_time = 1
```
Client info (A):

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 10
build:
    prerelease =
    revision = 56171a4e
    version = 1.10.8
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 54
    max_procs = 2
    os = windows
    version = go1.16.12
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 6
    member_time = 255625
    members = 33
    query_queue = 0
    query_time = 1
```
Server info (B):

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 56171a4e
    version = 1.10.8
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = dummy_b_ip1:8300
    server = true
raft:
    applied_index = 290647230
    commit_index = 290647230
    fsm_pending = 0
    last_contact = 77.975468ms
    last_log_index = 290647230
    last_log_term = 30764
    last_snapshot_index = 290646290
    last_snapshot_term = 30764
    latest_configuration = [{Suffrage:Voter ID:6c123413-4a7d-b9e8-61da-86b1fb9923c3 Address:dummy_b_ip1:8300} {Suffrage:Voter ID:d9c77dd7-95ec-3a7a-2d22-35c197729101 Address:dummy_b_ip2:8300} {Suffrage:Voter ID:d0b7b2c2-13d0-9a90-256f-f5dafd306ecd Address:dummy_b_ip3:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 30764
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 399
    max_procs = 1
    os = linux
    version = go1.16.12
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 447
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 37
    member_time = 255991
    members = 106
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 199657
    members = 3
    query_queue = 0
    query_time = 1
```
Client info (B):

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 39
build:
    prerelease =
    revision = 56171a4e
    version = 1.10.8
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 57
    max_procs = 4
    os = windows
    version = go1.16.12
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 447
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 37
    member_time = 256008
    members = 106
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

Servers - Linux, amd64; the Consul server runs as a container managed by systemd; all in the same AWS VPC.
Agents - Windows Server 2019; Consul runs as a Windows service; in different subnets and VPCs.

Log Fragments

Log messages on cluster B showing that cluster A server nodes are trying to connect:

[INFO]  agent.server: Handled event for server in area: event=member-join server=dummy-hostname-node-cluster-a.eu-west-1 area=wan

Error messages on the nodes of cluster A:

[ERROR] ACL not found
blake commented 1 year ago

Consul agents learn about other members of the cluster via gossip. You mentioned that the server agents exist in the same VPC. As such, they are likely discovering information about the other cluster and sharing it with the client agents in the other VPCs.

An easy way to protect against this is to configure a unique gossip encryption key for each cluster. You can set this key using the Consul agent's -encrypt command-line flag or the corresponding encrypt parameter in the agent configuration file.
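For example, every server and client agent in cluster A would add something along these lines (a minimal sketch; `dummy_gossip_key_a` stands in for a real Base64 key generated with `consul keygen`, and cluster B would get its own `dummy_gossip_key_b`):

```json
{
  "datacenter": "eu-west-1",
  "retry_join": ["provider=aws tag_key=consul tag_value=taga region=eu-west-1"],
  "encrypt": "dummy_gossip_key_a"
}
```

Gossip packets that fail to decrypt are simply dropped, so agents holding different keys can no longer merge their memberships.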

Ideally, you'd also segment the servers for clusters A and B into separate VPCs, or at least prevent gossip communication between those nodes on TCP/UDP port 8301.
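A hypothetical variant of the same idea, if firewalling between the clusters is impractical: run each cluster's Serf listeners on different ports via the ports stanza, so the clusters no longer share a gossip port at all. A sketch for cluster B (8311/8312 are arbitrary choices, and every agent in that cluster would need the same values):

```json
{
  "ports": {
    "serf_lan": 8311,
    "serf_wan": 8312
  }
}
```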

ksandrmatveyev commented 1 year ago

Thanks for the explanation @blake. I wonder whether that limitation (the encrypt parameter must be set if you have multiple Consul clusters in one LAN, or alternatively one cluster per LAN/VPC) should be added to the docs? I'm also still curious about retry_join: my initial understanding was that it constrained communication to the Consul clients and servers found by cloud auto-join.

Anyway, I followed the enable-gossip-encryption-existing-cluster guide and rolled it out as follows (the resulting agent config is sketched after the list):

  1. Add encrypt for the servers and restart the Consul service on them one by one, with encrypt_verify_incoming/encrypt_verify_outgoing set to false
  2. Add encrypt for the agents and restart the Consul service on them one by one, with encrypt_verify_incoming/encrypt_verify_outgoing set to false
  3. Set encrypt_verify_outgoing to true for the servers and restart the Consul service on them one by one
  4. Set encrypt_verify_outgoing to true for the agents and restart the Consul service on them one by one
  5. Set encrypt_verify_incoming to true for the servers and restart the Consul service on them one by one
  6. Set encrypt_verify_incoming to true for the agents and restart the Consul service on them one by one
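After step 6, every agent ends up with a configuration along these lines (a sketch; `dummy_gossip_key` stands in for the key generated with `consul keygen`):

```json
{
  "encrypt": "dummy_gossip_key",
  "encrypt_verify_incoming": true,
  "encrypt_verify_outgoing": true
}
```

Both encrypt_verify_* settings default to true, so they can be dropped from the config once the rollout is complete.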