ksandrmatveyev opened this issue 1 year ago (status: Open)
Consul agents learn about other members in the cluster via gossip. You mentioned that the server agents exist in the same VPC. As such, they are likely discovering information about the other cluster and sharing it with client agents in the other VPCs.
An easy way to protect against this is to configure a unique gossip encryption key for each cluster. You can set this key using the consul agent's -encrypt argument or the corresponding encrypt option in the agent configuration file.
Ideally, you'd also segment the servers for cluster A and B onto separate VPCs, or at least prevent gossip communication between those nodes on TCP/UDP port 8301.
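A minimal sketch of that, assuming a separate key is generated for each cluster with consul keygen (the value below is a placeholder, not a real key), added to every agent's configuration in cluster A:
```json
{
  "encrypt": "PLACEHOLDER_BASE64_KEY_FOR_CLUSTER_A_ONLY="
}
```
Cluster B would get its own, different key, so agents in one cluster can no longer decode the other cluster's gossip traffic.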
Thanks for the explanation @blake. I wonder whether that limitation (the encrypt parameter must be set if you have multiple Consul clusters in one LAN, as well as the recommendation to run one cluster per LAN/VPC) should be added to the docs? I'm still curious about retry_join, since my initial understanding was that it restricted communication to the Consul clients and servers that were found by cloud auto-join.
Anyway, I followed the enable-gossip-encryption-existing-cluster guide and:
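For reference, the first phase of that guide amounts roughly to adding a fragment like this to every agent in the cluster, followed by a rolling restart (the key is a placeholder generated with consul keygen):
```json
{
  "encrypt": "PLACEHOLDER_KEY_FROM_consul_keygen=",
  "encrypt_verify_incoming": false,
  "encrypt_verify_outgoing": false
}
```
Once every agent has the key, encrypt_verify_outgoing is set back to true and, after another rolling restart, encrypt_verify_incoming as well.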
Overview of the Issue
Hello. We have multiple Consul clusters with ACLs in the same AWS VPC, under the same AWS account, but logically split (e.g. A, B, etc.). Cloud auto-join is used (server config):
Server config (A)
```json { "datacenter": "eu-west-1", "primary_datacenter": "eu-west-1" "retry_join": ["provider=aws tag_key=consul tag_value=taga region=eu-west-1"] "acl": { "enabled": true, "default_policy": "deny", "enable_token_persistence": true, "tokens": { "master": "dummy_master_token_dev" } } } ```Server config (B)
```json { "datacenter": "eu-west-1", "primary_datacenter": "eu-west-1" "retry_join": ["provider=aws tag_key=consul tag_value=tagb region=eu-west-1"], "acl": { "enabled": true, "default_policy": "deny", "enable_token_persistence": true, "tokens": { "master": "dummy_master_token_qa" } } } ```We were starting to register other nodes and services to own Consul clusters (A agent nodes to A, B to B and etc) with consul agent:
Client config (A)
```json { "node_name": "dummy_name_a_agent", "client_addr": "127.0.0.1", "advertise_addr": "dummy_private_ip_a", "datacenter": "eu-west-1", "retry_join": ["provider=aws tag_key=consul tag_value=taga region=eu-west-1"], "leave_on_terminate": true, "acl": { "tokens": { "agent": "dummy_agent_token" } } ```Client config (B)
```json { "node_name": "dummy_name_b_agent", "client_addr": "127.0.0.1", "advertise_addr": "dummy_private_ip_b", "datacenter": "eu-west-1", "retry_join": ["provider=aws tag_key=consul tag_value=tagb region=eu-west-1"], "leave_on_terminate": true, "acl": { "tokens": { "agent": "dummy_agent_token" } } ```It was successful. But we found that when we shutdown (stop EC2 instances) of consul server nodes A for a night, and starting them up at the morning, these nodes are trying to join Consul cluster B (even if master tokens and
retry_join
are different)It is not the case, if consul cluster nodes are provisioned from scratch (nodes are joined as expected: A server nodes to A, B to B). No issues with AWS EC2 API this time as from response of AWS Support
We suspect some effect from the Consul agent nodes and service registration, but we haven't found any useful information in the logs.
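One way to confirm whether the two member lists have actually merged is to compare the serf membership on both sides; a minimal sketch (the node name comes from the client config above):
```shell
# On a cluster A server: list LAN gossip members and look for nodes
# that should belong to cluster B.
consul members

# A stale cross-cluster member can be pruned from the list manually,
# although this does not stop it from being gossiped back in:
consul force-leave dummy_name_b_agent
```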
Expected behavior
Consul server and agent nodes must only join the nodes that are defined in retry_join.
Reproduction Steps
Steps to reproduce this issue:
1. Create two Consul clusters (A and B) with ACLs in the same AWS VPC, each using cloud auto-join with its own EC2 tag value, as in the configs above.
2. Join client agents to their respective clusters.
3. Stop the EC2 instances of the cluster A server nodes for several hours (e.g. overnight).
4. Start the cluster A server instances again.
5. Observe that the restarted A server nodes try to join Consul cluster B, as in the sketch below.
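A rough sketch of the same sequence with the AWS CLI (the instance IDs are placeholders for the cluster A server nodes):
```shell
# Stop the cluster A server instances for the night.
aws ec2 stop-instances --region eu-west-1 --instance-ids i-0aaaaaaaaaaaaaaa1 i-0aaaaaaaaaaaaaaa2 i-0aaaaaaaaaaaaaaa3

# ...hours later, start them again.
aws ec2 start-instances --region eu-west-1 --instance-ids i-0aaaaaaaaaaaaaaa1 i-0aaaaaaaaaaaaaaa2 i-0aaaaaaaaaaaaaaa3

# Once the A servers are back up, check which cluster they joined.
consul members
```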
Consul info for both Client and Server
Server info (A)
```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 56171a4e
    version = 1.10.8
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = dummy_a_ip1:8300
    server = true
raft:
    applied_index = 284716259
    commit_index = 284716259
    fsm_pending = 0
    last_contact = 0
    last_log_index = 284716259
    last_log_term = 2
    last_snapshot_index = 284700030
    last_snapshot_term = 2
    latest_configuration = [{Suffrage:Voter ID:896c3f11-7a69-be8e-869a-e212a5971696 Address:dummy_a_ip1:8300} {Suffrage:Voter ID:b13f8198-702c-031e-fa46-5777ef8e0bb9 Address:dummy_a_ip2:8300} {Suffrage:Voter ID:3159cd11-b2a0-40e9-821c-0fbf3e0e6558 Address:dummy_a_ip3:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 2
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 237
    max_procs = 1
    os = linux
    version = go1.16.12
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 6
    member_time = 255598
    members = 33
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 5
    members = 3
    query_queue = 0
    query_time = 1
```
Client info (A)
```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 10
build:
    prerelease =
    revision = 56171a4e
    version = 1.10.8
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 54
    max_procs = 2
    os = windows
    version = go1.16.12
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 6
    member_time = 255625
    members = 33
    query_queue = 0
    query_time = 1
```
Server info (B)
```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 56171a4e
    version = 1.10.8
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = dummy_b_ip1:8300
    server = true
raft:
    applied_index = 290647230
    commit_index = 290647230
    fsm_pending = 0
    last_contact = 77.975468ms
    last_log_index = 290647230
    last_log_term = 30764
    last_snapshot_index = 290646290
    last_snapshot_term = 30764
    latest_configuration = [{Suffrage:Voter ID:6c123413-4a7d-b9e8-61da-86b1fb9923c3 Address:dummy_b_ip1:8300} {Suffrage:Voter ID:d9c77dd7-95ec-3a7a-2d22-35c197729101 Address:dummy_b_ip2:8300} {Suffrage:Voter ID:d0b7b2c2-13d0-9a90-256f-f5dafd306ecd Address:dummy_b_ip3:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 30764
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 399
    max_procs = 1
    os = linux
    version = go1.16.12
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 447
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 37
    member_time = 255991
    members = 106
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 199657
    members = 3
    query_queue = 0
    query_time = 1
```
Client info (B)
```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 39
build:
    prerelease =
    revision = 56171a4e
    version = 1.10.8
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 57
    max_procs = 4
    os = windows
    version = go1.16.12
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 447
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 37
    member_time = 256008
    members = 106
    query_queue = 0
    query_time = 1
```
Operating system and Environment details
Servers - Linux, amd64; the Consul servers run as containers managed by systemd, all in the same AWS VPC.
Agents - Windows Server 2019; the Consul agent runs as a Windows service, in different subnets and VPCs.
Log Fragments
Error messages on cluster B showing that the A server nodes are trying to connect:
Error messages on the nodes of cluster A: