Dead consul node registered in service resulting in traffic routed to dead container. #19689

Closed: donaltuohy closed this issue 9 months ago

donaltuohy commented 11 months ago

Overview of the Issue

We use Consul for service discovery for our services on AWS ECS. We have a Consul cluster with 5 nodes in EC2. Each ECS task we spin up has a main container with business logic and a Consul sidecar that registers the main service on startup. We then have a router with a Consul sidecar that uses Consul DNS to look up these services. When a task starts up, it successfully registers with Consul for DNS lookups, and when it is terminated, it successfully removes itself from the cluster. This has been working well so far, but we have encountered a very strange issue with ghost nodes remaining registered with services. It happens very sporadically and we haven't been able to replicate it on demand.

When it does happen, a portion of requests are routed to a dead container because the DNS query returns the IP of the dead node. We are not sure what triggers it, but here is what we have observed when it happens.

All of our services run two tasks, so there are two Consul sidecar nodes registered for each service. When this issue happens, there are three nodes registered, as shown here:

Service catalog result (`curl http://127.0.0.1:8500/v1/catalog/service/synthetic-app-123 | jq .`):

```json
[
  { "ID": "002769a4-514d-8492-bdcb-b64e927977d4", "Node": "synthetic-app-123-5144440f3efd4f2c", "Address": "10.1.175.70", "Datacenter": "us-east-1", "TaggedAddresses": { "lan": "10.1.175.70", "lan_ipv4": "10.1.175.70", "wan": "10.1.175.70", "wan_ipv4": "10.1.175.70" }, "NodeMeta": { "consul-network-segment": "" }, "ServiceKind": "", "ServiceID": "synthetic-app-123", "ServiceName": "synthetic-app-123", "ServiceTags": [ "synthetic-app-123" ], "ServiceAddress": "", "ServiceWeights": { "Passing": 1, "Warning": 1 }, "ServiceMeta": {}, "ServicePort": 443, "ServiceSocketPath": "", "ServiceEnableTagOverride": false, "ServiceProxy": { "Mode": "", "MeshGateway": {}, "Expose": {} }, "ServiceConnect": {}, "CreateIndex": 715741, "ModifyIndex": 715741 },
  { "ID": "b10b81b9-c26c-fa72-91d5-c917175f69e4", "Node": "synthetic-app-123-5c2be7c522ac4c77", "Address": "10.1.159.202", "Datacenter": "us-east-1", "TaggedAddresses": { "lan": "10.1.159.202", "lan_ipv4": "10.1.159.202", "wan": "10.1.159.202", "wan_ipv4": "10.1.159.202" }, "NodeMeta": { "consul-network-segment": "" }, "ServiceKind": "", "ServiceID": "synthetic-app-123", "ServiceName": "synthetic-app-123", "ServiceTags": [ "synthetic-app-123" ], "ServiceAddress": "", "ServiceWeights": { "Passing": 1, "Warning": 1 }, "ServiceMeta": {}, "ServicePort": 443, "ServiceSocketPath": "", "ServiceEnableTagOverride": false, "ServiceProxy": { "Mode": "", "MeshGateway": {}, "Expose": {} }, "ServiceConnect": {}, "CreateIndex": 771968, "ModifyIndex": 771968 },
  { "ID": "ed580aad-8562-2347-0113-5c252eb68d70", "Node": "synthetic-app-123-6f27a5a8f36e4185", "Address": "10.1.190.50", "Datacenter": "us-east-1", "TaggedAddresses": { "lan": "10.1.190.50", "lan_ipv4": "10.1.190.50", "wan": "10.1.190.50", "wan_ipv4": "10.1.190.50" }, "NodeMeta": { "consul-network-segment": "" }, "ServiceKind": "", "ServiceID": "synthetic-app-123", "ServiceName": "synthetic-app-123", "ServiceTags": [ "synthetic-app-123" ], "ServiceAddress": "", "ServiceWeights": { "Passing": 1, "Warning": 1 }, "ServiceMeta": {}, "ServicePort": 443, "ServiceSocketPath": "", "ServiceEnableTagOverride": false, "ServiceProxy": { "Mode": "", "MeshGateway": {}, "Expose": {} }, "ServiceConnect": {}, "CreateIndex": 772011, "ModifyIndex": 772011 }
]
```
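A minimal sketch (assuming the same local agent on 127.0.0.1:8500) that condenses the catalog output above to one node/address pair per line; with two healthy tasks this should print exactly two lines:

```bash
# List node name and address for every catalog entry of the service;
# a deployment of two tasks should yield exactly two rows.
curl -s http://127.0.0.1:8500/v1/catalog/service/synthetic-app-123 \
  | jq -r '.[] | "\(.Node) \(.Address)"'
```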

When looking at the health of these nodes, we notice that the unhealthy node has no serf health check, but it reports that the process check on the main container passed (not true, as that container is no longer running).

Ghost node check status (note: no serf check) (`curl http://127.0.0.1:8500/v1/health/node/synthetic-app-123-5144440f3efd4f2c | jq .`):

```json
[
  { "Node": "synthetic-app-123-5144440f3efd4f2c", "CheckID": "process-check", "Name": "Cage Process Check", "Status": "passing", "Notes": "", "Output": "HTTP GET http://localhost:3032: 200 OK Output: {\"controlPlane\":{\"status\":\"Ok\",\"message\":\"Control plane is running\"},\"dataPlane\":{\"status\":\"Ok\",\"message\":null}}", "ServiceID": "synthetic-app-123 ", "ServiceName": "synthetic-app-123", "ServiceTags": [ "synthetic-app-123-5144440f3efd4f2c" ], "Type": "http", "Interval": "3s", "Timeout": "1s", "ExposedPort": 0, "Definition": {}, "CreateIndex": 715741, "ModifyIndex": 715741 }
]
```
Normal node check status (note: has a serf check) (`curl http://127.0.0.1:8500/v1/health/node/synthetic-app-123-5c2be7c522ac4c77 | jq .`):

```json
[
  { "Node": "synthetic-app-123-5c2be7c522ac4c77", "CheckID": "process-check", "Name": "Cage Process Check", "Status": "passing", "Notes": "", "Output": "HTTP GET http://localhost:3032: 200 OK Output: {\"controlPlane\":{\"status\":\"Ok\",\"message\":\"Control plane is running\"},\"dataPlane\":{\"status\":\"Ok\",\"message\":null}}", "ServiceID": "synthetic-app-123", "ServiceName": "synthetic-app-123", "ServiceTags": [ "synthetic-app-123" ], "Type": "http", "Interval": "3s", "Timeout": "1s", "ExposedPort": 0, "Definition": {}, "CreateIndex": 772280, "ModifyIndex": 772291 },
  { "Node": "synthetic-app-123-5c2be7c522ac4c77", "CheckID": "serfHealth", "Name": "Serf Health Status", "Status": "passing", "Notes": "", "Output": "Agent alive and reachable", "ServiceID": "", "ServiceName": "", "ServiceTags": [], "Type": "", "Interval": "", "Timeout": "", "ExposedPort": 0, "Definition": {}, "CreateIndex": 772278, "ModifyIndex": 772278 }
]
```
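A minimal diagnostic sketch (assuming the catalog and health endpoints shown above) for spotting this state: list the nodes backing the service and flag any that have no serfHealth check registered.

```bash
# For each node backing the service, fetch its node-level checks and flag
# any node that is missing the serfHealth check (the ghost-node signature above).
for node in $(curl -s http://127.0.0.1:8500/v1/catalog/service/synthetic-app-123 | jq -r '.[].Node'); do
  if ! curl -s "http://127.0.0.1:8500/v1/health/node/$node" \
      | jq -e 'any(.[]; .CheckID == "serfHealth")' > /dev/null; then
    echo "no serfHealth check: $node"
  fi
done
```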

Reproduction Steps

We have not been able to reproduce the issue; it just occurs randomly in our cluster.

Consul info for both Client and Server

Consul Server info:

```
sh-4.2$ consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = b7885549
    version = 1.15.6
    version_metadata =
consul:
    acl = disabled
    bootstrap = true
    known_datacenters = 1
    leader = true
    leader_addr = 10.1.204.37:8300
    server = true
raft:
    applied_index = 772536
    commit_index = 772536
    fsm_pending = 0
    last_contact = 0
    last_log_index = 772536
    last_log_term = 2
    last_snapshot_index = 770252
    last_snapshot_term = 2
    latest_configuration = [{Suffrage:Voter ID:e48e183f-57dc-5f0c-c7b3-c9e62552a514 Address:10.1.204.37:8300}]
    latest_configuration_index = 0
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 2
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 204
    max_procs = 2
    os = linux
    version = go1.20.8
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 24632
    members = 9
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1
```
Consul Server config file:

```json
{
  "server": true,
  "datacenter": "us-east-1",
  "data_dir": "/var/consul",
  "log_level": "INFO",
  "log_file": "var/log/consul/",
  "log_rotate_max_files": 5,
  "log_json": true,
  "enable_syslog": true,
  "client_addr": "0.0.0.0",
  "bootstrap_expect": 1,
  "advertise_addr": "10.1.204.37",
  "addresses": {
    "https": "0.0.0.0",
    "dns": "0.0.0.0"
  },
  "encrypt": "[key]",
  "ui_config": {
    "enabled": true
  },
  "dns_config": {
    "a_record_limit": 1
  },
  "retry_join": ["provider=aws tag_key=Name tag_value=consul-server"]
}
```
Consul agent info:

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 1
build:
    prerelease =
    revision = b7885549
    version = 1.15.6
    version_metadata =
consul:
    acl = disabled
    known_servers = 1
    server = false
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 48
    max_procs = 2
    os = linux
    version = go1.20.8
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 24638
    members = 9
    query_queue = 0
    query_time = 1
```
Consul agent config file:

```json
{
  "advertise_reconnect_timeout": "2s",
  "data_dir": "/var/consul/",
  "datacenter": "us-east-1",
  "dns_config": {
    "a_record_limit": 1
  },
  "domain": "consul",
  "enable_central_service_config": true,
  "enable_script_checks": false,
  "encrypt": "[key]",
  "log_level": "INFO",
  "node_name": "synthetic-app-123-5c2be7c522ac4c77",
  "ports": {
    "dns": 8600,
    "http": 8500,
    "https": 8443
  },
  "retry_join": [
    "provider=aws tag_key=Name tag_value=consul"
  ],
  "retry_max": 2,
  "server": false,
  "service": [
    {
      "check": {
        "deregister_critical_service_after": "1s",
        "header": {
          "User-Agent": [
            "ECS-HealthCheck"
          ]
        },
        "http": "http://localhost:3032",
        "id": "process-check",
        "interval": "3s",
        "method": "GET",
        "name": "Cage Process Check",
        "success_before_passing": 4,
        "timeout": "1s",
        "tls_skip_verify": true
      },
      "id": "synthetic-app-123",
      "name": "synthetic-app-123",
      "port": 443,
      "tags": [
        "synthetic-app-123"
      ]
    }
  ]
}
```
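Worth noting about the check configuration above: "deregister_critical_service_after" only removes the service once its check has been critical for that long, and the check state appears to stay frozen at "passing" once the agent and container are gone, so it apparently never fires in this ghost-node case. A minimal sketch of the health-filtered variant of the lookup (using the standard ?passing filter); based on the check output shown earlier, the ghost instance would presumably still be returned, since its only remaining check reports passing:

```bash
# Return only instances whose checks all report passing. The ghost node's
# stale process check is still "passing" and it has no serfHealth check,
# so it would presumably still show up here as well.
curl -s 'http://127.0.0.1:8500/v1/health/service/synthetic-app-123?passing' \
  | jq -r '.[] | "\(.Node.Node) \(.Node.Address)"'
```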

Operating system and Environment details

The Consul servers run as a systemd process on Amazon Linux 2 EC2 instances. The agents run as containers on ECS, backed by Amazon Linux 2 EC2 instances.

Log Fragments

Ghost node logs when it first deregistered (successful shutdown):

```
2023-11-17T13:11:24.915Z [INFO] agent: Caught: signal=terminated
2023-11-17T13:11:24.915Z [INFO] agent: Gracefully shutting down agent...
2023-11-17T13:11:24.915Z [INFO] agent.client: client starting leave
2023-11-17T13:11:25.126Z [INFO] agent.client.serf.lan: serf: EventMemberLeave: synthetic-app-123-5144440f3efd4f2c 10.1.175.70
2023-11-17T13:11:28.245Z [INFO] agent: Synced node info
2023-11-17T13:11:28.247Z [INFO] agent: Synced service: service=synthetic-app-123
2023-11-17T13:11:28.726Z [INFO] agent: Graceful exit completed
2023-11-17T13:11:28.726Z [INFO] agent: Requesting shutdown
2023-11-17T13:11:28.726Z [INFO] agent.client: shutting down client
2023-11-17T13:11:28.730Z [INFO] agent: consul client down
2023-11-17T13:11:28.730Z [INFO] agent: shutdown complete
2023-11-17T13:11:28.730Z [INFO] agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=tcp
2023-11-17T13:11:28.730Z [INFO] agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=udp
2023-11-17T13:11:28.730Z [INFO] agent: Stopping server: address=127.0.0.1:8443 network=tcp protocol=https
2023-11-17T13:11:28.730Z [INFO] agent: Stopping server: address=127.0.0.1:8500 network=tcp protocol=http
2023-11-17T13:11:28.730Z [INFO] agent: Waiting for endpoints to shut down
2023-11-17T13:11:28.730Z [INFO] agent: Endpoints down
2023-11-17T13:11:28.730Z [INFO] agent: Exit code: code=0
```

Our initial thought was that the reused IPs from ECS were causing nodes to be resurrected when another container is assigned the IP of a previously registered node, e.g. node 1 registers with IP A and is then terminated, and node 2 later starts up and is assigned IP A now that it is free.

Could this be why the issue is occurring?

huikang commented 11 months ago

@donaltuohy , I am wondering what the DNS hostname resolves to after node-2 starts and gets IP-A, as it is now free. If it resolves to IP-A, it is highly possible that requests to service-A could be directed to IP-A.

donaltuohy commented 11 months ago

@huikang thanks for looking! Just to be clear, we haven't confirmed that the flow with node 1 and node 2 is actually happening; it was just a suspicion as to what might be going on. We have "a_record_limit": 1 set in our DNS config, so Consul is shuffling the DNS records, but yes, the dead node's IP is being resolved along with the IPs of the two nodes that are actually running and healthy.


```
$ dig @localhost -p 8600 synthetic-app-123-5144440f3efd4f2c.service.consul +short
10.1.168.115  # HEALTHY NODE 1

$ dig @localhost -p 8600 synthetic-app-123-5144440f3efd4f2c.service.consul +short
10.1.175.70   # DEAD NODE

$ dig @localhost -p 8600 synthetic-app-123-5144440f3efd4f2c.service.consul +short
10.1.142.58   # HEALTHY NODE 2
```

This is the main issue for us: our router gets the dead node's IP and then routes traffic to it, thinking it is fine.
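A hedged cleanup sketch (assuming the default API address, no ACLs, and the node name from the catalog output above): a ghost registration like this can usually be cleared by hand while waiting for a proper fix, either via the catalog deregister endpoint or, if the dead node still appears in `consul members`, by pruning it with force-leave.

```bash
# Remove the ghost node (and any service instances registered on it)
# directly from the catalog.
curl -s -X PUT http://127.0.0.1:8500/v1/catalog/deregister \
  -d '{"Datacenter": "us-east-1", "Node": "synthetic-app-123-5144440f3efd4f2c"}'

# If the dead node is still listed in `consul members`, prune it from the
# member list as well (run against a server agent).
consul force-leave -prune synthetic-app-123-5144440f3efd4f2c
```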

donaltuohy commented 9 months ago

Just writing back here to say that the issue appears to have been resolved by updating to Consul 1.17.0. The issue was discussed on the HashiCorp forum in this thread: https://discuss.hashicorp.com/t/ghost-agent-node-no-serf-check-but-still-registered/60432/3