hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

X-Consul-Index going backwards due to different results from servers #7939

Open cshabi opened 4 years ago

cshabi commented 4 years ago

Overview of the Issue

We are using blocking queries to watch for health changes in our services, applying the standard mechanism of polling with an index and using the returned X-Consul-Index header as the index for the next poll. We query the local Consul agent with a query that looks like this (note the stale=true):

v1/health/service/ServiceName?index=&passing=true&wait=1s&stale=true&tag=environment-prod

We recently found out that each server consistently returns the same index (possibly for many hours, until the watched service changes), but different servers return different indexes. Since we are using stale=true, the blocking queries return an X-Consul-Index that keeps moving up and down depending on which Consul server answered. Is this behavior expected? Is the combination of stale reads and blocking queries not supported?
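For illustration, a minimal sketch of this polling pattern using the official hashicorp/consul/api Go client (the service name, tag, and error handling are placeholders, not the actual watcher code); the per-server responses below show how the returned index differs:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local Consul agent (default 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var waitIndex uint64 // start at 0, then reuse the X-Consul-Index from the previous poll
	for {
		// Equivalent of: /v1/health/service/ServiceName?passing=true&wait=1s&stale=true&tag=environment-prod
		entries, meta, err := client.Health().Service("ServiceName", "environment-prod", true, &api.QueryOptions{
			AllowStale: true,            // stale=true: any server may answer
			WaitIndex:  waitIndex,       // block until data newer than this index
			WaitTime:   1 * time.Second, // wait=1s
		})
		if err != nil {
			log.Printf("query failed: %v", err)
			time.Sleep(time.Second)
			continue
		}

		// meta.LastIndex is the X-Consul-Index header; with stale reads it can
		// differ depending on which server answered, which is what this issue is about.
		fmt.Printf("instances=%d index=%d\n", len(entries), meta.LastIndex)
		waitIndex = meta.LastIndex
	}
}
```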

http://Follower1:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
Vary: Accept-Encoding
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1672207871
X-Consul-Knownleader: true
X-Consul-Lastcontact: 34

http://Follower2:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
Vary: Accept-Encoding
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1672201982
X-Consul-Knownleader: true
X-Consul-Lastcontact: 7

http://Leader:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
Vary: Accept-Encoding
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1672198845
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

When we remove the stale=true parameter we always get "1672198845", which is the index returned by the leader. We also noticed that for the problematic services in this state, ModifyIndex and CreateIndex are identical to each other and to the value returned in the header (when setting stale=true).
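For illustration, that difference can be observed by issuing the same query with and without stale reads and comparing the returned index; a minimal sketch using the Go API client (the service name is a placeholder):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Default consistency: the read is answered by the leader.
	_, defMeta, err := client.Health().Service("ServiceName", "", true, nil)
	if err != nil {
		log.Fatal(err)
	}

	// stale=true: whichever server happens to answer may return its own index.
	_, staleMeta, err := client.Health().Service("ServiceName", "", true,
		&api.QueryOptions{AllowStale: true})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("default index=%d  stale index=%d (last contact %s)\n",
		defMeta.LastIndex, staleMeta.LastIndex, staleMeta.LastContact)
}
```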

Consul info Follower1

Follower1

```
agent:
	check_monitors = 2
	check_ttls = 0
	checks = 2
	services = 2
build:
	prerelease =
	revision = 39f93f01
	version = 1.2.1
consul:
	bootstrap = false
	known_datacenters = 4
	leader = false
	leader_addr = :8300
	server = true
raft:
	applied_index = 1677031174
	commit_index = 1677031174
	fsm_pending = 0
	last_contact = 61.142668ms
	last_log_index = 1677031175
	last_log_term = 359458
	last_snapshot_index = 1677031112
	last_snapshot_term = 359458
	latest_configuration = [{Suffrage:Voter ID:dde09378-4944-7018-9711-6be17deb00f3 Address::8300} {Suffrage:Voter ID:43b09dd3-b385-3b71-b4c5-9ef03df42ac6 Address::8300} {Suffrage:Voter ID:cfdd42a9-44e5-590a-1a50-e5e79935b30a Address::8300}]
	latest_configuration_index = 1669000507
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 359458
runtime:
	arch = amd64
	cpu_count = 96
	goroutines = 86511
	max_procs = 96
	os = linux
	version = go1.10.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 154693
	failed = 2
	health_score = 0
	intent_queue = 0
	left = 7
	member_time = 241034
	members = 2684
	query_queue = 0
	query_time = 1294
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 2042
	members = 10
	query_queue = 0
	query_time = 1
```

Consul info Follower2

Follower2

```
agent:
	check_monitors = 2
	check_ttls = 0
	checks = 2
	services = 2
build:
	prerelease =
	revision = 39f93f01
	version = 1.2.1
consul:
	bootstrap = false
	known_datacenters = 4
	leader = false
	leader_addr = :8300
	server = true
raft:
	applied_index = 1677031419
	commit_index = 1677031419
	fsm_pending = 0
	last_contact = 14.931091ms
	last_log_index = 1677031419
	last_log_term = 359458
	last_snapshot_index = 1677016577
	last_snapshot_term = 359458
	latest_configuration = [{Suffrage:Voter ID:dde09378-4944-7018-9711-6be17deb00f3 Address::8300} {Suffrage:Voter ID:43b09dd3-b385-3b71-b4c5-9ef03df42ac6 Address::8300} {Suffrage:Voter ID:cfdd42a9-44e5-590a-1a50-e5e79935b30a Address::8300}]
	latest_configuration_index = 1669000507
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 359458
runtime:
	arch = amd64
	cpu_count = 96
	goroutines = 90189
	max_procs = 96
	os = linux
	version = go1.10.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 154693
	failed = 2
	health_score = 0
	intent_queue = 0
	left = 7
	member_time = 241034
	members = 2684
	query_queue = 0
	query_time = 1294
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 2042
	members = 10
	query_queue = 0
	query_time = 1
```

Consul info Leader

Leader info

```
agent:
	check_monitors = 2
	check_ttls = 0
	checks = 2
	services = 2
build:
	prerelease =
	revision = 39f93f01
	version = 1.2.1
consul:
	bootstrap = false
	known_datacenters = 4
	leader = true
	leader_addr = :8300
	server = true
raft:
	applied_index = 1677031859
	commit_index = 1677031859
	fsm_pending = 0
	last_contact = 0
	last_log_index = 1677031859
	last_log_term = 359458
	last_snapshot_index = 1677020696
	last_snapshot_term = 359458
	latest_configuration = [{Suffrage:Voter ID:dde09378-4944-7018-9711-6be17deb00f3 Address::8300} {Suffrage:Voter ID:43b09dd3-b385-3b71-b4c5-9ef03df42ac6 Address::8300} {Suffrage:Voter ID:cfdd42a9-44e5-590a-1a50-e5e79935b30a Address::8300}]
	latest_configuration_index = 1669000507
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 359458
runtime:
	arch = amd64
	cpu_count = 96
	goroutines = 77999
	max_procs = 96
	os = linux
	version = go1.10.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 154693
	failed = 2
	health_score = 0
	intent_queue = 0
	left = 7
	member_time = 241034
	members = 2684
	query_queue = 0
	query_time = 1294
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 2042
	members = 10
	query_queue = 0
	query_time = 1
```

Operating system and Environment details

All servers are running Ubuntu 16.04 with kernel version 4.15.0-99-generic. The Consul data dir is on an SSD, and all servers have 96 cores.

Most agents run on Kubernetes nodes (Ubuntu 16); services run in containers and are registered upon deployment using registrator. Other agents run on a variety of OS distributions (all Ubuntu flavours), but in that case services are registered using JSON files.

pierresouchay commented 4 years ago

What is really weird is that the followers have a greater index than the leader; to me, that looks like a kind of corruption. For each of those results, is the returned index part of the output, in the CreateIndex or ModifyIndex fields? (If it is not, something is really wrong and I would delete the raft database on that machine.)
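For illustration, one way to answer that question is to fetch the health endpoint directly and check whether the X-Consul-Index header matches any CreateIndex/ModifyIndex in the payload; a minimal sketch with plain HTTP (the server address and service name are placeholders, and only the index fields are decoded):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strconv"
)

// Minimal subset of the /v1/health/service response: each object carries
// Raft indexes in CreateIndex/ModifyIndex; everything else is ignored.
type indexed struct {
	CreateIndex uint64
	ModifyIndex uint64
}

type serviceEntry struct {
	Node    indexed
	Service indexed
	Checks  []indexed
}

func main() {
	// Placeholder server and service; adjust to the server being checked.
	url := "http://Follower1:8500/v1/health/service/ServiceName?passing=true&stale=true"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	headerIndex, _ := strconv.ParseUint(resp.Header.Get("X-Consul-Index"), 10, 64)

	var entries []serviceEntry
	if err := json.NewDecoder(resp.Body).Decode(&entries); err != nil {
		log.Fatal(err)
	}

	// Does the header index actually appear as a CreateIndex/ModifyIndex
	// anywhere in the payload?
	found := false
	for _, e := range entries {
		for _, ix := range append([]indexed{e.Node, e.Service}, e.Checks...) {
			if ix.CreateIndex == headerIndex || ix.ModifyIndex == headerIndex {
				found = true
			}
		}
	}
	fmt.Printf("X-Consul-Index=%d found in payload: %v\n", headerIndex, found)
}
```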

We had something similar happen once on a few of our old servers in a cluster, which led to other issues such as https://github.com/hashicorp/consul/issues/6181

In order to fix that, we did the following for each server, one by one:

  1. consul leave
  2. rm -fr /path/to_consul_raft/db
  3. Rejoin cluster
  4. Once joined, do it for the next server

We only hit this once across our clusters (12 clusters), on 2 servers in the same cluster, and since those servers were quite old we never investigated much further, but you might have encountered a similar issue.

To debug this, you might use https://github.com/pierresouchay/consul-ops-tools/blob/master/bin/consul_check_services_changes.sh, which makes it very easy to spot differences in the output.
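For illustration, the core of that kind of check can be sketched in a few lines of Go: query the same health endpoint on each server with stale=true and compare the returned index headers (the server names and service name are placeholders, and the linked script does more than this sketch):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Placeholder server addresses; query each one directly.
	servers := []string{"Leader:8500", "Follower1:8500", "Follower2:8500"}
	path := "/v1/health/service/ServiceName?passing=true&stale=true"

	for _, s := range servers {
		resp, err := http.Get("http://" + s + path)
		if err != nil {
			log.Printf("%s: %v", s, err)
			continue
		}
		resp.Body.Close()
		// With stale reads, each server reports its own view of the index.
		fmt.Printf("%-16s X-Consul-Index=%s X-Consul-Lastcontact=%s\n",
			s, resp.Header.Get("X-Consul-Index"), resp.Header.Get("X-Consul-Lastcontact"))
	}
}
```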

cshabi commented 4 years ago

Hi @pierresouchay, thanks for the quick response!

We replaced our Consul servers a month ago (one by one) with completely new servers. Wouldn't that be equivalent to the steps you suggested above?

We will check out the tool you linked and try to incorporate it into our monitoring stack.

pierresouchay commented 4 years ago

@cshabi yes, indeed, this is equivalent

jsosulska commented 4 years ago

Hi @cshabi ,

Thanks for posting. I'd like to collect some additional information, if possible.

  1. Is this running in some cloud provider, like AWS, or bare metal?
  2. I saw you mention Kubernetes. Is Consul launched as a service on Kubernetes, or is it on hosts external to the Kubernetes cluster? If Consul is launched as a service on Kubernetes, how did you deploy it?
  3. Have you looked at our implementation details around blocking queries? That page recommends some steps for when the index goes backwards (see the sketch after this list).
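A rough paraphrase of that guidance, for illustration only (nextWaitIndex is a hypothetical helper name, not part of any Consul API): reset the index and restart the blocking loop whenever the returned index goes backwards, and never reuse a non-positive index.

```go
package main

import "fmt"

// nextWaitIndex paraphrases the blocking-query guidance as described above:
// reset to 0 (effectively restarting the blocking loop) whenever the returned
// index goes backwards, and never reuse a non-positive index.
func nextWaitIndex(previous, returned uint64) uint64 {
	if returned < previous || returned < 1 {
		return 0
	}
	return returned
}

func main() {
	fmt.Println(nextWaitIndex(1672207871, 1672201982)) // went backwards -> 0
	fmt.Println(nextWaitIndex(1672201982, 1672207871)) // moved forward -> 1672207871
}
```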

Looking forward to hearing back from you!

cshabi commented 4 years ago

Hi @pierresouchay,

Following are the steps we did (repeated for all servers):

  1. stop consul agent on Follower1
  2. mv /path/to_consul_raft/raft.db /path/to_consul_raft/raft.db.bck
  3. start consul agent on Follower1

I'm adding some logs and graphs. These graphs show when we stopped each of the servers (we started them again about 2 minutes later).

Stopping consul Follower1:

[Screenshot: Screen Shot 2020-06-07 at 15 22 51]

Stopping consul Follower2:

[Screenshot: Screen Shot 2020-06-07 at 15 25 04]

Stopping the leader:

[Screenshot: Screen Shot 2020-06-07 at 15 27 29]

These are the logs from when each server rejoined the cluster (it looks like the leader sent older logs):

The return of Follower1

```
2020/06/07 06:49:11 [INFO] consul: Handled member-join event for server "Follower1.dc" in area "wan"
2020/06/07 06:49:11 [WARN] consul.fsm: EnsureRegistration failed: failed inserting check: Missing service registration
2020/06/07 06:49:13 [WARN] raft: Rejecting vote request from :8300 since we have a leader: :8300
2020/06/07 06:49:13 [INFO] serf: EventMemberJoin: Follower1 IP
2020/06/07 06:49:13 [INFO] consul: Adding LAN server Follower1 (Addr: tcp/IP:8300) (DC: dc)
2020/06/07 06:49:13 [INFO] consul: member 'Follower1' joined, marking health alive
2020/06/07 06:49:14 [WARN] raft: AppendEntries to {Voter c56d717f-00ff-5f3d-a764-4459fc860f4f Follower1_IP:8300} rejected, sending older logs (next: 1390436201)
2020/06/07 06:49:15 [INFO] raft: pipelining replication to peer {Voter c56d717f-00ff-5f3d-a764-4459fc860f4f Follower1_IP:8300}
2020/06/07 06:49:30 [WARN] consul.fsm: EnsureRegistration failed: failed inserting check: Missing service registration
```

The return of Follower2

```
2020/06/07 06:56:20 [INFO] serf: EventMemberUpdate: Follower2.dc
2020/06/07 06:56:20 [WARN] raft: AppendEntries to {Voter b5320274-520b-08e2-7117-6d40bf0ac936 Follower2_IP:8300} rejected, sending older logs (next: 1390431637)
2020/06/07 06:56:23 [INFO] raft: pipelining replication to peer {Voter b5320274-520b-08e2-7117-6d40bf0ac936 Follower2_IP:8300}
2020/06/07 06:56:26 [INFO] serf: EventMemberJoin: Follower2 IP
2020/06/07 06:56:26 [INFO] consul: Adding LAN server Follower2 (Addr: tcp/IP:8300) (DC: dc)
2020/06/07 06:56:26 [INFO] consul: member 'Follower2' joined, marking health alive
2020/06/07 06:56:35 [WARN] consul.fsm: EnsureRegistration failed: failed inserting check: Missing service registration
2020/06/07 06:57:21 [WARN] consul.fsm: EnsureRegistration failed: failed inserting check: Missing service registration
```

The return of the leader (as follower)

```
2020/06/07 07:12:41 [INFO] serf: EventMemberJoin: Leader.dc leader_IP
2020/06/07 07:12:41 [INFO] consul: Handled member-join event for server "Leader.dc" in area "wan"
2020/06/07 07:12:41 [INFO] serf: EventMemberJoin: Leader IP
2020/06/07 07:12:41 [INFO] consul: Adding LAN server Leader (Addr: tcp/IP:8300) (DC: dc)
2020/06/07 07:12:41 [INFO] consul: member 'Leader' joined, marking health alive
2020/06/07 07:12:47 [WARN] raft: AppendEntries to {Voter f0800617-4070-9d22-3b80-48ea669e5fa5 Leader_IP:8300} rejected, sending older logs (next: 1390442237)
2020/06/07 07:12:49 [INFO] raft: pipelining replication to peer {Voter f0800617-4070-9d22-3b80-48ea669e5fa5 Leader_IP:8300}
2020/06/07 07:12:53 [WARN] consul.fsm: EnsureRegistration failed: failed inserting check: Missing service registration
```

These are the values of the X-Consul-Index header returned from each server for the same service as in my original message above:

http://New_Leader:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1390431636
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

http://Follower1:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1390436200
X-Consul-Knownleader: true
X-Consul-Lastcontact: 39

http://Follower2:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1390442236
X-Consul-Knownleader: true
X-Consul-Lastcontact: 45

Besides not solving our issue, the problem now looks worse. In the graphs above we can see that the number of services reporting 'IndexWentBackwards' increased drastically after our maintenance.

Hi @jsosulska,

  1. This is a bare-metal data center that we manage
  2. Consul is deployed as a service on each of the Kubernetes nodes (installed on the host OS, not as a Kubernetes deployment). We use registrator to manage registration/de-registration of containerized services into Consul
  3. Will give it a look, thanks :)

pierresouchay commented 3 years ago

Fixed by https://github.com/hashicorp/consul/pull/8474