Open cshabi opened 4 years ago
What is really weird is that the followers have a greater index than the leader; to me, that's a kind of corruption. For each of those results, is the returned index part of the output, in the fields CreateIndex or ModifyIndex? (If it is not, something is really wrong, and I'd delete the raft database on this machine.)
We saw something similar once on a few of our old servers in a cluster, which led to other issues such as https://github.com/hashicorp/consul/issues/6181
To fix that, we did the following on each server, one by one:
We had this only once in our clusters (12 clusters), on 2 servers in the same cluster, and since those servers were quite old, we never investigated much further, but you might have encountered a similar issue.
To debug this, you might use https://github.com/pierresouchay/consul-ops-tools/blob/master/bin/consul_check_services_changes.sh that helps by showing differences in output very easily
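The check asked about above (does the returned index show up as a CreateIndex or ModifyIndex somewhere in the response body?) can be sketched in a few lines. This is only an illustration: `collect_raft_indexes` and `header_index_in_body` are hypothetical helper names, the sample body is made up, and the walk deliberately makes no assumption about the exact response layout beyond the two field names mentioned in this thread.

```python
import json
from typing import Any, Set


def collect_raft_indexes(payload: Any) -> Set[int]:
    """Recursively gather every CreateIndex/ModifyIndex value in decoded JSON."""
    found: Set[int] = set()
    if isinstance(payload, dict):
        for key, value in payload.items():
            if key in ("CreateIndex", "ModifyIndex") and isinstance(value, int):
                found.add(value)
            else:
                found |= collect_raft_indexes(value)
    elif isinstance(payload, list):
        for item in payload:
            found |= collect_raft_indexes(item)
    return found


def header_index_in_body(header_index: int, body: str) -> bool:
    """True if the X-Consul-Index value appears as a Create/ModifyIndex in the body."""
    return header_index in collect_raft_indexes(json.loads(body))


# Made-up sample body, only to show the call shape.
SAMPLE_BODY = json.dumps(
    [{"Service": {"ID": "example", "CreateIndex": 100, "ModifyIndex": 120}}]
)
print(header_index_in_body(120, SAMPLE_BODY))
print(header_index_in_body(999, SAMPLE_BODY))
```

Run it against the body and the X-Consul-Index header of the same response from each server; a header index that is absent from every entry would be the "really wrong" case described above.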
Hi @pierresouchay, thanks for the quick response!
We replaced our consul servers a month ago (one by one) with completely new servers. Wouldn't that simulate the steps you suggested above?
We will check out the tool you linked and try to incorporate it into our monitoring stack.
@cshabi yes, indeed, this is equivalent
Hi @cshabi ,
Thanks for posting. I'd like to collect some additional information, if possible.
Looking forward to hearing back from you!
hi @pierresouchay
Following are the steps we did (repeated for all servers):
I'm adding some logs and graphs. These graphs show when we stopped each of the servers (we started them about 2 minutes after):
Stopping consul Follower1:
Stopping consul Follower2:
Stopping the leader:
These are the logs from when each server rejoined the cluster (it looks like the leader provided old logs):
These are the values of the X-Consul-* headers returned from each server for the same service as in my original message:
http://New_Leader:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1390431636
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

http://Follower1:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1390436200
X-Consul-Knownleader: true
X-Consul-Lastcontact: 39

http://Follower2:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1390442236
X-Consul-Knownleader: true
X-Consul-Lastcontact: 45
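To make the gap between servers easier to read, the three X-Consul-Index values above can be compared against the leader's. The helper name `index_offsets` is hypothetical; the numbers are the ones quoted in this comment.

```python
from typing import Dict


def index_offsets(indexes: Dict[str, int], leader: str) -> Dict[str, int]:
    """Return each non-leader server's X-Consul-Index minus the leader's."""
    leader_index = indexes[leader]
    return {
        name: idx - leader_index
        for name, idx in indexes.items()
        if name != leader
    }


# Values copied from the headers listed above.
observed = {
    "New_Leader": 1390431636,
    "Follower1": 1390436200,
    "Follower2": 1390442236,
}
offsets = index_offsets(observed, "New_Leader")
ahead_of_leader = sorted(name for name, delta in offsets.items() if delta > 0)

print(offsets)          # {'Follower1': 4564, 'Follower2': 10600}
print(ahead_of_leader)  # ['Follower1', 'Follower2']
```

Both followers report an index ahead of the leader for the same service, which is the situation described as suspicious earlier in the thread.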
Besides not solving our issue, it looks like its scale is now worse. In the graphs above we see that the number of services reporting 'IndexWentBackwards' increased drastically after our maintenance.
hi @jsosulska
Overview of the Issue
We are using blocking queries to watch for health changes in our services, applying the standard mechanism of polling with some index and using the returned X-Consul-Index header as the index for the next poll. We query the local consul agent with a query that looks like this (note the stale=true):
v1/health/service/ServiceName?index=&passing=true&wait=1s&stale=true&tag=environment-prod
We recently found out that each server consistently returns the same index (sometimes for many hours, until the watched service changes), but different servers return different indexes.
Since we are using stale=true, the blocking queries return an X-Consul-Index that keeps moving up and down depending on which consul server was queried.
Is this issue expected? Is the combination of stale + blocking queries not supported?
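For context, here is a minimal sketch of the polling loop described above, with one defensive rule added: if the returned index is lower than the one we sent, start over from 0. As far as I understand Consul's blocking-query guidance, resetting on a backwards index is the recommended client behaviour, but treat that as an assumption to verify. The names `next_wait_index`, `fetch_index` and `watch` are hypothetical, and `fetch_index` assumes a reachable agent at `base_url`.

```python
import urllib.request
from typing import Callable, List


def next_wait_index(previous: int, returned: int) -> int:
    """Pick the index for the next poll; reset to 0 if the index went backwards."""
    if returned < previous:
        return 0
    return returned


def fetch_index(base_url: str, service: str, index: int) -> int:
    """One stale blocking query; returns the X-Consul-Index response header."""
    url = (
        f"{base_url}/v1/health/service/{service}"
        f"?index={index}&passing=true&wait=1s&stale=true"
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        return int(resp.headers["X-Consul-Index"])


def watch(fetch: Callable[[int], int], rounds: int) -> List[int]:
    """Poll `rounds` times and record the index used for each following poll."""
    index = 0
    history: List[int] = []
    for _ in range(rounds):
        index = next_wait_index(index, fetch(index))
        history.append(index)
    return history


# Example against a local agent (assumed address, not executed here):
# watch(lambda i: fetch_index("http://127.0.0.1:8500", "ServiceName", i), 10)
```

This does not remove the flapping between servers; it only keeps the client from blocking on an index that the next server has not reached.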
http://Follower1:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
Vary: Accept-Encoding
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1672207871
X-Consul-Knownleader: true
X-Consul-Lastcontact: 34

http://Follower2:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
Vary: Accept-Encoding
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1672201982
X-Consul-Knownleader: true
X-Consul-Lastcontact: 7

http://Leader:8500/v1/health/service/ServiceName?index=0&passing=true&wait=1s&stale=true
Vary: Accept-Encoding
X-Consul-Effective-Consistency: stale
X-Consul-Index: 1672198845
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0
When we remove the stale=true parameter we always get "1672198845", which is the index returned by the leader. We also noticed that for the problematic services in this state, the ModifyIndex and CreateIndex are the same and identical to the value returned in the header (when setting stale=true).

Consul info Follower1

```
agent:
    check_monitors = 2
    check_ttls = 0
    checks = 2
    services = 2
build:
    prerelease =
    revision = 39f93f01
    version = 1.2.1
consul:
    bootstrap = false
    known_datacenters = 4
    leader = false
    leader_addr =
```

Consul info Follower2

```
agent:
    check_monitors = 2
    check_ttls = 0
    checks = 2
    services = 2
build:
    prerelease =
    revision = 39f93f01
    version = 1.2.1
consul:
    bootstrap = false
    known_datacenters = 4
    leader = false
    leader_addr =
```

Consul info Leader

```
agent:
    check_monitors = 2
    check_ttls = 0
    checks = 2
    services = 2
build:
    prerelease =
    revision = 39f93f01
    version = 1.2.1
consul:
    bootstrap = false
    known_datacenters = 4
    leader = true
    leader_addr =
```

Operating system and Environment details
All servers are running Ubuntu 16.04 with kernel version 4.15.0-99-generic. The Consul data dir is on SSD, and all servers have 96 cores.
Most agents run on kube nodes (Ubuntu 16); their services run in containers and are registered upon deployment using registrator. Other agents run on a variety of OS distributions (all Ubuntu flavours), but services in that case are registered using JSON files.