hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io
Other
28.31k stars 4.42k forks source link

Cache is sometimes returning very old values for a very long time #8980

Open pierresouchay opened 3 years ago

pierresouchay commented 3 years ago

Overview of the Issue

We are using a lot of cached queries for our agents.

On some machines, on Windows (only seen on Windows for now), we sometimes have agents that are never updating their cache. We see this error on a pool of identically configured machines (pool for 200+ machines), it happens randomly on some of those machines.

In such cases, the machine can have stuck results for days.

Requests are doing the following way:

http://localhost:8500/v1/health/service/<serviceName>?cached&index=<LastIndexSeen>

With header: Cache-Control: max-age=60

We also have those configurations for all of our agents:

"cache": {
    "entry_fetch_rate": 0.1,
    "entry_fetch_max_burst": 4
 },
"discovery_max_stale": "5m"

It happens regularily, but we did not find any way to reproduce it for now, but the cache is completely staled. The X-Consul-Index is stuck is in the past (while all servers have the new value).

A call with stale on the agents give the following value:

http://localhost:8500/v1/health/service/<serviceName>?stale
HTTP/1.1 200 OK
Content-Type: application/json
Vary: Accept-Encoding
X-Consul-Effective-Consistency: stale
X-Consul-Index: 5306733927
X-Consul-Knownleader: true
X-Consul-Lastcontact: 55
Date: Mon, 19 Oct 2020 10:51:40 GMT
Transfer-Encoding: chunked

While with cached, it gives the following value:

HTTP/1.1 200 OK
Age: 0
Content-Type: application/json
Vary: Accept-Encoding
X-Cache: HIT
X-Consul-Effective-Consistency: leader
X-Consul-Index: 5285318269
X-Consul-Knownleader: true
X-Consul-Lastcontact: 2
Date: Mon, 19 Oct 2020 11:11:07 GMT
Transfer-Encoding: chunked

Machine will keep this index forever (5285318269) until the agent is being restarted.

The difference between X-Consul-Index is huge. We had thise error on several different services, happens on ~ 1/200 machines (for now, only seen on Windows agents running 1.8.4, but error might have been older).

We had this on several machines, several different services, no specific relation between those machines (boot-time...)

Metrics of the agent containing cache are:

# HELP consul_acl_token_cache_hit consul_acl_token_cache_hit
# TYPE consul_acl_token_cache_hit counter
consul_acl_token_cache_hit 156825
# HELP consul_acl_token_cache_miss consul_acl_token_cache_miss
# TYPE consul_acl_token_cache_miss counter
consul_acl_token_cache_miss 8808
# HELP consul_consul_cache_entries_count consul_consul_cache_entries_count
# TYPE consul_consul_cache_entries_count gauge
consul_consul_cache_entries_count 58
# HELP consul_consul_cache_evict_expired consul_consul_cache_evict_expired
# TYPE consul_consul_cache_evict_expired counter
consul_consul_cache_evict_expired 46
# HELP consul_consul_cache_fetch_error consul_consul_cache_fetch_error
# TYPE consul_consul_cache_fetch_error counter
consul_consul_cache_fetch_error 11
# HELP consul_consul_cache_fetch_success consul_consul_cache_fetch_success
# TYPE consul_consul_cache_fetch_success counter
consul_consul_cache_fetch_success 18663
# HELP consul_consul_cache_health_services_fetch_error consul_consul_cache_health_services_fetch_error
# TYPE consul_consul_cache_health_services_fetch_error counter
consul_consul_cache_health_services_fetch_error 11
# HELP consul_consul_cache_health_services_fetch_success consul_consul_cache_health_services_fetch_success
# TYPE consul_consul_cache_health_services_fetch_success counter
consul_consul_cache_health_services_fetch_success 18663
# HELP consul_consul_cache_health_services_hit consul_consul_cache_health_services_hit
# TYPE consul_consul_cache_health_services_hit counter
consul_consul_cache_health_services_hit 254
# HELP consul_consul_cache_health_services_miss_block consul_consul_cache_health_services_miss_block
# TYPE consul_consul_cache_health_services_miss_block counter
consul_consul_cache_health_services_miss_block 27115
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 54528
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 65536

We are mostly using /v1/health/service/ for cached entries, there are a few errors, but not that much, we continue investigating this issue, I'll post updates there, but I really suspect a big cache bug somewhere

y3llowcake commented 3 years ago

I am also seeing this behavior in production, unable to reproduce, the issue is sporadic, restarting the agent seems to be the only solution so far.

pierresouchay commented 3 years ago

@y3llowcake Interesting, do you have this for Windows as well?

y3llowcake commented 3 years ago

No, this was on Linux. I actually have a debugger attached to the agent right now to try and figure out what is going on, but I am reaching the end of my rope.

pierresouchay commented 3 years ago

it happens only on 1 service or several?

y3llowcake commented 3 years ago

I only noticed it happening for one service. New datapoint: after detaching the [dlv] debugger the issue cleared up. Maybe I forced it to timeout?

pierresouchay commented 3 years ago

yes, we have seen something similar, does not happen very often, no lead for now (while I looked the code quite a lot)

y3llowcake commented 3 years ago

Other things I am relatively sure of, not sure if it is similar for you Pierre:

MorphBonehunter commented 3 years ago

After upgrading my consul servers/agents today to 1.9.0 i realized that i lost a lot of services in my traefik environment which reads the consul catalog through an local agent. After finding an comment in one of the traefik issues (https://github.com/traefik/traefik/issues/7591#issuecomment-734275821) i give this a try and disable the cache directive and it works like before.

It doesn't seem that an restart of the agent resolves the problem for me completely because after restarting the agent i realize that there still services missing which where available before the restart (and still available on other instances). So maybe the agent fetches an incorrect service list from the server? Is this imaginable?