Open pierresouchay opened 3 years ago
I am also seeing this behavior in production. I am unable to reproduce it: the issue is sporadic, and restarting the agent seems to be the only solution so far.
@y3llowcake Interesting, do you have this for Windows as well?
No, this was on Linux. I actually have a debugger attached to the agent right now to try and figure out what is going on, but I am reaching the end of my rope.
Does it happen on only one service, or on several?
I only noticed it happening for one service. New datapoint: after detaching the [dlv] debugger the issue cleared up. Maybe I forced it to time out?
Yes, we have seen something similar. It does not happen very often, and we have no lead for now (although I have looked at the code quite a lot).
Other things I am relatively sure of, not sure if it is similar for you Pierre:
After upgrading my Consul servers/agents to 1.9.0 today, I realized that I had lost a lot of services in my traefik environment, which reads the Consul catalog through a local agent. After finding a comment in one of the traefik issues (https://github.com/traefik/traefik/issues/7591#issuecomment-734275821), I gave it a try and disabled the cache directive, and it works like before.
A restart of the agent does not seem to resolve the problem completely for me: after restarting the agent, there are still services missing which were available before the restart (and are still available on other instances). So maybe the agent fetches an incorrect service list from the server? Is this conceivable?
Overview of the Issue
We are using a lot of cached queries for our agents.
On some Windows machines (only seen on Windows so far), we sometimes have agents that never update their cache. We see this error on a pool of 200+ identically configured machines, and it happens randomly on some of them.
In such cases, the machine can have stuck results for days.
Requests are made in the following way:
http://localhost:8500/v1/health/service/<serviceName>?cached&index=<LastIndexSeen>
With header:
Cache-Control: max-age=60
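To make the request shape above concrete, here is a minimal sketch in Python that builds (without sending) the same cached blocking query. The helper name `build_cached_health_request` and the service name `web` are assumptions made for illustration; the URL layout, the `cached` and `index` parameters, and the `Cache-Control: max-age=60` header are taken from the report.

```python
from urllib.parse import quote
from urllib.request import Request

# Local agent address, as used in the report.
AGENT_ADDR = "http://localhost:8500"


def build_cached_health_request(service_name, last_index, max_age=60):
    """Build (but do not send) a cached blocking query for a service's health.

    Hypothetical helper: it only mirrors the URL and header described above.
    """
    url = (
        f"{AGENT_ADDR}/v1/health/service/{quote(service_name, safe='')}"
        f"?cached&index={int(last_index)}"
    )
    # max-age tells the agent how old a cached result may be.
    return Request(url, headers={"Cache-Control": f"max-age={int(max_age)}"})


# "web" is a made-up service name; the index is the one quoted in this issue.
req = build_cached_health_request("web", 5285318269)
print(req.full_url)
print(req.header_items())
```

Passing `req` to `urllib.request.urlopen` would send it to the local agent; that step is left out here so the sketch runs without a Consul agent.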
We also have those configurations for all of our agents:
It happens regularly, but we have not found any way to reproduce it so far. When it happens, the cache is completely stale: the X-Consul-Index is stuck in the past (while all servers have the new value).
A call with stale on the agents gives the following value:
While with cached, it gives the following value:
The machine will keep this index (5285318269) forever, until the agent is restarted.
The difference between the two X-Consul-Index values is huge. We had this error on several different services, and it happens on roughly 1 in 200 machines (for now, only seen on Windows agents running 1.8.4, but the bug might be older).
We had this on several machines, several different services, no specific relation between those machines (boot-time...)
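The comparison described above (the X-Consul-Index returned with stale versus the one returned with cached) could be automated roughly as follows. This is only a sketch: the function names and the `max_lag` threshold are hypothetical, and the stale index used in the example is made up for illustration, since only the stuck cached index (5285318269) is quoted in this issue.

```python
def consul_index(headers):
    """Read X-Consul-Index from a header mapping, ignoring header-name case."""
    for name, value in headers.items():
        if name.lower() == "x-consul-index":
            return int(value)
    raise KeyError("X-Consul-Index header not found")


def cache_lag(stale_headers, cached_headers):
    """How far the cached index is behind the index seen with ?stale."""
    return consul_index(stale_headers) - consul_index(cached_headers)


def looks_stuck(stale_headers, cached_headers, max_lag=0):
    """Flag a cached response whose index trails the stale one by more than max_lag."""
    return cache_lag(stale_headers, cached_headers) > max_lag


# The cached index is the one from the report; the stale index is invented.
cached = {"X-Consul-Index": "5285318269"}
stale = {"x-consul-index": "5285318300"}
print(cache_lag(stale, cached), looks_stuck(stale, cached))
```

A single positive result is not conclusive, because a healthy cache may legitimately trail until it refreshes. The check is more meaningful when it stays positive over a period well beyond the configured max-age.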
Metrics of the agent containing cache are:
We are mostly using /v1/health/service/ for cached entries. There are a few errors, but not that many. We are continuing to investigate this issue and I'll post updates here, but I really suspect a big cache bug somewhere.