hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io
Other
28.24k stars 4.41k forks source link

Blocking calls on /v1/health/service/<service> never block, resulting in request flooding #1244

Closed agemooij closed 7 years ago

agemooij commented 8 years ago

We are trying to watch for changes in the healthy nodes for a certain service in order to adjust the configuration of our API gateway. We do this in a way very similar to consul-template, i.e. by doing a blocking call on /v1/health/service/<service>. Here's a typical request:

GET /v1/health/service/blaze-canary-service?index=3232540&wait=300s&passing

The problem is that it seems these calls never actually block since the X-Consul-Index is different for every single request. This results in these watches flooding our servers with continuous requests since we expect them to only return if something has changed or when the requests time out. We have a lot of services so the resulting load on our servers is enormous.

Our services are not flapping on their health and the actual health status in the responses is identical so we don't see the reason for these indexes to be different on every request.

It appears from our initial testing that consul watch and consul-template are running into the same problem, always returning immediately even if nothing has changed.

Is this a bug in the way Consul determines the index of the v1/health/service/<service> endpoint or are we making incorrect assumptions?

sethvargo commented 8 years ago

Hi @agemooij

What is the motivation for selecting such a large index? Consul is going to keep returning until it hits that index. Your first request should always use index 0. Then, the response will include a response index that you add to the next query. You can see the Go-version of this here: https://github.com/hashicorp/consul-template/blob/a020700c8ed8ea70450460f0eedef21197738ef2/watch/view.go#L120-L170.

You need to add some logic to check the return values, specifically the stale and last index values and then store those results for comparing in the next request.

What version of Consul are you running? Can you restart your cluster with debug logging and send us a gist?

agemooij commented 8 years ago

The index in that example request is taken from the X-Consul-Index of the previous response. The fact that that index is very high might be a hint to a root cause. But what I can see is that when I do a simple non-blocking request to the health endpoint of one of our services and I do that 5 times in a row, I get 5 different values for X-Consul-Index (see the output I pasted below)

So the question is therefor why that index is incremented so often. Can this be caused by incorrect usage or incorrect configuration?

Example output:

> http -h mesos-master:8500/v1/health/service/blaze-iplocation-service
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 1092
Content-Type: application/json
Date: Wed, 16 Sep 2015 11:18:32 GMT
X-Consul-Index: 3253526
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

> http -h mesos-master:8500/v1/health/service/blaze-iplocation-service
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 1092
Content-Type: application/json
Date: Wed, 16 Sep 2015 11:18:34 GMT
X-Consul-Index: 3253531
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

> http -h mesos-master:8500/v1/health/service/blaze-iplocation-service
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 1092
Content-Type: application/json
Date: Wed, 16 Sep 2015 11:18:35 GMT
X-Consul-Index: 3253536
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

> http -h mesos-master:8500/v1/health/service/blaze-iplocation-service
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 1092
Content-Type: application/json
Date: Wed, 16 Sep 2015 11:18:37 GMT
X-Consul-Index: 3253539
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

> http -h mesos-master:8500/v1/health/service/blaze-iplocation-service
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 1092
Content-Type: application/json
Date: Wed, 16 Sep 2015 11:18:38 GMT
X-Consul-Index: 3253540
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0
agemooij commented 8 years ago

@sethvargo just to be clear, what we do for all requests except for the first one is to set the index query param to the value of the X-Consul-Index of the previously received response, indicating that we are interested in receiving a response only if something has changed since that specified index. AFAIK this is straight from the Consul Http API docs.

agemooij commented 8 years ago

We are using Consul 0.5.2. We can restart the cluster in debug mode but it might have to wait until tomorrow because our mutual acquaintance Bas Tichelaar just left for a meeting and he's the one with access.

agemooij commented 8 years ago

I just found this mailing list discussion: https://groups.google.com/forum/#!topic/consul-tool/EP10Wdb1QKQ

It seems to indicate health checks can cause idempotent index increments, i.e nothing has changed but the index does increment:

In practice, the biggest cause of writes is health checks that change output on each run.

I've compared the output of the sample calls I've posted above and the output is always the same for that service while the index keeps incrementing. I would expect the index for a certain service/endpoint not to be influenced by writes to other, unrelated endpoints (i.e. health checks for other services or unrelated kv writes, etc.). Is that indeed how it works? If not, what is the scope for these indices?

Are there other common causes for idempotent writes that might be causing our indices to be incremented so often?

agemooij commented 8 years ago

I did some more tests and I found that sometimes the output of a health check is indeed "different" but those differences are purely in the ordering of objects in JSON arrays, i.e. the order of the nodes in the response is sometimes different. Would that cause the index for that service to be incremented?

If so, that would explain the behaviour that we are seeing, although I would argue that in that case it would also mean that the Consul change detection system is broken by the way Consul itself generates JSON (i.e. with non-deterministic JSON array sorting).

slackpad commented 8 years ago

Hi @agemooij - the Consul health checker isn't aware of any format for the check outputs so it does a simple string compare to see if they have changed; this would definitely cause write and index churn even if the health status is stable. Go intentionally randomizes map iteration order which I think causes the non-determinism in our JSON output.

If you can filter your health check outputs to make them deterministic before Consul sees them, then they should be truly idempotent:

https://github.com/hashicorp/consul/blob/master/command/agent/local.go#L241-L244

agemooij commented 8 years ago

We are returning JSON arrays in our health check outputs so this might very well be the cause of our index churning. Il'll investigate immediately and report back.

If this is really the cause of our problems it might be good to add some kind of "health check best practices" note to the Consul docs to prevent other people making the same mistake. If such a section already exists, consider me eating humble pie and shamefully sneaking off to RTFM :smile:

agemooij commented 8 years ago

Hi @slackpad and @sethvargo, we ran some tests and this is the current situation:

The big question is therefor: what could be triggering index increments for those two endpoints for every single service we have running? Especially if we can verify that their individual health checks do in fact return idempotent responses.

Just to be sure, does Consul guarantee that each individual service has its own index and that changes to one service don't trigger an index increment to another service?

bastichelaar commented 8 years ago

Snippet from the consul monitor -log-level=trace output of the Consul leader: http://pastebin.com/sqUWtYix

slackpad commented 8 years ago

Hi guys - it looks like the ids are not scoped per-service for the health checks, it tracks one for the table, so if any check changes it would wake up your watches (even services whose check output doesn't change). If there's one check some place that's non-deterministic then it could be causing the wakeup problem you are seeing. Looking in the agent code, there's a feature that only sends the output every 5 minutes if the service status hasn't changed to throttle these down, but if you had a ton of nodes then this could still be pretty frequent.

Alternatively, if something was registering and unregistering a service over and over again (one that had checks associated with it) I think this would ratchet up checks and services as well.

Do any of these sound like they could be what you are seeing?

agemooij commented 8 years ago

We might very well have one service with a flaky health check and that is certainly something we are going to investigate ASAP. AFAIK we do not have any flapping services since this is something we monitor pretty well; but we will double-check to be sure.

But can I ask the obvious question?

Hi guys - it looks like the ids are not scoped per-service for the health checks, it tracks one for the table, so if any check changes it would wake up your watches (even services whose check output doesn't change).

This is extremely surprising and from your language (i.e. "it looks like") it seems you didn't expect this either. Is this as-designed or did we just find a major bug in Consul?

What do you mean with "the table"? Which exact endpoints share the same index? From our output it looks like at least all /v1/health/service/<service> and /v1/catalog/service/<service> endpoints share the same index. The usefulness of watches, especially through blocking http calls with reconnect-on-timeout behavior, is pretty much reduced to zero if this is true.

The docs definitely imply that indexes are per-resource/service instead of shared globally. For example look at the "Blocking Queries" section of the HTTP AP docs where it says:

Any endpoint that supports blocking will also set the HTTP header X-Consul-Index, a unique identifier representing the current state of the requested resource.

Are the docs wrong or is this a bug?

slackpad commented 8 years ago

Hi @agemooij it's complicated, that's why I had to go look in the code :-)

Each entity (node, service, check, kv) has a separate table, and Consul tracks an index per table, so there's no inherent coupling between these tables. If you delete a node that has a service and a check, though, it will bump all three of those indexes because it hit all three tables. For the KV table, we store extra index information associated with each key so that watches can be smarter about determining if a write to the table (which always bumps up the table's index) applied to the key or subtree you were watching. This hasn't been implemented for all the other entity types yet, though we intend to do that soon, so any check update will wake up all the watches on health checks right now.

I traced through the path where the agents push check updates to the server and it's through the Catalog.Register RPC call, which will update the service and check tables at the same time (even if the service part of the definition didn't change) so it makes sense that those indexes are going up as well.

This definitely looks like there's a check with an output that's non-deterministic spamming up your Raft log and waking up your watches. We are actively working on improving the granularity of these queries so you won't get this behavior (only the churn-ey service would be waking). Sorry for the confusion on this! Also let me know if you don't find the culprit or think it's something else and I can help you track it down.

I'll leave this open as a reminder to improve the docs. We should definitely add some notes about how output is handled.

agemooij commented 8 years ago

Thanks @slackpad for the update. A fix to make the granularity of updates match with expectations and documentation would be very much appreciated.

We are actively looking for all sources of non-identical health check responses and we have already found some obvious sources; mostly of health checks running against software we don't control, like Cassandra and Prometheus. Those services have timestamped health check responses. In my experience this is quite common behavior so finding out that Consul is not compatible with such situations is troubling.

I think a lot of people using consul-template or other forms of watches might be running into this same problem, probably without even noticing except for some mysteriously high loads on their Consul clusters.

It seems some subtler, more configurable behavior is probably called for here. Perhaps some way of either ignoring or semi-interpreting the health check responses?

bastichelaar commented 8 years ago

We managed to find the cause of this issue. First, we made sure that all health check responses are idempotent, but the index was still increasing. So we started removing containers until we found the issue: registrator.

Registrator has a -resync option, which was enabled in our setup. We enabled this option for better recovery after restarts of cluster nodes. When registrator resyncs, it re-registers all services, which of course causes multiple writes in Consul. This triggered all our watches, so the index increased pretty fast.

Having an index per service would not have solved this, but still makes sense from our perspective. A change in service A should not cause triggering a watch on service B. We would still like to know if and when this behavior will be changed in Consul.

I will also create an issue for registrator, and will contribute this to the documentation of registrator. Anyone who is using the -resync option will probably have this issue, maybe even unnoticed :smile:

slackpad commented 8 years ago

Thanks for the updates and I appreciate the ping on registrator.

@agemooij - I'd hesitate to make Consul aware of the format of any output; wrapping a check in a script that preps and filters is probably better for keeping Consul complexity low. We can make the documentation way better, though, and it might be worth a periodic warning if a healthy service is churning its output so it's more obvious what's going on (and easy to find and fix without having to run through all your checks manually).

mbroome commented 8 years ago

So are you saying the tables are shared across say everything that is a service or is it just everything of a given service on a given node. Like having multiple checks on the same service can trigger a change for the service's index but a change to a different service on the same machine will not trigger a change for all services on the machine?

Right now, I have a service that basically cats a file on a all of the machines in the cluster. The file's contents do not change but the index number is incrementing. Is this intended or is it some other service (some containing timestamps in thier output) that is causing it to increment?

agemooij commented 8 years ago

@mbroome as far as I understood the granularity of writes is cross-service at the moment. I'm not exactly sure where the granularity borders of these "tables" are but in practice for us this means that any write for service X on any Consul endpoint (catalog, health, etc.) will also trigger all watches for all other services on all other endpoints.

@slackpad It would be great if we could get a better insight of the current granularity of writes/watches. An initial high-level generalization would already be very helpful.

agemooij commented 8 years ago

@slackpad I agree that interpreting health check responses would be the last possible choice I expect you guys to make. I simply threw it into the discussion as an extreme alternative.

If the granularity of writes would increase, the side-effects of a chatty/flappy service would be greatly reduced. Next to that I agree that "writes per minute" or something similar would be a very helpful metric to have for a service so you can quickly debug problems like this.

slackpad commented 8 years ago

@mbroome Yeah @agemooij summarized it correctly "any write for service X on any Consul endpoint (catalog, health, etc.) will also trigger all watches for all other services". It doesn't sound like your cat check would cause churn, but some other check causing churn could affect that one.

@agemooij We will see what we can do to make this better - improved granularity of watches plus some better visibility for churn.

mbroome commented 8 years ago

So is there any plans to split the indexes for services up so one service does not effect another? Right now, I'm getting so much churn on indexes, it's not really practical to setup a watch for something that is not changing often. Even if I filter the output of healthchecks to cleanup things like timestamps, the next time a dev produces some content that I'm not aware of, everything will start to get reconfigured constantly even if it's not related to the service the healthcheck was for.

In it's current implementation, I can't really use watches.

slackpad commented 8 years ago

@mbroome - this is on our list. I'm not sure of the exact time frame but hopefully shortly after Consul 0.6 ships; we are trying for near the end of the month with that.

agemooij commented 8 years ago

@mbroome for us watches do still "work" but only because both consul-template and out own watches always compare the new response to the old response and only trigger reconfiguration when something has actual changed. The main impact we are seeing is the load produced by all these watches constantly being triggered and then resubscribing. This is not practical unless you have an extremely stable environment that hardly ever changes.

@slackpad Great news! Glad to hear it is on your list. We are talking to some teams at other companies we know about the impact of this issue on their systems. We expect that a lot more people will have made the assumption that watches will only be triggered by writes to the service they are watching and that the impact of this assumption has so far been relatively low for most teams only because they either have more stable clusters or because the load/request spikes have not been spotted or correctly identified. We only found out ourselves because its turns out our cluster was extremely chatty due to registrator so the load spikes were big enough to trigger us to dive into it.

rytutis commented 8 years ago

Hi. Any news on this? Still see it in 0.6.1. We'd like to watch select services for changes, but using health endpoint modification to any service releases all watches.

slackpad commented 8 years ago

Hi @rytutis we are still tracking this and hoping to get it into Consul 0.7.

rytutis commented 8 years ago

Thanks

mrwilby commented 8 years ago

Is this still on track for 0.7 ? Any rough idea when that will be? Q2 or H2 ?

slackpad commented 8 years ago

@mrwilby this may not make it into 0.7 given our current scope, but I think Q2 seems pretty likely. Will update this when I have a better estimate.

bmonkman commented 8 years ago

Joining the dogpile on this one. This is particularly problematic when combined with consul-template. We are seeing long quiescence periods in environments with a lot of churn that would be helped significantly if these indices were actually per-resource rather than shared.

Loushu commented 8 years ago

We've been seeing the same issue. One observation we've made is that an always failing health check will cause index churn. I would've expected for index to increment only if service's health state transitions, not when it's constantly failing. Is this behavior by design?

jhmartin commented 8 years ago

@Loushu when you see that, is the output of the failing check changing run to run?

jhmartin commented 8 years ago

Having any service anywhere changing state causing all watchers everywhere to cycle seems like a significant impediment to large scale, I am anxious to see improved granularity to them.

Loushu commented 8 years ago

@jhmartin : We've tracked down index churn to 2 distinct causes. First one is related to failed health checks for non-running services on Windows, where the text message of failed check would vary between 2 messages. One is a timeout message and the second one is connection refused message. Second cause of index churn is more interesting. In our setup we use ACL tokens for service registration, which are passed with registration request as a query string parameter. We also have configured consul to deny registrations without ACL token by default. What seems to be happening in this case is when agents do a periodic entropy sync, they try to read the catalog and the read request gets denied access to all services except 'consul' service, and the agent writes to catalog, because services returned from catalog and its state don't match. Each agent does this once a minute, and given enough agents this causes pretty frequent index churn. The workaround was to include a token with "read" access to all services in each agent's configuration.

rohitjun commented 7 years ago

@slackpad any updates on delivery? version ? The problem exists on 0.7 but less often than 0.6.4. I wonder what has been changed.

ececbv commented 7 years ago

Hi, back to the question above. When is this granularity to be provided?

kamaradclimber commented 7 years ago

Same problem for us. We can frequently have a look at what users put as health checks but don't control it at the source. Having indexes per service would definitely help (or consul filtering any empty update before unblocking queries of course).