
[Performance on large clusters] Performance degrades on health blocking queries to more than 682 instances #4984

Closed ShimmerGlass closed 4 years ago

ShimmerGlass commented 5 years ago

Overview of the Issue

We had two incidents with high load on Consul, causing high write and read latency as well as a full Consul outage. After investigation we noticed that blocking queries against our two biggest services were hitting the hardcoded watchLimit value in https://github.com/hashicorp/consul/blob/master/agent/consul/state/state_store.go#L62. watchLimit is currently set to 2048, which caps the number of instances a watched service can have at roughly 2048 / 3 ā‰ˆ 682, since 3 channels are added per instance in this loop: https://github.com/hashicorp/consul/blob/master/agent/consul/state/catalog.go#L1722.

/v1/health/service/:service_name adds 3 watches per instance so the limit is ~682. /v1/catalog/service/:service_name only adds 1 watch per instance so its limit is 2048.
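To make the arithmetic concrete, here is a small self-contained toy model (not Consul's actual code; the `watchSet`/`addWithLimit` names are simplified stand-ins for memdb's WatchSet.AddWithLimit) of how falling back past the limit caps /health at roughly 682 instances:

```go
package main

import "fmt"

// Toy model of the AddWithLimit behaviour: fine-grained channels are tracked
// individually up to the limit; past it, the set falls back to a single
// coarse channel that fires on any change to the whole table.
type watchSet map[<-chan struct{}]struct{}

func (ws watchSet) addWithLimit(limit int, ch, fallbackCh <-chan struct{}) {
	if len(ws) < limit {
		ws[ch] = struct{}{} // fine-grained: fires only when this record changes
		return
	}
	ws[fallbackCh] = struct{}{} // coarse: fires on any change in the table
}

func main() {
	const watchLimit = 2048
	ws := make(watchSet)
	fallback := make(chan struct{})
	// /v1/health/service adds ~3 channels per instance (node record,
	// node-level checks, service-level checks); /v1/catalog/service adds 1.
	for i := 0; i < 700*3; i++ {
		ws.addWithLimit(watchLimit, make(chan struct{}), fallback)
	}
	fmt.Println("tracked channels:", len(ws))                      // 2049: 2048 fine-grained + 1 coarse fallback
	fmt.Println("health fallback at ~", watchLimit/3, "instances") // ~682
}
```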

We confirmed this theory by cutting one of these big services down to 681 instances. This greatly reduced the load on the Consul servers and restored the DC to normal:

[screenshot]

We later rolled out a version of Consul with this limit raised to 8192. This showed a big improvement in both latency and load, while roughly halving memory usage. We then scaled the service we had cut down back up to its original 711 instances with minimal impact on the cluster:

[screenshot]

Here is a profile run during this incident:

[screenshot]

The hot path is in blocking query resolution. With the coarser watch, Store.ServiceNodes() is called on virtually every change happening in the cluster, generating the load seen above.

Based on this information, even slightly crossing this limit greatly degrades Consul server performance (our biggest service has 1780 instances).

We will soon provide a PR that makes this limit configurable.

Reproduction Steps

Register a service with more than 682 instances. Generate some load and run /health blocking queries against this service from many clients, in our case around 2,000.

Consul info for both Client and Server

Server and client v1.2.3 with patches.

Operating system and Environment details

Windows and Linux.

pierresouchay commented 5 years ago

This broke 2 of our datacenters before Black Friday :-)

We will propose the following PRs:

  1. allow increasing the 2048 limit, document it, and display a warning message when the limit is reached (every x watches to avoid polluting the logs)
  2. allow adding a configurable delay in the watch loop to mitigate the issue when this limit is reached
  3. another PR, later, to get rid of the limit completely without spawning too many goroutines (still some work to do on this one)

Here is the full load during our less severe incident (the outage was worse in the other DC, but we lost that DC's metrics):

[screenshot: 2018-11-21 17:54]

Zoom on the time when we applied the patch:

[screenshot: 2018-11-21 19:03]

The drop is when we applied the patch; the last two red lines are when we switched all features back on (meaning the DC is fully operational).

Timeline:

The patch had the same effect on both DCs.

@Aestek will provide the PRs

banks commented 5 years ago

allow adding a configurable delay in the watch loop to mitigate the issue when this limit is reached

We talked about this before in a few cases. In some it might be OK, but in general it seems like an ugly workaround for how watches currently behave. We do this internally for cross-DC watches, for example, but I'd like to phrase it as a "rate limit" rather than a delay, which affects everything on error, delays messages on the happy path, and violates assumptions about how blocking normally works. E.g. having users configure a maximum rate of 1 per second or 1 per 10 seconds and enforcing that seems better than letting them insert sleeps and reason about the trade-off between processing time, sleeping time, effective rate, etc.
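To illustrate the "enforce a maximum rate instead of a fixed sleep" idea, a minimal hypothetical sketch (not anything in Consul; it just assumes golang.org/x/time/rate) might look like:

```go
package main

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

// watchLoop fires notify as soon as a change arrives, but never more often
// than the configured rate, instead of sleeping a fixed delay every iteration.
func watchLoop(ctx context.Context, changes <-chan struct{}, notify func()) {
	// Example policy: at most one notification every 10 seconds (made-up value).
	limiter := rate.NewLimiter(rate.Every(10*time.Second), 1)
	for {
		select {
		case <-ctx.Done():
			return
		case <-changes:
			// Wait blocks only when we are over the configured rate; on the
			// happy path (infrequent changes) it returns immediately.
			if err := limiter.Wait(ctx); err != nil {
				return
			}
			notify()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	changes := make(chan struct{}, 1)
	changes <- struct{}{}
	watchLoop(ctx, changes, func() { println("notified") })
}
```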

another PR, later, to get rid of the limit completely without spawning too many goroutines (still some work to do on this one)

That could be interesting but can you please outline your proposal/research in an issue first? It seems hard to see how that can be achieved without major changes to how memdb works. Changing behaviour to just not use limited watches may actually make sense in some cases but without a lot of research about the tradeoffs it's hard to reason about the negative consequences well.

We have medium term plans to build a fundamentally different and much more efficient mechanism for watching changes that will completely remove the need for any of that work, so it's even harder to justify merging major changes to watching ahead of that to "fix" this specific case too.

pierresouchay commented 5 years ago

@banks https://github.com/hashicorp/consul/pull/4986 is probably good enough for now with the current design; at least it allows tuning so that configurations like ours work without breaking.

The patch is highly effective and completely changed our level of service, but it is probably not enough for the long term. We discussed a lot today with @Aestek about a patch to significantly improve the watch mechanism, and I think we found a solution to get very strong performance without needing changes such as https://github.com/hashicorp/go-memdb/issues/51.

We will probably have something to show by the middle of next week, but basically we will have one single goroutine per watch client and will parse all the instances only once per watch, which should result in a huge drop in CPU and memory usage for those operations.
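As a toy illustration of the "parse once, share with every blocked client" idea (not the actual code from this work; names and types are made up):

```go
package main

import (
	"fmt"
	"sync"
)

// sharedWatch caches the last computed result for a watched service so the
// m blocked clients wait on the shared copy instead of each re-parsing all
// instances themselves.
type sharedWatch struct {
	mu     sync.Mutex
	cond   *sync.Cond
	index  uint64 // raft index of the cached result
	result string // placeholder for the parsed health result
}

func newSharedWatch() *sharedWatch {
	sw := &sharedWatch{}
	sw.cond = sync.NewCond(&sw.mu)
	return sw
}

// publish is called by the single per-watch goroutine after it parses the
// service's instances once per change.
func (sw *sharedWatch) publish(index uint64, result string) {
	sw.mu.Lock()
	sw.index, sw.result = index, result
	sw.mu.Unlock()
	sw.cond.Broadcast()
}

// block is what each blocking client calls: it returns as soon as the shared
// result is newer than the client's index, with no per-client parsing.
func (sw *sharedWatch) block(minIndex uint64) (uint64, string) {
	sw.mu.Lock()
	defer sw.mu.Unlock()
	for sw.index <= minIndex {
		sw.cond.Wait()
	}
	return sw.index, sw.result
}

func main() {
	sw := newSharedWatch()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ { // 3 "clients" blocked at index 10
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			idx, res := sw.block(10)
			fmt.Println("client", id, "got", res, "at index", idx)
		}(i)
	}
	sw.publish(11, "service healthy") // one parse, everyone wakes up
	wg.Wait()
}
```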

Some results after 10 days in prod

Beware, all the graphs are logarithmic! For many metrics we see a 50x improvement (especially service registration and stale reads).

Here are logarithmic graphs since we applied this patch to our cluster; all performance metrics improved by an order of magnitude (write latency especially):

FSM latency:

[screenshot: 2018-12-01 00:01]

Read latency (the low part between the two blue lines is when we deactivated lots of services on this DC):

[screenshot: 2018-12-01 00:09]
pierresouchay commented 5 years ago

@banks As discussed, there is room for further optimization; you might have a look at https://github.com/hashicorp/consul/pull/5050, which should eventually allow removing any kind of watch limit.

banks commented 5 years ago

@Aestek @pierresouchay this is really cool! Thanks for looking into this.

I want to talk through the three or so different directions discussed here before we start merging changes.

Streaming Proposal

Firstly, as I mentioned, we have a medium-term goal to build a new mechanism I'm currently calling "streaming" that optimises a lot of watches. Not just this problem with goroutines but also the bandwidth problem of sending the whole response on every change or timeout.

That said, it could be some way out, and even when we build it we will still need to support the current blocking semantics from an API point of view. The current idea I have is that client agents will still expose the exact same blocking API to HTTP clients, but the internals and servers will use the new streaming interface, so we'll be able to remove all of this blocking machinery without changes to memdb, get better bandwidth, and not break any external API.

This feels like the right long term solution and we intend to get to it in O(months) so we need to balance the value of having some big performance gains now vs refactoring the internals twice in quick succession.

MemDB fix

The PR you made in https://github.com/hashicorp/go-memdb/issues/51 makes a ton of sense to me. If we weren't planning a totally alternate thing that didn't need it I'd be on board with getting that merged and refactoring Consul to use it for watches. It's not obvious to me yet what the cost of that is though. Have you thought about how much of Consul blocking would need to change to support that?

Even if cost was high I'd be tempted to do that just because it's a much better approach overall, if we didn't have another even bigger refactor planned!

Shared Blocking #5050

This is a really smart idea. It's tempting to just merge it because it is lower cost and gets a lot of win, but I think it would need at least a bunch more effort and testing to be convinced it's solid - changing something so deep in Consul is never super trivial even when the idea is simple enough.

I'm on the fence though just because:

  1. I think there are likely subtleties that will need teasing out in that PR, so it's not zero effort from our perspective or yours, and that needs to be traded against it maybe being replaced in a couple of months
  2. It's not clear to me yet how much more (if any) work is involved with the memdb change above. If you have any input on that I'm interested to hear it. If it's only marginally more work to do that then it seems to make more sense to me.
  3. If for some reason the Streaming thing doesn't happen or turns out not to work so well, I think I'd rather fallback on the MemDB approach above for the longer term.

Making Limit Configurable #4986

I'm inclined not to merge this just because it's kicking the can down the road and in practice not many people seem to need it. A new config option has to stay forever: even when it's not needed anymore it still can't make config parsing fail, so I prefer not to add one when it's not clear it will be used much.

If we do any of the other options it will become obsolete anyway, but it is the easiest thing by far that kinda helps you.

Summary

How do you see the pros and cons of these approaches?

Did you try out the memdb thing in Consul to see how much work it would be? (Not asking you to; just checking whether that's something you've already tried.)

I realise it's hard to comment on the relative value of this given the Streaming thing is not fully designed and has no explicit timeline.

pierresouchay commented 5 years ago

@banks Thank you, here are my remarks:

Making Limit Configurable #4986

This patch was done quickly so we could fix the state of our clusters. It is not very intrusive, and the setting could simply be ignored once it's no longer relevant (it would actually be very helpful for us to have it merged, as it would be one less patch to carry).

Streaming Proposal

Well, that's probably the nicest option, as it would limit serialization (the other performance issue we have) as well as the bandwidth between agents and servers, but I suspect it is quite hard.

Furthermore, Consul will probably have to deal with compatibility with older clients for a while, meaning this code is not going to disappear any time soon (Consul 2.x?).

MemDB fix

This was the first thing we looked at for improving all of this, but the patch would be very intrusive. Furthermore, it does not address the fact that many clients (in our case more than 2k) each watch the same data in their own goroutine. I really think this optimization can be done later and will still provide benefits, but it is far harder to get right. In any case, any optimization there will combine with all the other optimizations.

Shared Blocking #5050

Offers a nice way to deal with 2 problems:

For those reasons, I think shared blocking has to be addressed first, because it changes the complexity of watching from O(n^2) to O(2n) globally. Yes, the MemDB fix can still be applied (whatever strategy is eventually chosen), but the gains will likely be smaller and the change most probably much more intrusive.

We are currently testing this fix more thoroughly, and the changes are huge: on a single service watch (simulated with around 1k clients), the system uses half the CPU and 1 GB less memory (on a single service; we watch dozens of similar services in prod, so the gains will be really huge for large installations!), and it replies immediately (usually, in that situation, the Consul server is so slow that calling consul info takes several seconds instead of responding instantly). So the benefits are already huge. The CPU time spent by Consul is then only serialization (something we will work on soon).

Even when streaming is done, this implementation will still exist because you will have to deal with backward compatibility, so I think it is a good choice now.

In any case, having Shared Blocking #5050 merged will eventually allow dealing efficiently with clients at index (n-1) (I mean once the Streaming Proposal is implemented) and handling clients at index (n-m) (the more complex case) individually. On a working system, all the optimizations we applied (#3899, #3970, #4720) had the goal of limiting the number of updates on a healthy system, so even on a large cluster, watchers now watch for a long time. Optimizing this use case therefore makes sense (i.e. "most blocking clients already have the latest Consul index for a given service" is likely to be true).

I see the future in streaming diffs, but with the cache taking on more and more importance in Consul (most notably with Connect), more and more people will face use cases similar to ours (many clients watching a large number of services), so I think these kinds of optimizations have to be done quickly for Connect to succeed.

Kind Regards

banks commented 5 years ago

Thanks @pierresouchay great response!

I think that makes a lot of sense.

Consul will probably have to deal with compatibility with older clients for a while

Officially we don't support running different versions of client and server except for the 1 major version delta during an upgrade. In practice we have to be more pragmatic of course, but I don't think that the fallback backward compat case benefitting from an optimization is a super strong reason to get this in earlier on its own.

I think it really comes down to how soon we are likely to get streaming done vs the cost of reviewing and testing the current PRs. We need to think about that some more internally - other future plans are also dependent on the streaming stuff so it needs to happen at some point.

Meanwhile we have a mountain of PRs and issues that we need to take action on as well which is a big factor in whether we want to spend time on solving this twice in quick succession!

All that said, I still love the #5050 solution, just not clear if it makes sense to invest in that if it will be such a temporary solution.

We'll keep you posted.

banks commented 5 years ago

I thought about this a bit and I'm interested. I spent a while thinking about hashicorp/go-memdb#51 and responded there - I think there is a change we can make to MemDB and iradix that is along those lines but much simpler and less invasive than modifying the mutateCh. I detailed it on that issue.

I also think there is a hybrid solution that offers us the same (or almost the same) benefits as #5050 but would be less work to port code too I think.

The idea is to have a package in Consul (for now; see the memdb issue for more) that implements its own WatchSet with a compatible interface. The only differences in usage are that you have to pass a pointer to a shared WatchPool when you create the WatchSet (this would live in the server struct) and you have to defer a Close() to clean up subscriptions. That's a change that can be find/replaced pretty quickly in Consul with minimal risk and would be all that's needed to port an endpoint.

https://gist.github.com/banks/1196bbddbd5fe561f2aef71aec6a40c5
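The gist above is the canonical description; purely as a hypothetical sketch of the interface shape being described (names and details are assumed, not taken from the gist):

```go
package watch

import "time"

// WatchPool would hold the shared notifier goroutines and subscription
// bookkeeping; one instance would live on the server struct.
type WatchPool struct{}

// WatchSet mirrors memdb's WatchSet interface, but is created against a
// shared pool and must be Close()d to release its subscriptions.
type WatchSet struct {
	pool *WatchPool
	chs  []<-chan struct{}
}

func NewWatchSet(pool *WatchPool) *WatchSet { return &WatchSet{pool: pool} }

// Add registers a channel to be watched, same as memdb's WatchSet.Add.
func (ws *WatchSet) Add(ch <-chan struct{}) { ws.chs = append(ws.chs, ch) }

// Watch blocks until one of the added channels fires or the timeout elapses.
// A real implementation would hand ws.chs to the pool's shared goroutines;
// this stub only demonstrates the signature and always times out.
func (ws *WatchSet) Watch(timeoutCh <-chan time.Time) bool {
	<-timeoutCh
	return true // true means "timed out", mirroring memdb's convention
}

// Close removes this set's subscriptions from the pool so its goroutines can
// be reclaimed; callers defer it right after NewWatchSet.
func (ws *WatchSet) Close() {}
```

Porting an endpoint would then roughly be a matter of replacing `memdb.NewWatchSet()` with something like `ws := watch.NewWatchSet(srv.watchPool); defer ws.Close()` (hypothetical names).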

The trade for this simpler no-memdb-change version looks like this: https://www.desmos.com/calculator/px7olswgks

For more than 32 subscribers it's always fewer goroutines, eventually way fewer for large numbers of subscribers. For less than 32 subscribers it can be worse than now just because it uses 1 goroutine per node being watched so a single blocking query on 1000 nodes uses 1000 goroutines. That's not great but could be optimised some with generated code like memdb users, or reduced to 0 with the changes I proposed in hashicorp/go-memdb#51

It's technically worse than #5050 but I argue it can be improved to 0 goroutines with minimal change later and should be easier to integrate now so probably worth it.

Not fully decided and that gist is 100% untested code so may be missing something, but if anyone on this thread wants to collaborate on this idea I'm really happy to work together.

To be clear I really like #5050 and it's still an option, but I think if it worked out, this proposal could be both easier to test and easier to integrate with existing Consul while providing a future pattern we could improve even further with small changes to Memdb.

What about streaming? Well... Thinking about it, if we built this (and later the memdb change to make it zero goroutines overhead for a blocking query to watch) then a relatively small additional tweak to the existing protocol for sending blocking query responses as a diff instead would get 99% of the benefit of streaming with much smaller set of changes. So I'm interested in pursuing this to see if it gives us a cheaper way to solve blocking scaling without my full streaming plan.

banks commented 5 years ago

@pierresouchay @Aestek after trying out an implementation of my suggestion/gist above (which I got working), I realised a couple of downsides:

  1. The trade-off is pretty uncertain. For a lot of workloads there will be fewer than 32 blocking queries for at least some subset of queries, which means any wins have to be traded off against those cases running more goroutines. In fact it's worse: even if you see huge wins from 1000 clients blocking on one query, you might have 1000 other services each with only 10 watchers that are now seeing more goroutines, so it's hard to guess which is better overall.
  2. The big win I saw there as mentioned in https://github.com/hashicorp/go-memdb/issues/51 was if we could have a bulk update mechanism that meant there could be zero goroutines overhead. I even built this to test it but quickly realised it has some downsides:
    • Takes a lot of plumbing to get a WatchPool that is coupled through the Server to FSM to StateStore to MemDB to iradix. Doable, but it's the first smell that something is not so nice here.
    • It makes WatchSet used in blocking queries only work for memDB node chans (and the StateStore abandonCh) - we can't tell from outside the MemDB if a chan being watched is a memdb/iradix node or something totally other. While I don't think we do actually need it anywhere, it's pretty ugly that the abstraction is broken if anything ever tried to use the same code to also watch on other internal chans.

Instead I'm really happy to merge my idea with #5050 and produce something that should have exactly the same performance gain (in fact slightly better, as it applies to all blocking queries) but with way less code change needed.

Please let me know what you think about #5081. If it would be possible for you to try it out and confirm it performs as well as #5050 that would be awesome!

With this, the node watch limit in #4986 is still a problem, so we would either need to merge that or just disable the limit entirely. My rationale for disabling it is that making it configurable is really just kicking the can down the road - if the practical advice is "set it so high you never hit it in production" then there is no real point in having it at all. The limit is meant to bound the resources a single query can use, but if hitting it in practice causes way more server load due to more frequent updates then it doesn't seem like a good idea at all. We'll discuss this some more in the team.

pierresouchay commented 5 years ago

I think your proposal #5081 is very nice and has the advantage of applying very easily to any kind of blocking query; we are going to try it when we can.

As you said, #4986 is still an issue if only this change is applied, however. Still, I think it is less of an issue, because the real trouble was that this limit was not documented and the change of behavior was very hard to catch (it took us several days to find the issue), and #4986 "does document it" by displaying a message in the Consul logs (that would have helped us quite a lot).

As a side note, #5050 also addresses another case: the late caller (which is actually a very frequent case in a large cluster, especially when many changes are applied to the service's index close together). With #5050, all "late participants" get their result directly and don't re-parse all the watched services (except during serialization, of course). While I agree the code is then a bit more complex (though it could be factored out across the various endpoints with the same requirements), I have the feeling it also has a significant impact at our scale.

banks commented 5 years ago

@pierresouchay can I check I understand - the issue you are describing is that with #5050 the whole blocking query (i.e. traversing the memdb services and nodes and checks) only happens once too?

That is true, I'd love to see how different the performance is if you can try #5081 on your workloads. My hunch is that the serialisation of the result is probably at least as expensive as the actual execution of the memdb transaction but I could be very wrong -- measurements needed! It would certainly be worth confirming the difference is significant before we re-add that layer of complexity back over the top of #5081. That said, I think this is a good base line approach and it doesn't rule out further improvements that are more targeted.

In general, I wonder if you'd be able to help us define a good basic benchmark that at some level captures your use case? I guess I can setup 1000 clients blocking on a service with 1000 nodes and compare performances but if you could share some parameters for reasonable workload so we could collaborate on tuning with a standard way to measure that would be super useful for us to weigh up the cost vs. benefit of further optimisations.

pierresouchay commented 5 years ago

@banks (this post is co-written with @Aestek )

Yes, indeed the traversal of all results is performed only ONCE, whatever the number of connected clients. Before #5050, the complexity of the algorithm to see whether there was an actual change was m * n (m being the number of clients, n the number of components for a given service). With this change, the complexity is now just n. If the client's index is lower than the current index, the result is returned immediately as well.

The profile posted in https://github.com/hashicorp/consul/pull/5050 (https://github.com/hashicorp/consul/files/2644413/profile.zip) under load shows that:

This happens even when the watchLimit has been increased (in our clusters the limit now defaults to 8192). Of course, when the limit of 2048 is reached, the situation worsens very quickly: in that case, nearly 100% (instead of 50% with the 8192 limit) of CPU time is spent continuously checking for updates for each client of the blocking query!

When there are many clients, much CPU time is spent doing the exact same thing for each client, i.e. parsing all nodes (health checks, services, nodes), all ending with the same result. When the limit of 2048 was reached, even more CPU time was spent doing this because the clients were woken up every few milliseconds instead of every few seconds, so the traversal of the tree was done m times every n milliseconds instead of m times every n seconds. Of course having many, many goroutines did not help, but basically the performance is dominated by checking whether we need to send a result to each client.

To summarize

  1. The performance collapse is not directly linked to the number of goroutines (even if they contribute through GC and constant context switching of lightweight threads)
  2. The performance collapse is linked to "m" clients checking for changes in the health tree (service/checks/nodes)
  3. When the 2048 limit is reached, Consul servers break because the "m" clients check for changes every millisecond instead of every second. Increasing the limit means that on a large cluster a watcher is woken up only every second or so, and only when a change really occurs for this health endpoint (not when anything at all happens in the cluster).

The steps we took:

  1. Increase the max number of watches to 8192 to allow the Consul cluster to work (and for me not to get fired :-))
  2. Improve performance by removing the "m" clients checking the same thing on every change. Since very few goroutines are now used (n/32, n being the number of nodes), we can remove the watchLimit, because for a given health endpoint the number of watches now depends on its content, not on the number of clients calling /health. So all the graphs you see at https://github.com/hashicorp/consul/pull/5050#issuecomment-445300487 are the result of parsing the health blocking query result ONLY once instead of "m" times. Of course the reduction in goroutines is nice as well and also contributes to better performance, but the CPU reduction is also crucial to us (and to me #5081 does not fix that).

If I understand #5081 correctly, it is still a great change, because it can be applied automagically to reduce the number of goroutines, but it does not address the "m" clients parsing the same data over and over and returning the same result.

To me, #5081 can be applied in combination with #5050, though it will probably have far less impact there (because for this endpoint the number of goroutines is now bounded by n/32); still, it will decrease the number of goroutines for any endpoint, which is great anyway.

Is that clearer?

banks commented 5 years ago

Thanks Pierre.

Yep that is really clear.

Overall I'd love to be able to measure all the options repeatably somehow and guide the effort with (more) data. I know you've been doing that in #5050 but it's hard to collaborate when we don't have a shared environment/benchmark to compare etc.

If you are able to try a similar workload with both #5050 and #5081 (only) that would be extremely interesting.

I've got a couple of other thoughts I'll try out, and I'd love to set up a benchmark I can really try things out with "at scale" too, to be more scientific (saying I'm not being scientific so far, not you!).

pierresouchay commented 5 years ago

@banks we will upgrade all of our clusters with our patch on Monday (we currently have 1 node with our latest patch) in each DC, then we will try your solution. At the same time, we are trying to reproduce our load issues in a lab with a testing procedure; we will keep you updated.

banks commented 5 years ago

I had another new perspective on this over the weekend.

I think that beyond the WatchPool in #5081 the rest of the optimisations here are kind of hacks -- not necessarily without merit, but I'll explain my thought process.

Consul uses a purely in-memory database. All queries should be super fast. Caching outside of the DB really should not be needed.

Almost all queries in Consul are pretty close to optimal already - a handful of in-memory lookups over a fast data structure. So caching any of these is needless complexity.

In the case of health endpoint, this is one of the few and by far the worst query in terms of complexity. If you take a step back for a second and think about Consul in database terms, the reason this is expensive is because we are doing a three-table inner join. It's in memory so for normal sized services the overhead is trivial and we've never considered optimising it.

But the real issue hits when there are thousands of clients running health queries against large services with maybe O(1000) instances; now we are doing that expensive O(3N) query over and over again. The approach of #5050 and my inflight hack is to amortise that cost by caching the result.

But if you stay with the wider database perspective, rather than adding a whole other layer of cache and complexity and having to reason about late readers, cache consistency, ACL enforcement through the cache as well as the DB, etc., we should just optimize our DB structure to make health queries cheap and then not have to cache them at all.

Imagine if we added a new denormalised table to memdb that simply pre-computed the health results, just like a "materialized view" in normal DB parlance. We can ensure it's always 100% consistent because it would be updated as part of the FSM transaction. I don't even think it would be that much code, and with some thought I don't expect it to take much extra RAM either, provided we can use pointers to the canonical values for anything non-trivial.
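As a rough sketch of what such a pre-computed table could look like (hypothetical table and struct names, written against go-memdb's public API rather than Consul's actual state store):

```go
package main

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

// Hypothetical denormalized entry: the pre-joined node + service + checks
// result, stored under a made-up "service_health" table.
type serviceHealthEntry struct {
	ID      string // service instance ID, primary key
	Node    string
	Service string
	Checks  []string // flattened check statuses, for illustration only
}

var schema = &memdb.DBSchema{
	Tables: map[string]*memdb.TableSchema{
		"service_health": {
			Name: "service_health",
			Indexes: map[string]*memdb.IndexSchema{
				"id": {Name: "id", Unique: true, Indexer: &memdb.StringFieldIndex{Field: "ID"}},
			},
		},
	},
}

func main() {
	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// In Consul this write would happen inside the FSM's registration
	// transaction, alongside the normal nodes/services/checks updates, so the
	// view can never drift from the canonical tables.
	txn := db.Txn(true)
	if err := txn.Insert("service_health", &serviceHealthEntry{
		ID: "web-1", Node: "node-1", Service: "web", Checks: []string{"passing"},
	}); err != nil {
		panic(err)
	}
	txn.Commit()

	// A health read is now a single table lookup instead of a 3-table join.
	read := db.Txn(false)
	raw, err := read.First("service_health", "id", "web-1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", raw.(*serviceHealthEntry))
}
```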

If we did that, we save a lot of extra complexity, and we benefit performance across the board for every health query, not just for when there are loads of concurrent clients. A single lookup is now basically just as cheap as the cache fetch in #5050, and more scalable since MemDB allows concurrent readers without coordination vs a mutex on the cache.

What's more, supporting a more general "materialised view" mechanism would be an awesome feature that would pretty much optimise away most of the tricky and annoying parts of consul-template and similar tools - instead of composing tens or hundreds of different blocking HTTP calls, they could just register a single materialised view with everything they need and then get minimal updates streamed when they need them. This last part is a good way off and far from a firm direction statement, but the more I look at streaming and the actual problems users like you and others are hitting, the more it seems like we should optimise for that use case - each agent needs to know not just one thing but a subset of the state, and to be updated in a timely way. If we had a materialised view mechanism to maintain that efficiently, without each agent or consul-template having to figure out and maintain separate blocking loops for many different queries, plus a diff-based transport to save bandwidth and CPU, I think we'd be in a great place!

The last part is super forward looking; the point of mentioning it is just that I don't think it's wasted effort near-term to consider how to maintain a single well-understood "materialized view" in the StateStore to optimise arguably our most important and hottest code execution path!

Will think on this some more but I'd be much happier solving the performance issue in the health query than adding another layer of cache/complexity I think.

pierresouchay commented 5 years ago

@banks to me the materialized view is just a cache. As you explained yourself, it works in memory, so the cache mechanism would actually be very similar to #5050, but with the drawback of having to migrate each view every time there is a change in the underlying structure. I can hardly see how you will implement this without too much hassle, but I think it is definitely doable, as we did for the per-service index some time ago. The most complex part is probably when a field is added at retrieval time (where the service is not modified but its serialization is, as we did for weights for instance); write time could be trivial, however.

I understand the arguments regarding the locks (though I think #5050 does a decent job there), but I think you might consider using the watch set approach first and see if it is still an issue.

We did it this way in #5050 because it was easy and because we desperately needed the performance to avoid destroying our DCs, so we put everything we could find quickly into it.

We are going to A/B test this change (your watch set implementation vs. shared blocking queries) in our prod in early January and will keep you posted, but the early tests done by @Aestek seem promising.

If there is a real difference with #5050, let's see what we can do. What do you think?

banks commented 5 years ago

to me the materialized view is just a cache... so the cache mechanism would actually be very similar to

I think the biggest difference is that it's part of the state machine. As such it's deterministic, and easy to write tests and have confidence that it's always consistent and never introducing subtle ways you can fail to update some clients, or give wrong value out to clients or leak stuff to unprivileged clients.

None of the tests for it have to manage concurrency (e.g. try to simulate many clients some that are late some aren't etc) which is always really hard to test both robustly and without being brittle/timing sensitive.

The agent cache for example turned out to be way more complex than we expected (and we know caches are never easy) and a source of lots of subtle bugs especially as it's been adopted in more places. Most of the complexity there is exactly around blocking and concurrency and edge cases that emerge from that and are really hard to test.

As a general pattern I see a huge advantage in making our core state machine update the DB in a way that makes our most important query workload efficient rather than slapping another layer of caching on top of it with all those additional challenges I mentioned.

but with the drawback of having to migrate each view every time there is a change in the underlying structure

I think you are aware and saying the same thing, but a nice thing about our MemDB is that it's rebuilt from a snapshot on every restart. That means we don't have to worry about the "materialised view" being persisted and having to migrate it, since it's populated only from data that's already in the snapshot. So upgrades that change the data structure don't need to do anything - we just rebuild the view when the server starts and loads its snapshot from disk!

The most complex part is probably when a field is added at retrieval time (where the service is not modified but its serialization is, as we did for weights for instance)

I hadn't considered that, but I don't think it's a problem - in general anything we do outside the FSM can't be reflected in blocking queries since it doesn't affect the index so I think there is no risk here. The "view" is always consistent with the rest of the state machine data so anything we do outside the FSM in RPC handlers has the same properties as now (filtering, adding defaults, transforming fields etc).

I totally appreciate all the effort you and @Aestek have put into this both in #5050 and since then - I totally understand that you want that win merged ASAP, I'm just trying to think down the road whether we want to be maintaining yet another complexity on top of blocking queries which are already not super simple to work on!

I'll think about this some more over the break. I'm not ruling out #5081 and possibly the singleflight branch (although it has some of the same concerns I have with #5050 in terms of foot-guns and bad abstraction). #5050 has a lot going for it and may win your benchmarks, but my current feeling is that it changes too much and leaves too many things in a complicated, difficult-to-maintain state. I'd love to get that performance with a cleaner and easier-to-reason-about mechanism! My reluctance is that we have so much to do on Consul that we already struggle with paying down tech debt, so adding a "not the cleanest option but a big win" patch is very likely to stick around for a while.

pierresouchay commented 5 years ago

@banks Ok, that's fine

I totally appreciate all the effort you and @Aestek have put into this both in #5050 and since then - I totally understand that you want that win merged ASAP, I'm just trying to think down the road whether we want to be maintaining yet another complexity on top of blocking queries which are already not super simple to work on!

We agree on that; as soon as #5081 performs about as well as #5080 (I mean, up to ~50% less improvement is OK for us), we are fine. Actually, that's exactly what I meant: if #5081 is good enough, let's go for it, I'll be quite happy. Further optimizations can be done when they are ready.

In any case, #5081 is a very good change - much cleaner and with a wider optimization scope - so it is worthwhile. Whether we eventually need something similar to your proposal (in memdb) or a more traditional cache approach, I really do not care, as long as we can operate Consul smoothly (and for the first time in months, #5081 lets us operate far below the limits, whereas we had been hovering just below our SLA limits for months).

We will test #5081 very soon and see whether it is good enough for us, but I have the feeling that it is. If so, we will happily switch to it. Due to Christmas, we will only test it in early January, but we really plan to test it quickly.

banks commented 5 years ago

Another note: I was interested to see just how bad the performance of parseCheckServiceNodes is, since that is one of the big wins that #5050 and singleflight avoid. It's all in memory, and even with 1000 nodes I wasn't seeing where all the time goes, so maybe we can just optimize that a bit and win all around?

Well, maybe but check out this benchmark (will PR the bench later):

BenchmarkCatalogParseCheckServiceNodes10WatchSet-8             50000         35681 ns/op       13413 B/op        381 allocs/op
BenchmarkCatalogParseCheckServiceNodes10NoWatchSet-8           50000         30772 ns/op       12208 B/op        377 allocs/op
BenchmarkCatalogParseCheckServiceNodes100WatchSet-8             5000        334354 ns/op      133241 B/op       3719 allocs/op
BenchmarkCatalogParseCheckServiceNodes100NoWatchSet-8           5000        296789 ns/op      121008 B/op       3707 allocs/op
BenchmarkCatalogParseCheckServiceNodes1000WatchSet-8             500       3645767 ns/op     1314353 B/op      37080 allocs/op
BenchmarkCatalogParseCheckServiceNodes1000NoWatchSet-8           500       3290186 ns/op     1215697 B/op      37007 allocs/op

This creates a single service with 10/100/1000 instances in a state store, each instance on a unique node with 2 node-level checks and 2 instance-level checks. It then fetches all the service instances outside of the timing loop and only times the parseCheckServiceNodes calls. It does each size both with a real WatchSet and with a nil one.

With 1000 instances, each call takes over 3ms and does 37k allocations! My hypothesis was that accumulating the watch chans was the costly part doing most of the allocations, but it's not significant at all.

I haven't had time to profile yet, but I guess something in memdb (e.g. allocating iterators) is more expensive than we thought. It might be worth profiling this to see if there are ways we can make the query much more efficient in general, even without denormalising. I had other ideas last night too, like storing pointers in the records so this becomes a series of pointer traversals rather than radix iterations. These might be cheaper to maintain than a full separate table while getting a lot of the performance benefit.

Just thoughts for now, I'll look further. My benchmark is here for now, but will PR later: https://gist.github.com/banks/e2025910b3e7f56d9c898d15c6dbd860 (in agent/consul/state package).

banks commented 5 years ago

I profiled and attempted to reduce allocations by using a sync.Pool for memdb index buffers but it actually made it worse since you have to do a load of bookkeeping that causes other allocations to be able to free the buffers at the right time. It could be improved but only with major changes to memdb.

Instead I made a super rough version of storing a "materialised view" of the Checks. I didn't even add a new table, I just:

This passes all the catalog state store tests (although I'm sure it's too simple to not have any bugs etc), except the one that tests for the WatchLimit being overwhelmed (since it never is anymore).

The benchmark now looks like this:

BenchmarkCatalogParseCheckServiceNodes10WatchSet-8            500000          2774 ns/op         608 B/op          8 allocs/op
BenchmarkCatalogParseCheckServiceNodes10NoWatchSet-8          500000          2724 ns/op         608 B/op          8 allocs/op
BenchmarkCatalogParseCheckServiceNodes100WatchSet-8           300000          4621 ns/op        4290 B/op          8 allocs/op
BenchmarkCatalogParseCheckServiceNodes100NoWatchSet-8         300000          4457 ns/op        4290 B/op          8 allocs/op
BenchmarkCatalogParseCheckServiceNodes1000WatchSet-8          100000         16261 ns/op       41155 B/op          8 allocs/op
BenchmarkCatalogParseCheckServiceNodes1000NoWatchSet-8        100000         16356 ns/op       41155 B/op          8 allocs/op

For comparison, before it was:

BenchmarkCatalogParseCheckServiceNodes10WatchSet-8             50000         35681 ns/op       13413 B/op        381 allocs/op
BenchmarkCatalogParseCheckServiceNodes10NoWatchSet-8           50000         30772 ns/op       12208 B/op        377 allocs/op
BenchmarkCatalogParseCheckServiceNodes100WatchSet-8             5000        334354 ns/op      133241 B/op       3719 allocs/op
BenchmarkCatalogParseCheckServiceNodes100NoWatchSet-8           5000        296789 ns/op      121008 B/op       3707 allocs/op
BenchmarkCatalogParseCheckServiceNodes1000WatchSet-8             500       3645767 ns/op     1314353 B/op      37080 allocs/op
BenchmarkCatalogParseCheckServiceNodes1000NoWatchSet-8           500       3290186 ns/op     1215697 B/op      37007 allocs/op

So for 1000 nodes we are down from ~3ms per op to 16us. And from 36k allocations to just 8.

The update cost even for this naive implementation seems OK. I made a benchmark that adds 100 services to a single node (we only ever update service instances for a single node at a time so this is a reasonable conservative case), and then tries to re-register changing the tag each time (which pays the full cost of updating all 100 services every time).

The benchmark without the update:

BenchmarkCatalogEnsureRegistration-8       50000         26784 ns/op       14137 B/op        294 allocs/op

And with the update:

BenchmarkCatalogEnsureRegistration-8        3000        492835 ns/op      182770 B/op       5426 allocs/op

So it's not trivial, but still under half a millisecond per update for this reasonably "bad" case which seems acceptable.

Code for this WIP hack is here for now: https://github.com/hashicorp/consul/compare/x/bench-catalog?expand=1

Note I'm undecided about whether this internal pointer/cache of the CheckServiceNode is better or worse than a separate MemDB table long term. It's simpler to code up, which is why I did it this way for now to see what the performance gain might be like. The big downside is that the field in the struct may be out of sync with the rest of the struct unless it's been through a registration (structs coming from MemDB should always be OK, but wherever the struct is used outside, that field could potentially be nil or junk). We could change the services table to have a different top-level struct that stores this internally only, I guess.

ShimmerGlass commented 5 years ago

@banks This looks awesome! I'll try to bench this to see how it performs. I'm a tiny bit worried by the increased update time because we have very frequent updates in our clusters, but I'm sure the cost can be optimized away in some cases.

banks commented 5 years ago

I'm a tiny bit worried by the increase update time because we have very frequent updates in our clusters

Worth validating for sure. I think 0.5ms for a write is not a terrible worst(ish) case, but the fact that it has to be serial with all other FSM updates could be an issue for very busy clusters where these service updates are a major part of the write load. I think you'd still need hundreds or thousands of catalog changes a second to run into a real problem, though, which I don't think applies to many real, even pathological, cases I know of.

The optimisation that could be done here is to only rebuild for services that are definitely affected rather than everything on the node each time - for example, if the update is due to a service:web health check status change, then you only need to update the check pointers for that one instance instead of all the node and check pointers for every service on the node. If the node itself changes, you could update only the Node pointers and not the health checks etc. of each CheckServiceNode. I'm not yet sure the extra complexity, and the risk of bugs if our definition of what changed and what gets rebuilt drift apart, is worth it though.

A really simple optimisation, actually, is to remove the node lookup from the loop - right now in my branch we only ever call the populate* method with services that are all on the same node, so we could just pass that node in and remove O(num_services_on_node) lookups for the same node in the radix tree. The same goes for node-level checks, which can be looked up once and then merged in, saving another O(n) radix lookups.

Then we'd just need to be smarter about which service-level checks to lookup - only need to lookup ones for a service if the actual checks have changed, or the service is new.

ShimmerGlass commented 5 years ago

@banks yep that's what I was looking at (only rebuilding the things that actually changed).

Aside from that, I did some testing on your branch. Migration from master to this branch seems to work fine AFAICT. I did some quick benchmarking and I see big improvements: Consul used ~25% less CPU! I'll run the benchmarks again with extra care and post the full details.

You could maybe open a PR, even if it's an experiment, so we can discuss your changes there?

banks commented 5 years ago

I pushed some optimisations to do the same work more efficiently. They reduce allocs a decent amount, but time is dominated by the updates to the service indexes:

BenchmarkCatalogEnsureRegistration-8        5000       309207 ns/op      120244 B/op       2267 allocs/op
    422            .      290ms                 if err := tx.Insert("index", &IndexEntry{checkServiceIndexName(sn.ServiceName), idx}); err != nil { 
    423            .          .                     return fmt.Errorf("failed updating index: %s", err) 
    424            .          .                 } 

We can still optimise by not doing this for services that aren't affected, etc.

@Aestek I don't want to open a PR until I'm more sure it's the right direction - it's easier to discuss pros/cons and compare implementations in this thread than on a specific PR. I'll be working properly again next week though and hope to look at this.

banks commented 5 years ago

Update: this is still on my radar. I've ended up deep in MemDB and iradix and have a bunch of ideas how we can improve performance there significantly.

The goal there was to further improve the write time when denormalizing the data, and I believe I can get about a 2x speed improvement with changes to iradix and memdb. Those changes could also make watching a lot more efficient in the future, but they will take some work.

To come back up to this level. I'm leaning towards the following actions here:

@Aestek let me know if you do any more testing with any of these approaches. I know a few people are interested in this thread so trying to balance optimal/most maintainable solution with getting something we can ship soon without rewriting the whole world!

banks commented 5 years ago

I've thought about this a lot and have a pretty good idea about how we can get this clean enough to merge.

1. Update Perf ++

My branch added a new index for the denormalized health results, but it's not really necessary - the existing per-service index already updates any time either the service or its health checks change (there is actually a bug here I think, but that's a separate issue). Given that most of the cost in the update I built was those indexes, that's a pretty big win!

Bench vs last version on this branch:

benchmark                                old ns/op     new ns/op     delta
BenchmarkCatalogEnsureRegistration-8     234687        116646        -50.30%

benchmark                                old allocs     new allocs     delta
BenchmarkCatalogEnsureRegistration-8     1349           827            -38.70%

benchmark                                old bytes     new bytes     delta
BenchmarkCatalogEnsureRegistration-8     98384         69403         -29.46%

Bench vs without the update (current behavior):

benchmark                                old ns/op     new ns/op     delta
BenchmarkCatalogEnsureRegistration-8     26490         116646        +340.34%

benchmark                                old allocs     new allocs     delta
BenchmarkCatalogEnsureRegistration-8     276            827            +199.64%

benchmark                                old bytes     new bytes     delta
BenchmarkCatalogEnsureRegistration-8     13884         69403         +399.88%

~250us for a relatively intensive case (100 service instances on a single node) seems fine to me - it's slower than before by about 3x but that's not a huge deal.

And this is the worst case, update every result on the node on every change. In most update cases we can do better (see below).

2. Raciness and mutability

The bad news is that the code in my branch is racy. Because the denormalized results are stored directly in the ServiceNodes, the update I'm doing actually modifies objects in MemDB in place, without copying and re-inserting them as part of a transaction.

That means that not only might our updates be seen before the transaction commits (and may even persist if the transaction later aborts), but there might be readers actually reading that value right now while we write it without coordination.
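For reference, the usual way to stay transaction-safe with go-memdb is to copy the object before mutating it and re-insert the copy inside the write transaction, so readers only ever see committed values. A generic sketch (toy record type, not the branch's code):

```go
package main

import (
	memdb "github.com/hashicorp/go-memdb"
)

// Record stands in for a ServiceNode carrying a denormalized result field.
type Record struct {
	ID     string
	Cached string // denormalized/pre-joined data
}

var schema = &memdb.DBSchema{
	Tables: map[string]*memdb.TableSchema{
		"records": {
			Name: "records",
			Indexes: map[string]*memdb.IndexSchema{
				"id": {Name: "id", Unique: true, Indexer: &memdb.StringFieldIndex{Field: "ID"}},
			},
		},
	},
}

// updateCached shows the copy-on-write pattern: never mutate the object you
// read out of memdb in place; copy it, change the copy, and Insert the copy
// in the same write transaction so the change is atomic with the commit.
func updateCached(db *memdb.MemDB, id, cached string) error {
	txn := db.Txn(true)
	defer txn.Abort() // no-op once Commit has run

	raw, err := txn.First("records", "id", id)
	if err != nil || raw == nil {
		return err
	}
	updated := *raw.(*Record) // shallow copy instead of in-place mutation
	updated.Cached = cached
	if err := txn.Insert("records", &updated); err != nil {
		return err
	}
	txn.Commit()
	return nil
}

func main() {
	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}
	txn := db.Txn(true)
	if err := txn.Insert("records", &Record{ID: "web-1"}); err != nil {
		panic(err)
	}
	txn.Commit()
	if err := updateCached(db, "web-1", "precomputed health"); err != nil {
		panic(err)
	}
}
```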

3. Optimising which denormalised results are rebuilt

The branch currently rebuilds the results for all service instances on the node, which is wasteful in many cases. It would be more efficient to do it with more granularity.

Initially I didn't want to do this because if we do it at the lowest level (e.g. in ensureCheckTxn, ensureServiceTxn) we'd end up duplicating the index lookups, service index increments etc.

The good news is that we already do all the hard work, thanks to the per-service index work Pierre did - all the places that need to already look up all the services on a node and update their indexes. So in a big-O sense we are already doing an equivalent amount of work on every update anyway!

The other good news is that we could optimise that somewhat easily by passing around the partial list of results that need updating.

4. Maintainability

I thought some about adding a generic way to do this in MemDB but it would be a larger change and I'm not sure we can do it generically in MemDB as efficiently as outside.

I think the biggest worry I have about the branch I made is just trying to be sure we caught every possible case but I think some exhaustive test cases can help there.

The other big win is in the previous section - if we can push the denormalised fixing down to the lowest level where we make the actual model changes then I think the risk of forgetting a code path is much lower.

Proposal

Given the raciness above we need to come up with an alternate solution that actually performs the denormalized result updates via MemDB transaction to keep them atomic and safe.

We could still keep them embedded in ServiceNode records but that's kinda gross anyway and makes the read-modify-write a little more cumbersome as we have other data to maintain too.

So I think we should:

I think the above will end up a bit slower than my benchmarks because of the added cost of copying and re-inserting the results. We'll also need to think about whether we need to optimize that: for example, with ensureNodeTxn, ensureServiceTxn and 5 calls to ensureCheckTxn in a single EnsureRegistration, we may be able to do better than 7 index lookups and 7 insertions. For example, we could pass a map[string]structs.CheckServiceNode to each of those methods and have them just update a single mutable map of results, then write out all the changes at once at the end. The nicest part is that if we make passing that map optional, the low-level functions can just do the update directly when the map is nil, so we are sure state is always correct on every code path without doing duplicate work.
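A tiny sketch of that "optional shared map" pattern (toy types; the real structs and txn functions are of course different):

```go
package main

import "fmt"

// CheckServiceNode stands in for the denormalized result being maintained.
type CheckServiceNode struct{ Service, Status string }

// ensureCheck mimics an ensureCheckTxn-style helper: if the caller passed an
// accumulator map it only records the pending change; if the map is nil it
// "writes" directly, so every code path stays correct even without batching.
func ensureCheck(results map[string]*CheckServiceNode, svc, status string,
	writeNow func(*CheckServiceNode)) {
	csn := &CheckServiceNode{Service: svc, Status: status}
	if results == nil {
		writeNow(csn) // unbatched path: update the view immediately
		return
	}
	results[svc] = csn // batched path: caller flushes once at the end
}

func main() {
	write := func(c *CheckServiceNode) { fmt.Println("write", *c) }

	// EnsureRegistration-style caller: accumulate, then flush once.
	results := map[string]*CheckServiceNode{}
	ensureCheck(results, "web", "passing", write)
	ensureCheck(results, "web", "warning", write) // overwrites the pending entry
	ensureCheck(results, "api", "passing", write)
	for _, csn := range results {
		write(csn) // single flush: one write per affected service
	}
}
```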

Note: even though we are adding a new memdb table, we don't need to change snapshotting or add new Raft log operations since this is only an internal optimisation and the public interface for the FSM doesn't change - this makes it a lot simpler and safer - we can rip it out again later without backwards compatibility issues for example.

@Aestek is that something you'd be interested in working on? I suggest taking the catalog_bench_test.go from my catalog-bench branch as a starting point and then adding the above. Happy to give more detail on it too if you want.

I'm relatively confident if we did that (which isn't that much work) we'd end up with a huge performance win with a still fairly maintainable/well tested implementation.

banks commented 5 years ago

@pierresouchay @Aestek interested to know your thoughts on #5449. I think it solves the WatchLimit and goroutine-count issues completely with only a few lines of code changed and no new config - Pierre did all the hard work necessary already with the service indexes! I realised this was possible while considering the denormalised version above.

If you agree I'd like to get that merged ASAP, which should mean all health queries only take one goroutine to watch and never hit watchLimit, so we don't need #4986 to make it configurable for now. Technically it fixes the issue reported here.
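As a toy illustration of the idea behind that change (not the actual #5449 diff): instead of one channel per node/check, the blocking query watches just the per-service index record, which gets bumped on any registration, deregistration, or check change for that service, so the number of watches per query stays constant.

```go
package main

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

// indexEntry mimics Consul's per-service index records in the "index" table.
type indexEntry struct {
	Key   string
	Value uint64
}

var schema = &memdb.DBSchema{
	Tables: map[string]*memdb.TableSchema{
		"index": {
			Name: "index",
			Indexes: map[string]*memdb.IndexSchema{
				"id": {Name: "id", Unique: true, Indexer: &memdb.StringFieldIndex{Field: "Key"}},
			},
		},
	},
}

func main() {
	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// Every write touching service "web" bumps this one index record.
	bump := func(idx uint64) {
		txn := db.Txn(true)
		if err := txn.Insert("index", &indexEntry{Key: "service.web", Value: idx}); err != nil {
			panic(err)
		}
		txn.Commit()
	}
	bump(1)

	// The blocking query adds ONE watch channel per service, however many
	// instances or checks it has, instead of ~3 channels per instance.
	read := db.Txn(false)
	watchCh, raw, err := read.FirstWatch("index", "id", "service.web")
	if err != nil {
		panic(err)
	}
	fmt.Println("blocking at index", raw.(*indexEntry).Value)

	bump(2) // any change to the service fires the single watch channel
	<-watchCh
	fmt.Println("woken up: service.web changed")
}
```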

BUT I would still like to denormalise the result further to solve the rest of the CPU issue as outlined in my last message but this is an easy win to get in first.

ShimmerGlass commented 5 years ago

@banks One thing to note is that while your patch fixes the most pressing issue (the /health endpoint limited to 682 instances), the problem is still present on the /catalog endpoint with a limit of 2048 instances. This is still a problem for us, as some of our clients use this endpoint for very large services (3K+ instances).

We think this limit should still be configurable to allow for such use cases.

banks commented 5 years ago

@Aestek thanks for that testing. That's pretty much what I expected but great to confirm it!

It would be great if you have a chance to test the denormalised branch with your workload too, despite the raciness issues I outlined before - just to gut-check whether it is a viable alternative that would allow you to move away from the more complicated shared watch cache.

It's true that other queries can still hit the limit - I'm assuming they are not such a big deal because there isn't such heavy client usage or such a heavy cost in rebuilding results. Can you confirm that /catalog with 3k nodes is actually causing you instability when the limit is hit? If so, there is no real reason the same approach wouldn't work there: since any update to a service already goes through a single index, there is no real need to watch every single NodeService in the DB.

I'm still reluctant to make it configurable since that isn't really solving the problem for people, and I don't think many will hit it in quite the same way you have. You don't just need huge services, but also huge churn and huge client blocking volume for it to really become a major issue. It's also hard to predict how turning it up would affect other blocking queries. I'm not set against it, but actually fixing the queries to be more efficient seems like a better option that benefits everyone.

ShimmerGlass commented 5 years ago

Concerning the watch limit, I was looking for ways to cut down allocations in the Health.ServiceNodes code path and found that removing this limit lets us avoid fetching tx.Get("nodes", "id") and tx.Get("checks", "id"), leading to some significant performance improvements:

benchmark                        old ns/op     new ns/op     delta
BenchmarkCheckServiceNodes-8     3856          3255          -15.59%

benchmark                        old allocs     new allocs     delta
BenchmarkCheckServiceNodes-8     63             57             -9.52%

benchmark                        old bytes     new bytes     delta
BenchmarkCheckServiceNodes-8     1665          1473          -11.53%
Patch:

```diff
diff --git a/agent/consul/state/catalog.go b/agent/consul/state/catalog.go
index 254280c81..2d4eb0e42 100644
--- a/agent/consul/state/catalog.go
+++ b/agent/consul/state/catalog.go
@@ -2030,23 +2030,6 @@ func (s *Store) parseCheckServiceNodes(
 		return idx, nil, nil
 	}
 
-	// We don't want to track an unlimited number of nodes, so we pull a
-	// top-level watch to use as a fallback.
-	allNodes, err := tx.Get("nodes", "id")
-	if err != nil {
-		return 0, nil, fmt.Errorf("failed nodes lookup: %s", err)
-	}
-	allNodesCh := allNodes.WatchCh()
-
-	// We need a similar fallback for checks. Since services need the
-	// status of node + service-specific checks, we pull in a top-level
-	// watch over all checks.
-	allChecks, err := tx.Get("checks", "id")
-	if err != nil {
-		return 0, nil, fmt.Errorf("failed checks lookup: %s", err)
-	}
-	allChecksCh := allChecks.WatchCh()
-
 	results := make(structs.CheckServiceNodes, 0, len(services))
 	for _, sn := range services {
 		// Retrieve the node.
@@ -2054,7 +2037,7 @@ func (s *Store) parseCheckServiceNodes(
 		if err != nil {
 			return 0, nil, fmt.Errorf("failed node lookup: %s", err)
 		}
-		ws.AddWithLimit(watchLimit, watchCh, allNodesCh)
+		ws.Add(watchCh)
 
 		if n == nil {
 			return 0, nil, ErrMissingNode
@@ -2068,7 +2051,7 @@ func (s *Store) parseCheckServiceNodes(
 		if err != nil {
 			return 0, nil, err
 		}
-		ws.AddWithLimit(watchLimit, iter.WatchCh(), allChecksCh)
+		ws.Add(iter.WatchCh())
 		for check := iter.Next(); check != nil; check = iter.Next() {
 			checks = append(checks, check.(*structs.HealthCheck))
 		}
@@ -2078,7 +2061,7 @@ func (s *Store) parseCheckServiceNodes(
 		if err != nil {
 			return 0, nil, err
 		}
-		ws.AddWithLimit(watchLimit, iter.WatchCh(), allChecksCh)
+		ws.Add(iter.WatchCh())
 		for check := iter.Next(); check != nil; check = iter.Next() {
 			checks = append(checks, check.(*structs.HealthCheck))
 		}
```
ShimmerGlass commented 5 years ago

@banks We did some testing on https://github.com/hashicorp/consul/compare/x/bench-catalog?expand=1 and generated some load on the cluster. The versions compared are:

We deployed A on one of the 5 servers in the cluster and generated ~700 rps (Health.ServiceNodes) by making some service instances flap.

Version A

[graph]

Version B

[graph]

banks commented 5 years ago

@Aestek I read that as "both versions ended up with about the same long-term load average", right? That's a good sign.

Do you have graphs of actual CPU usage between the servers you could share?

Thanks for testing this out!

ShimmerGlass commented 5 years ago

@banks

Version A

[graph]

Version B

[graph]

banks commented 5 years ago

Nice! I see A slightly lower, which is great news, although it's within the noise. But it validates that this approach is at least as good as the sharing one, on top of the other robustness and maintainability improvements!

pierresouchay commented 5 years ago

@banks Shared blocking queries consume a bit less CPU on all servers, but the performance is close enough for us to use this instead of shared blocking queries.

pierresouchay commented 4 years ago

@banks I am pretty sure https://github.com/hashicorp/consul/pull/4986#issuecomment-528283462 is related as well

yiliaofan commented 4 years ago

consul version

Consul v1.4.0
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

The same problem occurred in our online production environment. Has this been resolved in a current version?

Thank you very much ~! @banks @pierresouchay @Aestek

The flame chart at the time was as follows: https://github.com/yiliaofan/consul/blob/master/22.svg

pierresouchay commented 4 years ago

@yiliaofan no, but we have a patch we have been using in production for more than a year, if you are interested.

yiliaofan commented 4 years ago

I really appreciate your work, and thank you for sharing it. I read your discussion and research on the technical details above, and I really admire the amount of work. I still need to study Consul's architecture and details, and I hope to reach your level. Thank you! šŸ‘šŸ‘šŸ‘ @pierresouchay

pierresouchay commented 4 years ago

@yiliaofan thank you! This patch https://github.com/criteo-forks/consul/commit/4a6956511b3854010c6a3608c5f928cf102581d6 should fix your issue.

You might also have a look at my colleague's article https://medium.com/criteo-labs/anatomy-of-a-bug-when-consul-has-too-much-to-deliver-for-the-big-day-4904d19a46a4 that explains why this happens.

Have a good day

bofm commented 4 years ago

Could you please clarify whether the limit of 681 service instances for a watched /health/service still applies if the service instances are watched with tags?

For example, if we have a service with 1000 instances where 500 instances have tag A and the other 500 instances have tag B and all service/health watch queries use either tag A or tag B, will we hit the limit?

pierresouchay commented 4 years ago

@bofm Starting from 1.7.0 (see https://github.com/hashicorp/consul/pull/7200), the limit is 4 times higher (8192 / 3 = 2730 instances). I'm not sure about the latest version, but in previous versions tag filtering is applied afterwards, so I would say yes (I did not check).

banks commented 4 years ago

@bofm please note that the original issue with service health queries is no longer an issue as of Consul 1.4.4. In #5449 we optimised service health watches to only ever need a constant number of watches so the 681 limit doesn't exist any more.

The increase to 8192 in 1.7.0 was only added because other users were hitting the limit in other API endpoints such as KV or /catalog/nodes.

The only reason this issue was kept open was because we were discussing additional optimizations that could be applied which eventually led to the streaming work which is still ongoing but due to be completed soon.

Since this issue doesn't make that clear and the original issue of that limit has been fixed, I'm going to close this now but the streaming work is still coming šŸ”œ!