Consolidated watches via http/2?

anthonybishopric commented 8 years ago

We are using Consul extensively in P2 and enjoying the feature set. Our current infrastructure includes several thousand machines running over a thousand services (in staging and production modes). Each host has one or two long-lived watches against Consul for tracking deployment requests. We additionally have ~10 central master servers that maintain long-lived watches for our replication controller types (a concept we have borrowed from Kubernetes).

As our infrastructure and number of service deployments rise, we are starting to worry about the possibility of exhausting the ephemeral port space on the current Consul leader. N thousand host agent watches + M thousand replication controller watches may eventually fill this space. It also makes it problematic to add new features to P2 that require watching a new part of the subtree once per application we have.

It is our understanding that a recursive watch query against a keyspace will always return every returned value, which is not sustainable for our very large trees, hence one watch per sub-path. It would be great if we could at the very least aggregate the result of these watches into a single network connection that could be owned by one of our master servers.

The alternative to this request is to provide differential watches, ie only return the keys under a recursive query that changed, not all of them, which is something that etcd does support.

Edit: we aren't actually worried about file descriptor / port exhaustion on the Consul leader since these are all inbound connections, but we are generally worried about our master servers.

slackpad commented 8 years ago

Hi @anthonybishopric we actually multiplex all the traffic from a given Consul agent back to a Consul server over a single TCP connection (using the yamux library), so the number of ports used on your servers should scale with the number of nodes in your cluster, but not with the number of watches. Do you run the Consul agent on each of your nodes?

anthonybishopric commented 8 years ago

@slackpad thanks for the reply! To answer your questions and follow up:

No, we do not use Consul agents - just the servers. Unfortunately the gossip protocol really conflicts with our network ACL policy and we couldn't figure out a way to make them work, especially given that we wanted a single logical Consul cluster for our various datacenters. We implemented our own agent that does our deployment work without the gossip protocol and uses the HTTP API for updates and watches.

Is the multiplexing nature of Consul agents owed to the fact that they use the net/rpc package to communicate with servers? Is that API intended to be used by clients ever? If not, it seems like it might be very simple to add a GRPC wrapper for those same endpoints, with some established API contract and protos. If this is acceptable to you guys, we may have a PR coming your way.

slackpad commented 8 years ago

@anthonybishopric those underlying multiplexed interfaces are really intended to be internal to Consul so I don't think we'd expose or wrap them. I think the best way forward with this issue would be to look at improving the KV watch/blocking query capability which would benefit everyone. I'll do a little digging along that line.

slackpad commented 8 years ago

mpuncel commented 7 years ago

I'd like to bring some attention to this issue again.

Our consul deployment has a similar access pattern as described in the email list where we have numerous agents that do recursive watches on kv subtrees with around 1000 keys in them, each of which are a few kilobytes. This means a very small fraction of the data returned for each query actually represents a change we care about. Watching individual keys is difficult because we want to react to new keys being created in addition to existing keys changing.

At this point our biggest scalability concern for our consul deployment is its bandwidth usage which peaks around 1Gbps (we have around 300Mb of data in consul). We've already made some optimizations such as running a sidecar HTTP server that pulls data locally to a consul server and returns a filtered (much smaller) dataset which bought us some breathing room.

There are some optimizations we could make with the way we store data (for example, we could store pointers to data instead of the data itself to reduce how much is returned from a watch and then dereference the pointers only if they've changed) but if improvements can be made to consul additionally that would be very helpful.

slackpad commented 7 years ago

Hi @mpuncel I'll pull this forward so we can get it into an upcoming release. I think there's a pretty simple way to address this for KV to optionally send the deltas. We will need a little extra stuff to show that a key was deleted, but if we can figure out a good way to do that, the rest is a pretty simple filter.

mpuncel commented 7 years ago

Thank you for the response!

You're right that deletion is very tricky and there would have to be some kind of limit on how long deletion information is kept around I imagine.

After my last post I realized you can do watches on just the set of keys, which means we can do a single watch on a subtree for the keys and then start/stop goroutines that do individual watches on the values. We'll still be getting a lot of key turnover that we don't care about but the set of keys is much much smaller than the set of values.

mpuncel commented 7 years ago

@slackpad What if there was an equivalent to the ?keys query that gave additional metadata about each key like modify index? That would make it a lot easier to build a client that watches keys and modify index and then re-fetches the value for keys whose index changed? That sounds like it would be easier to implement

iandyh commented 7 years ago

@slackpad Hello. Is there any release schedule for the feature? It would be really helpful the blocking query on prefix only return keys that are changed.

slackpad commented 7 years ago

I think I'll close this in favor of #2791 since that's got a more concise description of what we would probably end up implementing, and this issue started in a different place.

hashicorp / consul

Consolidated watches via http/2? #1815