thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Global deployment model #163

Closed (fabxc closed this issue 4 years ago)

fabxc commented 6 years ago

Current test deployments within a single cluster/region work very well. Time to give global deployment options some thought.

Generally speaking, a full setup in every region is probably desirable for reliability and data locality in almost all deployments. It also provides trivial means of sharding, which isolates failures and increases scalability.

Currently there seem to be two options:

A) Expand the gossip network globally. All store nodes, Prometheus servers, and queriers are interconnected.

Pros:

Cons:

B) Keep Thanos clusters regional or even more fine-grained, with an additional global federation layer across the smaller Thanos clusters.

Query nodes would be made aware of each other through a regular service discovery mechanism and act as federation proxies for the Store API (see the sketch after the pros/cons below).

Pros:

Cons:
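To make option B more concrete, here is a minimal Go sketch of such a federation proxy. It is an illustration only: the `StoreClient` interface and `Series` type below are hypothetical, simplified stand-ins for the real Store API (which is a gRPC service streaming `storepb` messages). The idea is simply to fan a query out to the regional stores and concatenate the results.

```go
package federation

import (
	"context"
	"fmt"
	"sync"
)

// Series is a hypothetical, simplified stand-in for the series data returned
// by the Store API.
type Series struct {
	Labels  map[string]string
	Samples []float64
}

// StoreClient is a hypothetical, minimal subset of the Store API, used here
// only to illustrate the fan-out idea of option B.
type StoreClient interface {
	Series(ctx context.Context, matchers map[string]string) ([]Series, error)
}

// Querier acts as a federation proxy: it serves the same Series call by
// fanning out to the regional stores it knows about and merging their results.
type Querier struct {
	Regions map[string]StoreClient // e.g. "eu-west" -> regional querier/store
}

func (q *Querier) Series(ctx context.Context, matchers map[string]string) ([]Series, error) {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []Series
		err error
	)
	for name, store := range q.Regions {
		wg.Add(1)
		go func(name string, store StoreClient) {
			defer wg.Done()
			res, e := store.Series(ctx, matchers)
			mu.Lock()
			defer mu.Unlock()
			if e != nil {
				// Keep only the first error; a real proxy needs a policy
				// for partial responses across regions.
				if err == nil {
					err = fmt.Errorf("region %s: %w", name, e)
				}
				return
			}
			out = append(out, res...)
		}(name, store)
	}
	wg.Wait()
	return out, err
}
```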

fabxc commented 6 years ago

I felt like it would be good to specify the federation approach in a bit more detail to make a decision here: https://docs.google.com/document/d/1-hXTQ3dSFA1yNiUrWCMFqkW84k6PicZK-9tMhTssKN0/edit

bwplotka commented 6 years ago

As discussed, there is also: B2) Thanos clusters are still regional or even more fine-grained. On top of the local queriers there is an additional global query layer across all the smaller Thanos clusters.

Local query nodes are NOT aware of other clusters or of the global node; each one simply exposes the Store API in addition. This model is closer to hierarchical Prometheus federation.

Pros:

Cons:

peterbourgon commented 6 years ago

I was asked to give some thoughts here, sorry for the delay! To me it seems both easiest and cleanest to have all nodes join the same gossip network (option A), without any sense of hierarchy or federation. Scoping queries to e.g. sites or regions is, to my mind, a query-time operation, and so can be like any other decision made by the query handlers, based on per-node metadata it's already received and cached via gossip.
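A minimal sketch of that query-time scoping idea, assuming hypothetical types for the per-node metadata a querier would already hold from gossip:

```go
package scoping

// StoreMeta is a hypothetical representation of the per-node metadata a
// querier caches from gossip (in practice, the labels each peer advertises).
type StoreMeta struct {
	Addr   string
	Labels map[string]string // e.g. {"region": "eu-west", "replica": "a"}
}

// scopeToRegion selects the subset of known stores whose metadata matches the
// requested region; an empty region means "query everything".
func scopeToRegion(all []StoreMeta, region string) []StoreMeta {
	var scoped []StoreMeta
	for _, s := range all {
		if region == "" || s.Labels["region"] == region {
			scoped = append(scoped, s)
		}
	}
	return scoped
}
```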

To me, a federation/hierarchy feels like a sort of performance optimization. Without concrete justification that it's necessary, I'd try to avoid the complexity it would introduce.

This is based on several assumptions:

bwplotka commented 6 years ago

Thank you for your input @peterbourgon! I totally agree that, with these assumptions, option A is perfect. However, at this point we are looking at federation because we are not really sure whether the two assumptions below, which you mentioned, are always true for all potential users of Thanos:

  • That it's possible and reliable to configure e.g. memberlist to work at this scale

  • Operational details like firewalled port ranges etc. aren't problematic enough to optimize for

Especially when we think of clusters in totally different geographical regions (like EU vs Asia), or of proxy-based cross-cluster communication like Istio or kedge, where cross-cluster communication requires a bit more configuration (and RTT).

mattbostock commented 6 years ago

There's maybe an option (C): use the Prometheus service discovery library to discover peers instead of using the memberlist library.

Pros:

Cons:

fabxc commented 6 years ago

Thanks @mattbostock. We have considered using Prometheus's SD package, but you are pretty much spot-on with the cons you listed.

Prometheus's SD is only really useful if there's meaningful metadata you need to extract from your service discovery information (e.g. building target labels in Prometheus). In Thanos we do have labels for Prometheus instances – but those are directly extracted from their external_labels config section. Arguably moving this critical information to a loosely connected SD would be asking for trouble for lots of users in the end.

For basic discovery, DNS is mostly fine and is in practice provided on top of any more sophisticated discovery mechanism anyway. DNS is basically allowed now through static flag-based configuration of additional data sources in the querier.

Right now Thanos is dead simple to configure, which is largely thanks to a lack of config files. Adding those with relabeling rules would change that immediately.

mattbostock commented 6 years ago

@fabxc: I agree, I think using the Prometheus service discovery library would add unnecessary complexity.

Echoing @peterbourgon's comment above, I think we should avoid an additional 'federation' layer in the interest of keeping things simple.

Store instances can be configured statically, which I think resolves most of this issue? The deployment model I'm thinking of is:

In this scenario, the store instances can be configured 'statically' (e.g. using confd) as part of the command-line options for the query nodes. Additionally, we should support cross-WAN cluster communication. I suggest opening a separate issue to track that.
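For illustration, that static configuration could look roughly like this, assuming the repeatable `--store` flag on the querier discussed later in this thread; the addresses are placeholders and the flag list would be rendered by confd (or a similar tool):

```sh
# Store addresses rendered 'statically' into the query node's flags.
thanos query \
  --store=sidecar-0.monitoring.svc:10901 \
  --store=sidecar-1.monitoring.svc:10901 \
  --store=store-gateway.monitoring.svc:10901
```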

> However, at this point we are looking at federation because we are not really sure whether the two assumptions below, which you mentioned, are always true for all potential users of Thanos.

We can always add this later when/if the use case arises.

swsnider commented 6 years ago

@fabxc The only reason I'd be pro re-using the Prometheus SD library is that we'd get things like DNS-SD discovery (i.e. SRV records) for free -- it's my understanding that the existing binary can only do A record queries ATM for DNS-based SD?

bwplotka commented 6 years ago

Yeah, broadly speaking, the setup we ended up with is similar to what you described, @mattbostock.

Each environment:

The monitoring cluster has Thanos components connected via gossip. Queriers in this cluster have statically configured scrapers from the remote clusters, which are connected through a proxy (kEdge), since there is no other connection (VPN) between them. That's why I said there are some cases where gossip is not possible to use. This configuration is fine for now, because we don't really need "automated cluster discovery", so no SD is needed.

> Additionally, we should support cross-WAN cluster communication. I suggest opening a separate issue to track that.

What do you mean by that? What exactly would you like to have for it? I have cross-WAN communication by using an external proxy service, so no Thanos change was required.

However, we do want some federated global query layer on top of all environments to allow a global view across envs. This can be done using the static --stores query flag plus my proxy in my case, and that seems to be good enough for now.

bwplotka commented 6 years ago

@swsnider For peers, the gossip flow needs an initial list of members: either raw IP:port or domain:port. In the case of the latter, a DNS lookup for all IPs is done: https://github.com/improbable-eng/thanos/blob/master/pkg/cluster/cluster.go#L355
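A simplified sketch of that resolution step (not the actual linked implementation): for each `domain:port` seed, all IPs are looked up and turned back into `ip:port` peers.

```go
package cluster

import "net"

// resolvePeers is a simplified illustration: raw IP:port seeds are kept
// as-is, while domain:port seeds are expanded into one peer per resolved IP.
func resolvePeers(seeds []string) ([]string, error) {
	var peers []string
	for _, seed := range seeds {
		host, port, err := net.SplitHostPort(seed)
		if err != nil {
			return nil, err
		}
		if ip := net.ParseIP(host); ip != nil {
			peers = append(peers, seed)
			continue
		}
		ips, err := net.LookupIP(host)
		if err != nil {
			return nil, err
		}
		for _, ip := range ips {
			peers = append(peers, net.JoinHostPort(ip.String(), port))
		}
	}
	return peers, nil
}
```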

We had SRV lookup there as well, but we found it too complicated for the value it actually provided. This has proven to be sufficient for all the use cases we had.

fabxc commented 6 years ago

Yeah, I think adding DNS-SRV would actually still be reasonable if there's a strong use case for it. It would just be a few lines rather than pulling in the massive Prometheus SD framework and all its deps.
This is only helpful for the initial discovery of peers though, much like in Prometheus Alertmanager. Arguably we are already providing a better experience for that than AM did in the past. Generally, one can always start up with a small script that pulls the initial peer info before starting the Thanos component.
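For illustration, the "few lines" could look roughly like this, using the standard library's SRV lookup (the record name passed in would be a placeholder such as `_cluster._tcp.thanos.example.internal`):

```go
package cluster

import (
	"net"
	"strconv"
)

// srvPeers resolves an SRV record and builds host:port peers from the answers,
// sketching what DNS-SRV based initial peer discovery could look like.
func srvPeers(service, proto, name string) ([]string, error) {
	_, srvs, err := net.LookupSRV(service, proto, name)
	if err != nil {
		return nil, err
	}
	var peers []string
	for _, s := range srvs {
		peers = append(peers, net.JoinHostPort(s.Target, strconv.Itoa(int(s.Port))))
	}
	return peers, nil
}
```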

mattbostock commented 6 years ago

> Additionally, we should support cross-WAN cluster communication. I suggest opening a separate issue to track that.

> What do you mean by that? What exactly would you like to have for it?

@Bplotka: By cross-WAN cluster communication, I mean the ability for cluster peers to communicate and discover each other across a WAN, using appropriate timeouts such as: https://godoc.org/github.com/hashicorp/memberlist#DefaultWANConfig

This would be an alternative to using the static --store flag when your stores are located on the other side of a WAN. I don't yet have thoughts on how that would be implemented.
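For reference, a minimal sketch of what using memberlist's WAN-tuned defaults looks like; the node name and seed address are placeholders, and this is not how Thanos wires up gossip today:

```go
package main

import (
	"log"

	"github.com/hashicorp/memberlist"
)

func main() {
	// DefaultWANConfig uses timeouts/intervals tuned for high-latency links.
	cfg := memberlist.DefaultWANConfig()
	cfg.Name = "querier-eu-west-1" // placeholder node name

	ml, err := memberlist.Create(cfg)
	if err != nil {
		log.Fatalf("create memberlist: %v", err)
	}
	// Join via a seed peer in another region (placeholder address).
	if _, err := ml.Join([]string{"thanos-peer.us-east.example.internal:7946"}); err != nil {
		log.Fatalf("join: %v", err)
	}
}
```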

> However, we do want some federated global query layer on top of all environments to allow a global view across envs. This can be done using the static --stores query flag plus my proxy in my case, and that seems to be good enough for now.

By 'we' do you mean the Thanos project, or Improbable?

bwplotka commented 6 years ago

> @Bplotka: By cross-WAN cluster communication, I mean the ability for cluster peers to communicate and discover each other across a WAN, using appropriate timeouts such as (...)

Ah, yeah, I don't see any problem with changing Thanos to allow setting WAN defaults for gossip if you wish to set up WAN gossip. Good point.

> By 'we' do you mean the Thanos project, or Improbable?

Sorry for the confusion (: I meant only the Improbable use case here.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.