grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/

Allow defining preferred availability zone when querying multi-zone data #2284

Open colega opened 2 years ago

colega commented 2 years ago

Is your feature request related to a problem? Please describe.

When using multi-zone replication, the cost of cross-zone traffic is not negligible and can become prohibitive on the read path.

Describe the solution you'd like

Allow configuring the queriers to prefer ingesters and store-gateways in their own zone when performing queries.

I think we can achieve that by adding a configuration option to queriers (and to rulers when they don't use remote evaluation) that would be used when selecting the store-gateways to query first: we just need a new preferredZoneBalancing strategy that sorts the gateways list accordingly instead of shuffling it.

If the preferred zone isn't healthy, we would fall back to a different zone, running the usual load balancing.
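Roughly, a sketch of what that sorting could look like (a minimal sketch only; `Instance`, `preferZone` and the rest are placeholder names, not the actual ring API):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Instance stands in for a ring entry (store-gateway or ingester); placeholder type.
type Instance struct {
	Addr    string
	Zone    string
	Healthy bool
}

// preferZone sketches the proposed strategy: healthy instances in the querier's
// own zone come first (shuffled among themselves so load still spreads within
// the zone), everything else follows as the fallback set.
func preferZone(instances []Instance, preferredZone string, rnd *rand.Rand) []Instance {
	var local, fallback []Instance
	for _, inst := range instances {
		if inst.Healthy && inst.Zone == preferredZone {
			local = append(local, inst)
		} else {
			fallback = append(fallback, inst)
		}
	}
	rnd.Shuffle(len(local), func(i, j int) { local[i], local[j] = local[j], local[i] })
	rnd.Shuffle(len(fallback), func(i, j int) { fallback[i], fallback[j] = fallback[j], fallback[i] })
	return append(local, fallback...)
}

func main() {
	rnd := rand.New(rand.NewSource(1))
	gateways := []Instance{
		{Addr: "sg-1", Zone: "zone-a", Healthy: true},
		{Addr: "sg-2", Zone: "zone-b", Healthy: true},
		{Addr: "sg-3", Zone: "zone-c", Healthy: true},
	}
	fmt.Println(preferZone(gateways, "zone-b", rnd)) // the zone-b gateway comes first
}
```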

Describe alternatives you've considered

pracucci commented 2 years ago

I think the idea is neat. Assuming queries are well balanced across queriers in multiple zones, the impact on ingesters and store-gateways should be fairly balanced too.

> Allow configuring the queriers to prefer ingesters and store-gateways in their own zone when performing queries.

Just to be on the same page, we need to query ingesters in at least 2 zones in order to reach the quorum.
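For illustration, the arithmetic behind that (a hedged sketch; the majority rule is the standard one, and the zone mapping assumes zone-aware replication places one replica per zone):

```go
package main

import "fmt"

// readQuorum is the standard majority rule for replicated reads: with
// zone-aware replication placing one replica per zone, a replication factor
// of 3 means an ingester read needs responses from 2 different zones.
func readQuorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	fmt.Println(readQuorum(3)) // 2: one zone can be the querier's own, the other cannot
}
```

So zone preference can make at most one of the zones involved in an ingester read local; store-gateway requests don't have a per-block quorum, so those are the requests that could in principle stay fully zone-local.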

Bruno-DaSilva commented 2 years ago

Dropping a note here about the write path rather than the read path: if you run a multi-zone Kubernetes cluster, the cost of inter-zone traffic between distributors and ingesters is enormous. It dwarfs the cost of CPU/memory/storage of the Mimir cluster we were using for testing.

We ingest about 230k datapoints per second, which is about 22 MB/s incoming to the distributors. Then we see a whopping 550 MB/s incoming to the ingesters. The inter-zone networking cost for this (the Kubernetes cluster spans 3 zones) was ~$400/day, which was over half the total cost of the cluster. Clearly the alternative of "just pay for it" isn't a reasonable one here. And the only other alternative I see is to go single-zone, which isn't a great option for resiliency either.

Obviously, you go multi-zone for resiliency in the face of a zone outage. Mimir has zone awareness on the stateful components so that an entire zone can go down, and that's great! It'd be nice if zone locality could also be taken into account when distributing writes to ingester nodes.

One caveat: because the bandwidth between distributors and ingesters is so high, you'd still see a ton of cross-zone traffic even with zone awareness if replicationFactor > 1. So perhaps some sort of replication that happens on the distributor side rather than across the zone boundary might be needed (i.e. replicating the 22 MB/s of incoming traffic instead of the 550 MB/s of fan-out traffic). I'll leave that up to the maintainers to think about :)
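For a rough sense of scale, a back-of-the-envelope sketch using the numbers above (the cross-zone fraction assumes zone-aware replication with one replica per zone, one of which lands in the distributor's own zone; it ignores the smaller client-to-distributor leg):

```go
package main

import "fmt"

func main() {
	// Figures from the comment above; everything else is back-of-the-envelope.
	const (
		distributorInMBps = 22.0  // compressed traffic arriving at distributors
		ingesterInMBps    = 550.0 // observed distributor -> ingester fan-out
		zones             = 3.0
	)

	// One replica stays in the distributor's own zone, the others cross a boundary.
	crossZone := ingesterInMBps * (zones - 1) / zones
	fmt.Printf("cross-zone fan-out today: ~%.0f MB/s\n", crossZone) // ~367 MB/s

	// Replicating the compressed incoming stream instead (the idea above) would
	// send roughly one compressed copy to each of the other zones.
	fmt.Printf("cross-zone if replicating the compressed stream: ~%.0f MB/s\n",
		distributorInMBps*(zones-1)) // ~44 MB/s
}
```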

Bruno-DaSilva commented 2 years ago

For context: we're currently on VictoriaMetrics, and we were exploring Mimir as a potential cost savings. For multi-zone, we replicate the entire VM cluster 3 times across 3 zones and combine queries in case one zone is down or missing data (see: promxy, but this is now built into vmselect).

I was hoping Mimir's architecture would mean we could just replicate on the write path and then cut costs on the read path, since we don't need full 3x duplication (it can recover by deploying more replicas after a zone failure). But… it looks like that won't work because of this.

Not a huge deal for us, but just fyi that it’s blocked our potential adoption.

pstibrany commented 2 years ago

> which is about 22 MB/s incoming to the distributors. Then we see a whopping 550 MB/s incoming to the ingesters.

You can reduce this by using compression between distributors and ingesters. The problem is that the request coming into the distributors is compressed (snappy-compressed protobuf), but in the default Mimir configuration the distributor sends samples to the ingesters uncompressed, and it also replicates them 3x. This can be changed with -ingester.client.grpc-compression=snappy or gzip.
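The YAML equivalent should look roughly like this (double-check the exact block names against the configuration reference for your Mimir version):

```yaml
# Applies to every component that uses the ingester client
# (distributors on the write path, queriers/rulers on the read path).
ingester_client:
  grpc_client_config:
    # Same setting as -ingester.client.grpc-compression.
    # 'gzip' compresses better at a higher CPU cost; 'snappy' is cheaper.
    grpc_compression: snappy
```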

Bruno-DaSilva commented 2 years ago

@pstibrany this is very good to know. We will likely continue playing with this at some point, then.

cc @56quarters since we had this discussion in slack.

valyala commented 1 year ago

> For context: we're currently on VictoriaMetrics, and we were exploring Mimir as a potential cost savings. For multi-zone, we replicate the entire VM cluster 3 times across 3 zones and combine queries in case one zone is down or missing data (see: promxy, but this is now built into vmselect).

You can achieve the same level of availability and consistency with VictoriaMetrics as described in this feature request by substituting promxy with an ordinary HTTP load balancer with health checks (e.g. nginx). It will spread the incoming query load among the available AZs and stop sending queries to an AZ as soon as it becomes unavailable. In this case every incoming select query is routed to a single AZ only, saving both compute resources (CPU + RAM + network) and cost compared to promxy, which sends every incoming query to all the AZs.
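For example, a minimal nginx sketch of that setup (hostnames and ports are placeholders; note that open-source nginx only does passive health checks via max_fails/fail_timeout, active checks need a proxy that supports them):

```nginx
upstream vmselect_per_az {
    # One entry per AZ; a backend drops out of rotation after repeated failures.
    server vmselect-az1.example.internal:8481 max_fails=3 fail_timeout=30s;
    server vmselect-az2.example.internal:8481 max_fails=3 fail_timeout=30s;
    server vmselect-az3.example.internal:8481 max_fails=3 fail_timeout=30s;
}

server {
    listen 8481;
    location / {
        proxy_pass http://vmselect_per_az;
        # Retry another AZ if the chosen one errors out or times out.
        proxy_next_upstream error timeout http_502 http_503 http_504;
    }
}
```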