grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Add Feature to Favor a Single AZ for Ingest #3422

Open disfluxly opened 1 year ago

disfluxly commented 1 year ago

Describe the solution you'd like

In a similar vein to https://github.com/grafana/mimir/issues/2284 and the current AZ Awareness, I'd like the ability to tell Mimir's Distributors to prefer writing to Ingesters located in the same AZ as themselves.

I'd also like the ability to flip a switch and have Distributors write to different AZs in the event of an AZ outage. This would allow for massive network cost savings on inter-AZ traffic for the 99.999% of the time when all AZs are available, while still providing the ability to adjust fairly quickly for an AZ failure (understanding that additional Ingester scaling may be needed in the non-impacted AZs when the switch is flipped, and that some data may become inaccessible in the impacted AZ).

Describe alternatives you've considered

Deploying everything into a single AZ. Unfortunately this isn't a very fault-tolerant solution and can quickly lead to IP exhaustion.

Additional context

Looking for the best of both worlds here! :)

Currently, even with snappy compression on distributor->ingester traffic, the ingest process makes up 99% of our inter-AZ network costs. Having this preference could save companies a lot of money.

disfluxly commented 1 year ago

Related to a discussion in Grafana Slack here: https://grafana.slack.com/archives/C039863E8P7/p1667947721073989

pracucci commented 1 year ago

In a similar vein to https://github.com/grafana/mimir/issues/2284 and the current AZ Awareness, I'd like the ability to tell Mimir's Distributors to prefer writing to Ingesters located in the same AZ as themselves.

Which replication factor are you using?

disfluxly commented 1 year ago

In a similar vein to #2284 and the current AZ Awareness, I'd like the ability to tell Mimir's Distributors to prefer writing to Ingesters located in the same AZ as themselves.

Which replication factor are you using?

Currently I'm on Cortex and planning to swap to Mimir in the coming months, but I'm running a replication factor of 3 with 48 ingesters spread across 3 AZs.

pracucci commented 1 year ago

Currently I'm on Cortex and planning to swap to Mimir in the coming months, but I'm running a replication factor of 3 with 48 ingesters spread across 3 AZs.

Got it. I was asking because when you deploy Mimir in multi-zone mode with replication factor = 3, each series is replicated to 3 ingesters, each in a different zone. Given this, I'm not sure I understand how you would like replication with a "favorite AZ" to work.
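
For readers following along, here is a minimal Go sketch of the idea behind zone-aware replica placement (illustrative only, not Mimir's actual ring code; all type and function names here are made up): with RF=3 and three zones, the replicas for a series end up in three different zones.

```go
package main

import "fmt"

// Ingester is a simplified stand-in for a ring member; not Mimir's real type.
type Ingester struct {
	Name string
	Zone string
}

// pickZoneAwareReplicas picks up to rf ingesters, at most one per zone,
// mirroring the idea behind zone-aware replication (illustrative only).
func pickZoneAwareReplicas(ring []Ingester, rf int) []Ingester {
	seenZones := map[string]bool{}
	var replicas []Ingester
	for _, ing := range ring {
		if len(replicas) == rf {
			break
		}
		if seenZones[ing.Zone] {
			continue // already have a replica in this zone
		}
		seenZones[ing.Zone] = true
		replicas = append(replicas, ing)
	}
	return replicas
}

func main() {
	ring := []Ingester{
		{"ingester-az1-0", "az1"}, {"ingester-az2-0", "az2"},
		{"ingester-az1-1", "az1"}, {"ingester-az3-0", "az3"},
	}
	fmt.Println(pickZoneAwareReplicas(ring, 3)) // one replica in each of az1, az2, az3
}
```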

disfluxly commented 1 year ago

Currently I'm on Cortex and planning to swap to Mimir in the coming months, but I'm running a replication factor of 3 with 48 ingesters spread across 3 AZs.

Got it. I was asking because when you deploy Mimir in multi-zone mode with replication factor = 3, each series is replicated to 3 ingesters, each in a different zone. Given this, I'm not sure I understand how you would like replication with a "favorite AZ" to work.

This wouldn't work in tandem with the cross-zone replication setting, but rather would be something alternative. A user wouldn't be able to have both "cross-zone-replication" and "favor-zone-replication" enabled at the same time.

The goal would be for the distributors to only write to ingester replicas in the same AZ. IE - if distributor-1 is in AZ2, replication factor == 3, then that distributor will write to 3 ingesters also in AZ2.

In my case of having 48 ingesters spread across 3 AZs equally (16 in each AZ), then each distributor will write to 3/16 ingesters in the same AZ as itself.

The purpose is to save money on cross-az network charges, where the ingesters currently make up 99% of my cross-az network usage.

The other request in here is the ability to flip the switch from a "favor-zone-replication" model to the cross-zone-replication model. For example, if AZ2 goes down while I'm in a "favor-zone-replication" mode, then obviously Mimir/Cortex grinds to a halt. Effectively I'd have 16/48 ingesters offline. What I'd like to do is scale out 16 more ingesters on AZ1 and AZ3, then flip the switch to cross-zone replication, and have Mimir then start to write cross-AZ. Once the AZ is back online, I'd have the ability to flip the switch back to a favored model.
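
A rough sketch of how the proposed selection could work (purely hypothetical; `favorOwnZone` and everything else here is made up for illustration, not an existing Mimir option): with the switch on, a distributor only picks ingesters in its own zone; with it off, selection falls back to the existing one-replica-per-zone behavior.

```go
package main

import (
	"errors"
	"fmt"
)

type Ingester struct {
	Name string
	Zone string
}

// pickReplicas sketches the proposed behavior: with favorOwnZone=true the
// distributor replicates only to ingesters in its own zone; with it flipped
// off, it falls back to one-replica-per-zone (the existing cross-zone model).
func pickReplicas(ring []Ingester, distributorZone string, rf int, favorOwnZone bool) ([]Ingester, error) {
	var replicas []Ingester
	seenZones := map[string]bool{}
	for _, ing := range ring {
		if len(replicas) == rf {
			break
		}
		if favorOwnZone {
			if ing.Zone == distributorZone {
				replicas = append(replicas, ing)
			}
			continue
		}
		if !seenZones[ing.Zone] {
			seenZones[ing.Zone] = true
			replicas = append(replicas, ing)
		}
	}
	if len(replicas) < rf {
		return nil, errors.New("not enough eligible ingesters for the requested replication factor")
	}
	return replicas, nil
}

func main() {
	ring := []Ingester{
		{"ingester-az2-0", "az2"}, {"ingester-az1-0", "az1"},
		{"ingester-az2-1", "az2"}, {"ingester-az3-0", "az3"},
		{"ingester-az2-2", "az2"},
	}
	// distributor-1 lives in az2: all three replicas stay in az2.
	fmt.Println(pickReplicas(ring, "az2", 3, true))
	// AZ outage: flip the switch back to cross-zone replication.
	fmt.Println(pickReplicas(ring, "az2", 3, false))
}
```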

Totally understand that the risk here is that I won't have access to any of the recent data in AZ2, and I'll need to manually scale back in the ingesters once AZ2 is back online. However, the cost savings, at least for my organization, greatly outweigh that risk, especially if it's partially mitigated by the ability to continue ingesting data after moving temporarily over to the cross-zone replication.

pracucci commented 1 year ago

The goal would be for the distributors to only write to ingester replicas in the same AZ. IE - if distributor-1 is in AZ2, replication factor == 3, then that distributor will write to 3 ingesters also in AZ2. In my case of having 48 ingesters spread across 3 AZs equally (16 in each AZ), then each distributor will write to 3/16 ingesters in the same AZ as itself.

What happens when AZ2 fails? All (or several) ingesters in AZ2 will be unhealthy, so all queries will start failing. If queries don't fail, then you will get partial query results, because metrics stored in the failing ingesters (AZ2) won't be available for reading. So, even though Mimir would be deployed in multiple availability zones, it wouldn't be more reliable than deploying Mimir in only 1 zone, because you would need to wait for AZ2 to come back up in order to guarantee query result correctness.

nickvanwegen commented 1 year ago

Hi, I'm also struggling with this topic at the moment. I have Mimir installed on our platform, and cross-AZ data transfer costs are getting pretty high.

My thought is: can https://kubernetes.io/docs/concepts/services-networking/topology-aware-hints/ help here? Let's say you have 3 ingesters per AZ and hints are enabled on the Mimir services; traffic would then prefer routing to the same AZ, i.e. the 3 ingesters in the same AZ. Maybe Mimir utilizing hints in a different way could also help this issue, since hints already come with built-in safeguards against failure.

disfluxly commented 1 year ago

What happens when AZ2 fails? All (or several) ingesters in AZ2 will be unhealthy, so all queries will start failing. If queries don't fail, then you will get partial query results, because metrics stored in the failing ingesters (AZ2) won't be available for reading. So, even though Mimir would be deployed in multiple availability zones, it wouldn't be more reliable than deploying Mimir in only 1 zone, because you would need to wait for AZ2 to come back up in order to guarantee query result correctness.

This is a risk that I think many would be willing to take in order to realize the cost-savings benefits, and I think it's still a better situation than placing all resources on a single AZ and having that single AZ fail. Partial results in the event of an outage are still better than no results. In addition, I think many of those risks could be mitigated by implementing a switch to move back and forth between "favor-zone-replication" and "cross-zone-replication" as I indicated above.

My thought is: can https://kubernetes.io/docs/concepts/services-networking/topology-aware-hints/ help here? Let's say you have 3 ingesters per AZ and hints are enabled on the Mimir services; traffic would then prefer routing to the same AZ, i.e. the 3 ingesters in the same AZ. Maybe Mimir utilizing hints in a different way could also help this issue, since hints already come with built-in safeguards against failure.

I think most of the logic already exists to favor the same zone for replication, since the "cross-zone-replication" option exists. The logic would just need to be copied and then updated to tell the distributors to write to ingesters in the same AZ as themselves, rather than to ingesters in different AZs. The distributors would also need the logic to know their own AZ, which already exists.

ajpauwels commented 1 year ago

I've been looking at this topic for the past week now and I think this might be doable by deploying things in an AZ-segregated way and then taking the cross-AZ cost hit in the read path. Such a solution would not just keep Distributor->Ingester traffic AZ-local, but also Workload<-Agent->Distributor traffic. However, as already mentioned above, there are availability tradeoffs to this solution.

I'll start with how I think this would work:

  1. 1+ Grafana Agents are deployed per AZ and configured to scrape endpoints in their AZ only
  2. 1+ Mimir Distributors are deployed per AZ; each AZ-specific Distributor group receives a separate, zone-specific headless service
  3. 3+ Mimir Ingesters are deployed per AZ; each AZ-specific Ingester group receives a separate, zone-specific headless service

The Grafana Agent for an AZ is configured to remote_write to the zone-specific Distributor headless service for its AZ. Similarly, the Distributors for an AZ are configured to shard/replicate across the 3+ Ingester instances in their AZ. From there, data goes to object storage.

This should keep all data flows AZ-local on the write path.
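
As a concrete illustration of the zone-local write path, here is a small sketch of how an agent's remote_write target could be derived per zone. The service naming scheme and namespace are assumptions for this example, not something Mimir or its Helm chart provides out of the box; only the /api/v1/push path is Mimir's actual push endpoint.

```go
package main

import (
	"fmt"
	"os"
)

// zoneLocalRemoteWriteURL builds the remote_write URL for a hypothetical
// zone-specific distributor headless Service named mimir-distributor-<zone>
// in the "mimir" namespace (one headless Service per AZ is assumed).
func zoneLocalRemoteWriteURL(zone string) string {
	return fmt.Sprintf("http://mimir-distributor-%s.mimir.svc.cluster.local:8080/api/v1/push", zone)
}

func main() {
	// The zone could be injected via the downward API or node metadata.
	zone := os.Getenv("AVAILABILITY_ZONE")
	if zone == "" {
		zone = "az1"
	}
	fmt.Println(zoneLocalRemoteWriteURL(zone))
}
```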

However, the problem now is on the read path. Assuming you somehow figure out topology-aware routing and can keep requests coming into a particular query-frontend routed to schedulers and queriers in the same AZ, those queriers will inevitably HAVE to make cross-AZ requests to get recent data from other zones. Obviously, long-term stored data would be shared across all zones.

There's a lot of what-ifs in this solution though:

  1. Would this be deployed as three separate Mimir clusters with three separate memberlists/gossip rings etc., or as one Mimir cluster where subsets of it only communicate with other subsets? The latter would imply SOME amount of cross-AZ cost, since the memberlist gossip would still cross zones.

  2. What are the implications of having distributors for tenant A in AZ1 send to three ingesters that are different than the ones for tenant A in AZ2? Can all the ingesters across all zones still share the same object storage bucket (when they're on the same memberlist or on different memberlists)?

  3. Would it be possible to organize some sort of out-of-band replication of the underlying stateful storage used by the ingesters? For example, instead of having the distributors perform zone-aware replication, could I instead implement replication by performing some sort of cheap, bulk replication of the underlying EBS store in one zone to another?

One could also adopt a middle ground that minimizes cross-AZ costs without all the additional complexity mentioned above. You could keep workload/agent/distributor traffic AZ-local, take the hit on the write path and have the distributors perform zone-aware replication to the ingesters WITH snappy compression enabled, and then keep the read path all single-AZ as well.

disfluxly commented 1 year ago

I've been looking at this topic for the past week now and I think this might be doable by deploying things in an AZ-segregated way and then taking the cross-AZ cost hit in the read path.

I think a lot of what you said in your post is a good idea, but I do think that in regards to this specific feature request it might be attempting to boil the ocean.

From this issue's perspective, I think it should remain isolated to just figuring out intra-Mimir connectivity for same-zone distributor -> ingester ingestion to avoid cross-AZ network charges, and what the pitfalls of a solution like that are.

I think things like:

  1. Having same-zone preferred querying
  2. Adding same-zone preferred ingestion to the Grafana Agent

should be two separate feature requests. It's all linked, but easier to keep discussions on point if we separate out those pieces.

nickvanwegen commented 1 year ago

I agree with @disfluxly: the distributor -> ingester and ingester -> ingester traffic (the querier can also generate a lot of traffic, but I'm not sure how that flow goes) is where the biggest amount of data transfer is happening at the moment, so it would be nice if this could remain a scoped feature.

pracucci commented 1 year ago

Partial results in the event of an outage are still better than no results.

I think this is the main divergence here. So far, Mimir has been designed to guarantee query result correctness. Either a query succeeds (and is guaranteed to return correct results, barring unknown bugs) or the query fails.
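
To make the correctness argument concrete, here is a small sketch of the usual majority-quorum arithmetic (assuming the common floor(rf/2)+1 rule; illustrative, not a quote of Mimir's code): with cross-zone RF=3, losing one AZ still leaves a quorum, while with same-zone replication losing that AZ leaves none.

```go
package main

import "fmt"

// quorum returns the minimum number of successful replica responses needed
// for a request to be considered successful, using the majority rule.
func quorum(rf int) int {
	return rf/2 + 1
}

func main() {
	rf := 3
	fmt.Println("quorum for RF=3:", quorum(rf)) // 2

	// Cross-zone replication: one zone down -> 1 of 3 replicas lost,
	// 2 remain, quorum is still reachable, results stay correct.
	fmt.Println("cross-zone, one AZ down:", 3-1 >= quorum(rf)) // true

	// Same-zone replication: the zone holding all 3 replicas goes down ->
	// 0 replicas remain, quorum is impossible for those series.
	fmt.Println("same-zone, that AZ down:", 3-3 >= quorum(rf)) // false
}
```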

I personally believe this is an important property for a monitoring system. During an outage (e.g. an AZ is down), as an on-call operator I want (and need) to trust the monitoring system. While handling an incident, I want to look at the dashboards and have the certainty that what I'm seeing is true. I don't want to have the doubt "maybe I'm just seeing partial data, and the actual state of my infrastructure is different".

disfluxly commented 1 year ago

I personally believe this is an important property for a monitoring system. During an outage (e.g. an AZ is down), as an on-call operator I want (and need) to trust the monitoring system.

This is still achievable, and there is a best-of-both-worlds scenario.

For example, I run two separate clusters in different regions: one in us-east-1, another in us-west-2. I ingest all metrics into both regions.

Normally I weight queries 50/50 across regions. If an AZ in us-east-1 goes down, I simply route all queries to us-west-2, which has a complete set of data. My on-call operators won't even know.

This is one way to mitigate that problem (and many other potential ones). The challenge is still wanting to reduce the cost of each cluster. Sure, I could put Mimir into a single AZ in each region; however, that still carries more risk than I'd like to bear. As a reference point, when AWS had a prolonged outage in us-east-1 over Thanksgiving a couple of years ago, us-west-2 started showing instability. Why? Because everybody who had workloads running in us-east-1 was scrambling to migrate those workloads to us-west-2. When an AZ or Region goes down, the risk factor of all remaining AZs/Regions goes up. So I'd still like to have a multi-AZ setup in each region.

If you run a single-region cluster, then yes, I absolutely think you should be using cross-AZ replication, or highlighting the risks to your users.

If you run a multi-region cluster, then I think there should be options that help reduce intra-cluster costs, since you've already taken an approach that mitigates many potential issues.

MrFreezeex commented 1 year ago

If you run a multi-region cluster, then I think there should be options that help reduce intra-cluster costs, since you've already taken an approach that mitigates many potential issues.

Hi @disfluxly :wave:, this might be slightly off topic, but what do you use as backend storage for a cross-region cluster? Do you use buckets from one region in your cluster, or buckets from multiple regions? AFAIK you can set up cross-region replication on an S3 bucket, but it's done asynchronously with eventual consistency; I'm not sure Mimir would behave correctly with that kind of behavior...

disfluxly commented 1 year ago

@MrFreezeex - I use an S3 bucket in each region. Each S3 bucket has the full set of blocks. In the event that a region goes down and is thus missing blocks, I simply copy over the blocks for the time period from the other s3 bucket. The compactor then takes care of the rest.

MrFreezeex commented 1 year ago

@MrFreezeex - I use an S3 bucket in each region. Each S3 bucket has the full set of blocks. In the event that a region goes down and is thus missing blocks, I simply copy over the blocks for the time period from the other s3 bucket. The compactor then takes care of the rest.

Wait wait, so in your cross-region cluster you have a single Mimir cluster spanning multiple AWS regions, with one Mimir zone = one AWS region, and each zone has a separate S3 bucket for blocks, without any S3 bucket-level replication other than the occasional "manual" copy you do during a downtime?

I was not expecting the ingesters to gracefully handle multiple storage backends in different zones like that at all... Is this something officially supported, or does it just happen to work (or am I misunderstanding something and you are not doing that at all)?

disfluxly commented 1 year ago

Wait wait, so in your cross-region cluster you have a single Mimir cluster spanning multiple AWS regions, with one Mimir zone = one AWS region, and each zone has a separate S3 bucket for blocks, without any S3 bucket-level replication other than the occasional "manual" copy you do during a downtime?

No, not quite. I have 2 completely separate Mimir clusters. One in each region (us-east-1/us-west-2). They do not know about each other. In my ingest pipeline, I write all metrics received to both clusters.

Each cluster has its own S3 bucket. If, for some reason, us-east-1 were to go down for 3 hours, I can simply copy the blocks from my us-west-2 S3 bucket to my us-east-1 S3 bucket for that 3 hour period, and the compactor will index them for the store gateways to pick up. At the end of the day, they're just blocks, and they aren't tied to a specific cluster for any reason. This is why you can upgrade from something like Cortex to Mimir without losing your long-term blocks.
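
For illustration, here is a rough Go sketch of that kind of bulk block copy using the AWS SDK (bucket names, regions, and the tenant prefix are placeholders; in practice `aws s3 sync` or a similar tool does the same job):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	const (
		srcBucket = "mimir-blocks-us-west-2" // placeholder: bucket of the healthy region
		dstBucket = "mimir-blocks-us-east-1" // placeholder: bucket missing the blocks
		prefix    = "tenant-1/"              // placeholder: one tenant's blocks
	)

	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// One client per region: list from the source region, copy into the destination region.
	srcClient := s3.NewFromConfig(cfg, func(o *s3.Options) { o.Region = "us-west-2" })
	dstClient := s3.NewFromConfig(cfg, func(o *s3.Options) { o.Region = "us-east-1" })

	// Walk every object under the prefix and copy it across. The compactor and
	// store-gateways in the destination cluster discover the new blocks on their own.
	p := s3.NewListObjectsV2Paginator(srcClient, &s3.ListObjectsV2Input{
		Bucket: aws.String(srcBucket),
		Prefix: aws.String(prefix),
	})
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for _, obj := range page.Contents {
			// CopySource is "<bucket>/<key>" (URL-encode keys with special characters).
			_, err := dstClient.CopyObject(ctx, &s3.CopyObjectInput{
				Bucket:     aws.String(dstBucket),
				Key:        obj.Key,
				CopySource: aws.String(srcBucket + "/" + *obj.Key),
			})
			if err != nil {
				log.Fatal(err)
			}
		}
	}
}
```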

algo7 commented 2 months ago

Any news?

gespi1 commented 5 days ago

bump 👀