grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Setting to automatically forget unhealthy ring members #2869

Open RECturtle opened 2 years ago

RECturtle commented 2 years ago

Is your feature request related to a problem? Please describe.

Yes. When running in ECS, if a container goes down it remains in the memberlist as an unhealthy member, and we have to manually remove the member in the UI to resolve it.

Describe the solution you'd like

An option that automatically forgets unhealthy members from the rings (ingester, distributor, etc.)

Describe alternatives you've considered

Manual alerting to notify us when manual forgetting needs to occur.

Additional context

This is similar to the feature in Loki: https://github.com/grafana/loki/pull/3919

Andysimcoe commented 2 years ago

Same problem, I thought I was missing a config option somewhere.

If you're curious, that's what we see: the unhealthy members just keep building up. That may not sound like a problem, but if you don't catch them the distributor fails; even with enough healthy ingesters, the distributor just stops.

pstibrany commented 2 years ago

Note that "unhealthy" distributors in the ring are not a problem for running Mimir, and Mimir 2.2 already includes functionality to automatically remove distributors from the ring: #2154

There is no such functionality planned right now for ingesters. We suggest addressing this by 1) making sure ingesters don't crash, and 2) making them reuse hostnames between restarts. That way they also reuse their existing ring entries.
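
As a minimal sketch of 2), assuming the ring identity is taken from the --ingester.ring.instance-id flag (which I believe defaults to the hostname), a stable identity could be pinned like this; the name derivation is illustrative, not a recommendation:

# Sketch only: pin the ring identity to a stable name so a restarted ingester
# re-claims its existing ring entry instead of registering a new one.
# The stable-name derivation below is an assumption.
STABLE_NAME="ingester-$(hostname)"

exec /bin/mimir \
  --target=ingester \
  --config.file=/etc/mimir/mimir-conf.yml \
  --ingester.ring.instance-id="${STABLE_NAME}"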

Andysimcoe commented 2 years ago

I don't think relying on the ingesters never crashing is a great option. Sure, I'll look at why they crash, but in such an environment I wouldn't rely on even very mature software never crashing.

I had to get quite inventive to have ingesters reuse a directory on the EFS storage, as it's a bit different from running in K8s. So you're saying, @pstibrany, that if the instance_id is the same it'll just reuse the ring entry? I think that might be doable...

wilfriedroset commented 2 years ago

There is no such functionality planned right now for ingesters. We suggest addressing this by 1) making sure ingesters don't crash, and 2) making them reuse hostnames between restarts. That way they also reuse their existing ring entries.

This would be helpful, especially when running Mimir on bare metal. Ingesters do reuse the hostname between restarts; however, a restart (after a Mimir upgrade, for example) or a reboot (kernel upgrade) can take some time, and the ingesters can become unhealthy, which causes Mimir to not accept samples in the meantime.

On our end we are evaluating an automation which would automatically remove unhealthy ingesters. The gist of this automation is to add support for JSON output on any /ring route via a query parameter ?json=True. With this information the automation executes the call to remove all unhealthy ingesters, something like:

curl -sk -X POST https://distributor.mimir.service.consul:9009/ingester/ring -d "forget=mimir-202-ingester-2.tld"

pstibrany commented 2 years ago

On our end we are evaluating an automation which would automatically remove unhealthy ingesters. The gist of this automation is to add support for JSON output on any /ring route via a query parameter ?json=True. With this information the automation executes the call to remove all unhealthy ingesters, something like:

If you send Accept: application/json to ring pages, they send JSON output.
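
Combining that with the forget call shown above, the automation could look roughly like the sketch below; the JSON field names and the state value are assumptions about the payload shape, not a documented schema:

# Rough sketch of the forget automation discussed above.
# Assumptions: the ring page returns JSON when asked via the Accept header
# (as described above), and the payload exposes .shards[].id and .shards[].state
# with an "Unhealthy" state value (field names are a guess).
RING_URL="https://distributor.mimir.service.consul:9009/ingester/ring"

unhealthy=$(curl -sk -H "Accept: application/json" "$RING_URL" \
  | jq -r '.shards[] | select(.state == "Unhealthy") | .id')

for member in $unhealthy; do
  # Same forget call as the curl example above.
  curl -sk -X POST "$RING_URL" -d "forget=${member}"
done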

pracucci commented 2 years ago

The reason why there's no auto-forget for ingesters is that if an unhealthy ingester is forgotten, then queries will succeed and return partial results instead of failing (Mimir was designed to guarantee query result correctness, so either the query succeeds and is guaranteed to be correct, or it fails).

I'm not saying that your feedback is wrong, I actually want to learn more about your use case. What scenario would you like to solve with the auto-forget feature of ingesters?

Note: we do support auto-forget for other components (e.g. distributors, store-gateways, ...) where it's safe to auto-forget an instance in the ring.

This would be helpful, especially when running Mimir on bare metal. Ingesters do reuse the hostname between restarts; however, a restart (after a Mimir upgrade, for example) or a reboot (kernel upgrade) can take some time, and the ingesters can become unhealthy, which causes Mimir to not accept samples in the meantime.

If you run Mimir in a single zone, you should restart 1 ingester at a time. If you run Mimir with multi-zone replication, you can restart 1+ ingesters in the same zone at a time.
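
For the single-zone case, that discipline could be scripted along the lines of the sketch below; the hostnames, the systemd unit name, and the port are assumptions, and /ready is Mimir's readiness endpoint:

# Sketch of a one-at-a-time rolling restart for a single-zone setup.
# Hostnames, unit name, and port are placeholders.
for host in ingester-01 ingester-02 ingester-03; do
  ssh "$host" 'sudo systemctl restart mimir'
  # Wait until the restarted ingester reports ready before touching the next one.
  until curl -sf "http://${host}:8080/ready" >/dev/null; do
    sleep 5
  done
done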

wilfriedroset commented 2 years ago

If you run Mimir in a single zone, you should restart 1 ingester at a time. If you run Mimir with multi-zone replication, you can restart 1+ ingesters in the same zone at a time.

We are running Mimir with multi-zone replication, and we do rolling restarts with great care for the ring.

I'm not saying that your feedback is wrong, I actually want to learn more about your use case. What scenario would you like to solve with the auto-forget feature of ingesters?

I'm anticipating network failures between hosts running Mimir, or the complete loss of an instance. Unlike a K8s deployment, the instance(s) will not be respawned automatically somewhere else. This means that between the loss of the instance and the fix we are left with an unhealthy ring.

RECturtle commented 2 years ago

@Andysimcoe If you're able to set the same instance id for a new container that's spinning up to replace the unhealthy one, please post here. Would definitely be interested in seeing how that could work.

Andysimcoe commented 2 years ago

@RECturtle I'm planning on overriding the entrypoint with a script which determines the instance_id and then sets it as a flag on the mimir binary. That will largely amount to finding a directory on the EFS filesystem which only the ingesters use and taking that name.

How are you handling persistent storage for the ingesters? The answer may lie there.

pracucci commented 2 years ago

I'm anticipating network failures between hosts running Mimir, or the complete loss of an instance. Unlike a K8s deployment, the instance(s) will not be respawned automatically somewhere else. This means that between the loss of the instance and the fix we are left with an unhealthy ring.

The current behaviour:

other failure scenarios (2+ unhealthy ingesters/zones): reads and writes fail

Automatically forgetting ingesters in this scenario will lead to partial query results. Maybe it's acceptable, maybe not. In Mimir we deliberately chose not to accept it, because we want to trust the monitoring system. If a query succeeds, we want to trust that the query result is correct.

What about the write path? The write path could continue to work even if 2+ ingesters are unhealthy, basically sharding series among the healthy ones. That's not what we currently do, because if you have many unhealthy ingesters, the healthy ones could get overloaded and this could end up in a cascading failure. In the past we had a configuration option to change this behaviour, but it was removed for simplicity. I think we could reconsider it (e.g. re-shard to healthy ingesters only if the unhealthy ones are below a given % of the total).
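
To make that proposal concrete, the guardrail would amount to a check along these lines; the 20% threshold is an arbitrary illustration, not an existing Mimir option:

# Illustration of the proposed guardrail, not a real flag: only spill series
# onto the healthy ingesters when the unhealthy fraction stays below a limit.
total_ingesters=15
unhealthy_ingesters=2
max_unhealthy_percent=20   # hypothetical threshold

if [ $(( unhealthy_ingesters * 100 / total_ingesters )) -lt "$max_unhealthy_percent" ]; then
  echo "re-shard writes to the remaining healthy ingesters"
else
  echo "fail writes to avoid overloading the healthy ingesters"
fi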

wilfriedroset commented 2 years ago

I think we could reconsider it (e.g. re-shard to healthy ingesters only if the unhealthy ones are below a given % of the total).

This would be a good middle ground, as it would allow us to not fail all writes if the cluster is able to sustain them.

Sam-H101 commented 2 years ago

We have this issue due to the fact we use Docker Swarm for our Mimir cluster. It would be a benefit if there was a flag that could be set to allow removing unhealthy members.

pracucci commented 2 years ago

We have this issue due to the fact we use Docker Swarm for our Mimir cluster. It would be a benefit if there was a flag that could be set to allow removing unhealthy members.

I'm not familiar with Docker Swarm, so sorry for my basic question: what's the problem specifically with Docker Swarm? Is it that you can't have anything similar to a Kubernetes StatefulSet? Could you share more details to let me better understand, please?

Andysimcoe commented 2 years ago

@Andysimcoe If you're able to set the same instance id for a new container that's spinning up to replace the unhealthy one, please post here. Would definitely be interested in seeing how that could work.

I have it working. For the ingesters I use a different image; I build it overwriting the entrypoint with a shell script. I was doing this anyway, as I have the ingesters using an EFS filesystem across multiple AZs. So when the ingester starts it needs to either create a directory which it then assigns to '--blocks-storage.tsdb.dir=${tsdb_path}', or reuse an existing one (it also cleans up old dirs that may be orphaned).

How it handles that isn't massively relevant, but I wrote a small Go binary that takes an argument and spits out various values, such as ports used by the host (if running in certain network modes on EC2-backed ECS; I've since moved everything to Fargate, which uses awsvpc mode). Really all this is doing is hitting the metadata endpoint available in ECS (https://docs.aws.amazon.com/AmazonECS/latest/userguide/task-metadata-endpoint-fargate.html) and parsing the JSON output. One of these values is the container name; I shorten it and that's what I use for the instance_id. So my end command in the entrypoint script looks like:

# build the command to run
CMD="/bin/mimir --target=${MIMIR_TARGET} --config.file=/etc/mimir/mimir-conf.yml \
        --config.expand-env=true --blocks-storage.tsdb.dir=${tsdb_path} --ingester.ring.instance-id=${instance_id}"

# run
exec $CMD

MIMIR_TARGET is an environment variable set in the ECS task. If there's an existing directory it will use that and use the existing instance_id rather than the new container name.

In short, to determine whether a directory is in use, I have another process running in the background that touches an .ownership file every 5 seconds; if that file hasn't been touched in, let's say, 10 seconds, this container takes ownership of it. I don't have this running in production but so far it looks to be working.

By background process, I mean that in the entrypoint script it's doing this:

# bg process to keep ownership
while true; do touch $tsdb_path/.ownership && sleep 5;  done &

owner_age=$(stat -c '%Y' $root_dir/$dir/.ownership)

Pretty hacky stuff, but it's just for the ingesters and seems to be OK. I'll leave it running while I work on a few other things; so far I've just been testing by stopping, starting, and killing the container to try and recreate a crash. They haven't crashed in the 5 days I've had this running, though (I'm running 15 in this cluster). I'll likely make something a little nicer, which is why I haven't posted the full script or the Go binary; I probably won't end up using it in this state, but it's the current path I'm on.
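
Pieced together, the claiming logic amounts to something like the sketch below; the shared mount path, the 10-second staleness threshold, and variable names are assumptions based on this comment, not the actual script:

# Condensed sketch of the directory-claiming approach described above.
# root_dir is the shared EFS mount; a directory is considered free when its
# .ownership file has not been touched for more than 10 seconds.
root_dir=/mnt/efs/ingesters
now=$(date +%s)
tsdb_path=""

for dir in "$root_dir"/*/; do
  owner_age=$(stat -c '%Y' "${dir}.ownership" 2>/dev/null || echo 0)
  if [ $(( now - owner_age )) -gt 10 ]; then
    tsdb_path="${dir%/}"   # reclaim the orphaned directory (and its instance_id)
    break
  fi
done

# No free directory found: create a fresh one named after this container
# (instance_id comes from the shortened container name as described earlier).
[ -z "$tsdb_path" ] && tsdb_path="$root_dir/$instance_id" && mkdir -p "$tsdb_path"

# Keep asserting ownership for as long as this ingester runs.
while true; do touch "$tsdb_path/.ownership" && sleep 5; done &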

Sam-H101 commented 2 years ago

We have this issue due to the fact we use Docker Swarm for our Mimir cluster. It would be a benefit if there was a flag that could be set to allow removing unhealthy members.

I'm not familiar with Docker Swarm, so sorry for my basic question: what's the problem specifically with Docker Swarm? Is it that you can't have anything similar to a Kubernetes StatefulSet? Could you share more details to let me better understand, please?

It's the same issue as the overall thread. In Swarm you cannot set container IDs to be static, as it will replicate the image X number of times across all nodes.

pracucci commented 2 years ago

It's the same issue as the overall thread. In Swarm you cannot set container IDs to be static, as it will replicate the image X number of times across all nodes.

Can you have persistent disks? They would be required anyway because of the WAL. If so, then what we could do is allow persisting the instance ID on disk (similarly to what we do for the tokens). Basically, the first time an ingester is started with a new disk, the instance ID would be auto-generated and then persisted to disk, and subsequent restarts would load it from disk. Would that cover your use case?
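
Until built-in support like that exists, a similar effect can be approximated from an entrypoint script; the sketch below assumes a per-replica /data volume, and the ID file name and generation scheme are illustrative:

# Sketch of persisting the ring identity on the ingester's data disk.
# /data is assumed to be the same persistent volume that already holds the WAL.
ID_FILE=/data/instance-id

if [ -f "$ID_FILE" ]; then
  instance_id=$(cat "$ID_FILE")     # subsequent starts reuse the stored ID
else
  # First start with a new disk: generate an ID and persist it.
  instance_id="ingester-$(od -An -N4 -tx4 /dev/urandom | tr -d ' \n')"
  echo "$instance_id" > "$ID_FILE"
fi

exec /bin/mimir --target=ingester \
  --config.file=/etc/mimir/mimir-conf.yml \
  --blocks-storage.tsdb.dir=/data/tsdb \
  --ingester.ring.instance-id="$instance_id"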

RECturtle commented 2 years ago

For me, being able to persist the instance ID to disk would work. We can utilize EFS to write to and read from. Would the idea be that you write X number of IDs (# of Mimir instances running) to a spot in a persistent store and Mimir would be able to read from those and pick up a "free" ID? Or is the idea that each Mimir instance should have its own persistent store and there would only be one ID for it to read from (so it doesn't have to identify which one is free to use)? We could make either work, but the first option would be preferred. You could also be thinking something completely different, I'm just thinking through it myself 😆

Andysimcoe commented 2 years ago

If you're using EFS and ECS, it would work if you define each ingester as its own task and define the mount target as a directory within the EFS filesystem (or you may have multiple EFS filesystems).

But if you're doing that, you can also just set an environment variable for each task defining the instance ID (e.g. ingester-01, ingester-02). The problem with that is it's quite messy, and scaling up requires creating a new task definition each time rather than just incrementing a number. It will work though.

If you want one task definition as part of an ingester service running in your ECS cluster, there's no nice way to persist storage out of the box. As you'll know, containers aren't stopped and started; a new one is created, so it's not the same container and there's no persistent storage.

I got it to work as in my previous comment, more as a fun experiment, but if you want Mimir to work out of the box, I think you'd have to have a task definition for each ingester. The rest of the components work fine without it.

pracucci commented 2 years ago

Or is the idea that each Mimir instance should have its own persistent store and there would only be one ID for it to read from (so it doesn't have to identify which one is free to use)?

The idea I had in mind was this one. We require per-replica dedicated storage anyway for the WAL, no? How do you handle the WAL replay when an ingester container is restarted/rescheduled? The same approach should work to persist the instance ID too.

callumj commented 1 year ago

I think we could reconsider it (e.g. re-shard to healthy ingesters only if the unhealthy ones are below a given % of the total).

@pracucci This is something we'd love to have in Mimir (and something I think we miss from Cortex; my read is that they apply a floor on the minimum replication factor rather than a ceiling): the ability to prefer availability, even at the risk of running with one ingester for a given series.

We'd rather have just one ingester around and able to accept writes (so real-time alerting is not impacted) and have an alert on unhealthy ingesters so we can intervene.