prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0
[Proposal] Change Config Format #2302

Open bigangryrobot opened 4 years ago

bigangryrobot commented 4 years ago

Problem

Single-file alertmanager configuration with no external file loading creates a system that is frustratingly hard to configure and manage. This is further evidenced by the prometheus operator tickets here:

https://github.com/coreos/prometheus-operator/issues/2927, https://github.com/coreos/prometheus-operator/issues/2766, https://github.com/coreos/prometheus-operator/issues/1528, https://github.com/coreos/prometheus-operator/issues/1498, https://github.com/coreos/prometheus-operator/issues/2957

Proposal

Specific Config changes

The core config file would have 3 additional sections that allow elements to be loaded from files. File loading would follow the current prometheus design, firing up watchers for the paths given:

routes_files:
  - /etc/alertmanager/routes.yaml
receivers_files:
  - /etc/alertmanager/receivers/*.yaml
inhibit_files:
  - /etc/alertmanager/inhibit/*.yaml
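
As a rough illustration of what resolving these sections could involve, the sketch below expands each glob pattern into a sorted list of files. The section names come from the proposal above; the function name `expand_file_sections` and the sort-and-deduplicate rule are assumptions made for illustration, not anything alertmanager implements.

```python
import glob

# Section names taken from the proposal; they are not existing alertmanager options.
FILE_SECTIONS = ("routes_files", "receivers_files", "inhibit_files")


def expand_file_sections(core_config):
    """Expand the glob patterns listed under each proposed *_files section.

    Returns a dict mapping section name to a sorted, de-duplicated list of
    matching paths, so the load order does not depend on directory order.
    """
    resolved = {}
    for section in FILE_SECTIONS:
        paths = set()
        for pattern in core_config.get(section, []):
            paths.update(glob.glob(pattern))
        resolved[section] = sorted(paths)
    return resolved
```

A file watcher, as the proposal suggests, would then re-run this expansion whenever one of the watched paths changes. Sorting the result is one way to keep reloads deterministic.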

brian-brazil commented 4 years ago

This proposal is a bit confusing. It seems to aim to make fundamental breaking changes without explaining what the involved changes are.

From what I can piece together #180 already covers inhibitions, and the other semantics are already possible.

bigangryrobot commented 4 years ago

Let me get some code examples together, which may help clarify. Also, this will be 100% backwards compatible.

bigangryrobot commented 4 years ago

So first of all, haha, my apologies if things are unclear. I looked for a few examples in the issues to show me how best to describe this to everyone. The key idea: there should be no single root route; instead we need to allow multiple roots. See https://github.com/prometheus/alertmanager/blob/master/config/config.go#L176 and https://github.com/prometheus/alertmanager/blob/master/config/config.go#L422

With a single root route we are stuck having all of the config in one flat file. Once we move to multiple roots, we can implement file loading similar to how prometheus loads rule files, which will allow teams to create their own sets of routes, inhibits and receivers without conflicts and without needing to change the single root or core file:

rule_files:
- /etc/prometheus/rules/*.rules
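
To make the "multiple roots" idea concrete, here is a minimal sketch in which every root route tree is evaluated independently against an alert's labels. The `Route` class, the exact-match semantics and the team names are all hypothetical simplifications; alertmanager's real routing also supports regex matchers, `continue`, grouping and more.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    """A simplified routing node: label matchers, a receiver, child routes."""
    receiver: str
    match: dict = field(default_factory=dict)
    routes: list = field(default_factory=list)


def matches(route, labels):
    # Every matcher on the route must equal the alert's label value.
    return all(labels.get(k) == v for k, v in route.match.items())


def resolve(route, labels):
    """Return the receiver of the deepest matching route in one tree, or None."""
    if not matches(route, labels):
        return None
    for child in route.routes:
        found = resolve(child, labels)
        if found is not None:
            return found
    return route.receiver


def resolve_all(roots, labels):
    """Evaluate every root independently and collect the receivers that match."""
    receivers = []
    for root in roots:
        found = resolve(root, labels)
        if found is not None:
            receivers.append(found)
    return receivers


# Hypothetical per-team roots, e.g. one loaded from each team's own file.
team_a = Route("team-a-default", {"team": "a"},
               [Route("team-a-pager", {"severity": "critical"})])
team_b = Route("team-b-default", {"team": "b"})
```

With these roots, an alert labelled `team=a, severity=critical` resolves only inside team A's tree, and team B's tree is never consulted for it beyond its own root matcher. An alert that matches no root gets no receiver at all, which is one behavioural difference from a single catch-all root.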

bigangryrobot commented 4 years ago

I guess an alternative here would be to allow child routes to be injected via additional configs, but the tree-based routing, filtering and suppression would still apply to the entire tree, which makes me want to separate it out completely.

roidelapluie commented 4 years ago

It seems like you are just looking to run multiple independent clusters of Alertmanager?

bigangryrobot commented 4 years ago

"It seems like you are just looking to run multiple independent clusters of Alertmanager?"

While this is an option, I don't think that it has to be the only one. I feel that the design of alertmanager should incorporate flexibility in the configuration to accommodate more complex needs. As an example, I already run around 40 kubernetes clusters with alertmanagers deployed into each, plus around 12 alertmanagers in on-prem datacenters. Moving to a multitenancy model that requires individual alertmanagers means that I would not only have to multiply my current deployments by a factor of 20 or 30 for our current teams, which is far from ideal, but would also have to automate the instantiation of an entire alertmanager with global defaults and each team's needs.

I think that the prometheus operator team covers this nicely in their design review of the alertmanager limitations and their desire to incorporate a CRD-based template of the config, though they are essentially working around this core design limitation in alertmanager:

Alternative solutions

External configuration management system: One could use an existing configuration system to generate the Alertmanager configuration out of band and update the Alertmanager configuration secret as needed.

It means that everything can be managed with CRDs except for Alertmanager which isn’t satisfying. And it would be another solution that someone has to maintain.

Prometheus/Alertmanager per namespace or team: An application developer can deploy Alertmanager instances directly. But it is a waste of resources and adds unnecessary maintenance burden.
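
For reference, the "external configuration management system" alternative amounts to something like the sketch below: per-team fragments are merged out of band under a single root route and written out as one file. The fragment shape, the `default` receiver and the function names are hypothetical. The output is rendered as JSON on the assumption that the YAML parser in use accepts JSON flow syntax, which should be checked against the target version.

```python
import json


def build_config(global_section, fragments):
    """Merge per-team fragments under one root route.

    `fragments` maps a team name to a dict with a 'route' (one child route)
    and a list of 'receivers'. This shape is chosen only for illustration.
    """
    config = {
        "global": global_section,
        "route": {"receiver": "default", "routes": []},
        "receivers": [{"name": "default"}],
    }
    for team in sorted(fragments):  # sorted for a stable, diff-friendly output
        fragment = fragments[team]
        config["route"]["routes"].append(fragment["route"])
        config["receivers"].extend(fragment["receivers"])
    return config


def render(config):
    # JSON output, assuming the consuming YAML parser accepts it.
    return json.dumps(config, indent=2, sort_keys=True)
```

This is the piece of tooling "that someone has to maintain" in the quoted text: it lives outside alertmanager and has to be re-run on every fragment change.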

bigangryrobot commented 4 years ago

OK, I had a bit of a think on this after a review of the code and reworked my proposal. This would keep with the core design principle and essentially inject sections of config into the core config, though arguably with a somewhat better design than doing so outside of alertmanager.

simonpasquier commented 4 years ago

Leaving aside per-route inhibitions (which are tracked in https://github.com/prometheus/alertmanager/issues/180), I don't see anything that couldn't be covered by prometheus-operator once it supports the new AlertmanagerConfig CRD (hint: I've contributed the design proposal). People who aren't sold on prometheus-operator still need something to manage their configuration (e.g. Ansible, Chef, ...), which can do a similar job.

Managing configuration across multiple files doesn't follow the principle of least surprise IMO (how do you resolve conflicts when multiple files define the same route name?). Finally we don't want to support different approaches for the various projects under the Prometheus organization (https://github.com/prometheus/prometheus/issues/5519 was a similar request for Prometheus).
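
The conflict question can be made concrete with a small sketch. One possible policy, shown here purely as an assumption and not as anything either project does for this case, is to refuse to load when two files define the same name rather than letting one silently win. The example keys on receiver names, and `merge_receivers` is a hypothetical helper.

```python
def merge_receivers(files):
    """Merge receivers from several files, failing on duplicate names.

    `files` maps a file path to the list of receiver dicts it defines.
    """
    seen = {}    # receiver name -> path that first defined it
    merged = []
    for path in sorted(files):  # fixed order so errors are reproducible
        for receiver in files[path]:
            name = receiver["name"]
            if name in seen:
                raise ValueError(
                    f"receiver {name!r} defined in both {seen[name]} and {path}"
                )
            seen[name] = path
            merged.append(receiver)
    return merged
```

Failing loudly keeps behaviour predictable, at the cost that one team's mistake can block a reload for everyone, which is part of the trade-off being debated here.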

bigangryrobot commented 4 years ago

Agreed that the per-route inhibitions can be left out for now, and I've removed those from the proposal. Focusing strictly on the configuration portions, and specifically on requiring external configuration management (prom-operator or otherwise), I have a few questions after reviewing the issues presented in https://github.com/prometheus/prometheus/issues/5519 and elsewhere.

One of the areas where prometheus provides the most flexibility in loading config blocks from files is rule_files. It works the way that I think would be relevant for prometheus scrape_configs, as well as for all of the alertmanager areas that I discussed above, and it is also affected by the principle of least surprise, as indicated by prometheus issue #6334. If loading from files is undesirable and the principle of least surprise is important, then why was loading from files chosen here, with name conflicts treated the way they are?

Further, with prometheus or alertmanager forcing most areas to be configured through external configuration systems, not only do we run into the complexity and possible errors of building or injecting config blocks, we must also ensure that the individuals or teams providing configuration changes have access to the configuration management tool. This is easy in the case of prometheus-operator, but gets increasingly difficult as you move away from kubernetes.

Configuration flexibility should be an inherent piece of the tool, not something that has to be tacked on afterwards. Prometheus and alertmanager should be configurable via any manner of tools working together in any environment and not need workarounds by configuration management systems to accommodate its design.

What I, and it seems others in these tickets, am asking for is not a new approach to configuration, but rather an extension of the existing file loading process and logic found in prometheus rule_files into other areas of the alertmanager and prometheus configuration.