prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Update the Architecture Diagram #331

Closed mpchadwick closed 6 years ago

mpchadwick commented 8 years ago

The hand drawn one in the README doesn't exactly look great...

pracucci commented 7 years ago

Is the hand-written architecture diagram still valid? Do you have any preference about a tool to "digitalize" it?

giant-panda666 commented 6 years ago

I want to figure out the architecture, but the hand-drawn diagram is not exactly readable to me...

stuartnelson3 commented 6 years ago

The hand-written diagram looks to still be correct. Is there a preferred tool for making these diagrams? I'm all for a legible diagram of the internals.

giant-panda666 commented 6 years ago

You can try OmniGraffle (:

sgissi commented 6 years ago

I used the same tool as Prometheus architecture document (https://github.com/prometheus/prometheus/tree/master/documentation/images) and tried to keep the same look and feel.

[Image: am architecture]

A few questions: How to name the boxes inside the Dispatcher? How to name the boxes after the Router? Should the arrow from Silencer to Silence Storage be reversed?

stuartnelson3 commented 6 years ago

This looks great!

> How to name the boxes inside the Dispatcher?

The dispatcher's main work is done within the loop inside run(). It is concerned with reading the incoming alerts (as you've indicated) and creating the aggregate groups (alerts that are grouped together as defined by group_by in the config). The aggregate group itself is a loop that executes a notification pipeline every group_interval. The first time an aggregate group is created, it waits group_wait before execution -- after that, it is always group_interval. So, I would probably say something like "group alerts" and try to indicate that each group executes a notification pipeline.
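The grouping and timing described above can be sketched roughly as follows. This is a minimal illustration, not Alertmanager's actual code: `groupKey` and `flushOffsets` are hypothetical helpers, and the key format is made up for readability. It only assumes what the comment states, namely that alerts sharing the `group_by` label values form one aggregate group, and that a group first flushes after `group_wait` and then every `group_interval`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// groupKey builds a key from the values of the group_by labels, so that
// alerts sharing those values land in the same aggregate group.
// Hypothetical helper: the key format is illustrative only.
func groupKey(labels map[string]string, groupBy []string) string {
	names := append([]string(nil), groupBy...) // copy so the caller's slice is not reordered
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, n+"="+labels[n])
	}
	return strings.Join(parts, ",")
}

// flushOffsets returns when the first n flushes of an aggregate group
// happen, measured from the moment the group is created: the first one
// after groupWait, every later one groupInterval after the previous.
func flushOffsets(groupWait, groupInterval time.Duration, n int) []time.Duration {
	out := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, groupWait+time.Duration(i)*groupInterval)
	}
	return out
}

func main() {
	a := map[string]string{"alertname": "HighLatency", "cluster": "eu", "instance": "a"}
	b := map[string]string{"alertname": "HighLatency", "cluster": "eu", "instance": "b"}
	by := []string{"alertname", "cluster"}

	// Both alerts differ only in "instance", which is not a group_by label.
	fmt.Println("same group:", groupKey(a, by) == groupKey(b, by))
	fmt.Println("flush offsets:", flushOffsets(30*time.Second, 5*time.Minute, 3))
}
```

Each aggregate group would run its own loop on this schedule and hand its current alerts to the notification pipeline on every flush.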

> How to name the boxes after the Router?

The Router is actually the Fanout stage, where the set of alerts is dispatched concurrently to a pipeline for every notification type for that receiver. Then there's the WaitStage, where a timer is set based on the server's position in the HA mesh (--cluster.peer-timeout * position), followed by Dedup, Retry, and SetNotifies. SetNotifies sends a message to peers on the mesh saying what alerts it has sent out. So, there should be an arrow from SetNotifies to the NotifyProvider, and the DedupStage queries the NotifyProvider. It's hard to convey this in a single diagram since the DedupStage is querying the NotifyProvider for state that the NotifyProvider received from other alertmanagers :)
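To make the interplay of Wait, Dedup, and SetNotifies a bit more concrete, here is a minimal sketch under stated assumptions. `waitFor` and `notifyLog` are hypothetical names, the map stands in for the NotifyProvider, and gossip replication between peers is assumed to be instantaneous, which it is not in practice.

```go
package main

import (
	"fmt"
	"time"
)

// waitFor mirrors the rule quoted above: the peer timeout multiplied by
// the peer's position in the mesh. The peer at position 0 does not wait.
func waitFor(peerTimeout time.Duration, position int) time.Duration {
	return time.Duration(position) * peerTimeout
}

// notifyLog is a hypothetical stand-in for the NotifyProvider. It records
// which group/receiver keys have already been notified.
type notifyLog map[string]bool

// needsSend plays the role of the Dedup stage: it only reads the log.
func (l notifyLog) needsSend(key string) bool { return !l[key] }

// markSent plays the role of SetNotifies: it writes to the log.
func (l notifyLog) markSent(key string) { l[key] = true }

func main() {
	// One shared log simulates state that peers would exchange via gossip.
	shared := notifyLog{}
	const key = "team-x/pager" // illustrative group/receiver key

	for pos := 0; pos < 3; pos++ {
		wait := waitFor(15*time.Second, pos)
		send := shared.needsSend(key)
		if send {
			shared.markSent(key)
		}
		fmt.Printf("peer %d: wait %v, send %v\n", pos, wait, send)
	}
}
```

The staggered wait gives earlier peers time to send and record the notification, so later peers find the entry during Dedup and skip the send.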

> Should the arrow from Silencer to Silence Storage be reversed?

The SilenceStage queries the SilenceProvider for data, but it doesn't actually change any data in the provider. I think the arrow should stay how it is.

There's one stage in the pipeline that's missing, the GossipSettleStage. It waits for the gossip's initial settling period, so that a rebooted alertmanager can receive recent notification messages from its peers before potentially firing duplicate alerts.

sgissi commented 6 years ago

Thanks for the feedback @stuartnelson3. Below is the updated diagram. I switched the orientation to make it align better and to leave more space to grow the stages and add the clustering. Happy to adapt as needed; I followed the code but didn't delve too deep.

[Image: am architecture updated]

stuartnelson3 commented 6 years ago

Looks good! The only thing is, the individual notification endpoints should receive their message during the Retry stage, and after a successful send, then the SetNotifies stage fires.
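The ordering described here (deliver during Retry, record only after success) might look like the following sketch. `sendWithRetry` is a hypothetical function; the backoff between attempts that a real retry loop would use is left out to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// errTemporary is an illustrative error for a failed delivery attempt.
var errTemporary = errors.New("temporary failure")

// sendWithRetry calls send until it succeeds or maxAttempts is reached.
// recordSent (the SetNotifies step) runs only after a successful send,
// so a failed delivery is never recorded as sent.
func sendWithRetry(send func() error, recordSent func(), maxAttempts int) error {
	err := errors.New("no attempts made")
	for i := 0; i < maxAttempts; i++ {
		if err = send(); err == nil {
			recordSent()
			return nil
		}
	}
	return err
}

func main() {
	attempts := 0
	send := func() error {
		attempts++
		if attempts < 3 {
			return errTemporary // first two attempts fail
		}
		return nil
	}
	recorded := false

	err := sendWithRetry(send, func() { recorded = true }, 5)
	fmt.Println("attempts:", attempts, "recorded:", recorded, "err:", err)
}
```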

Except for that, does this look understandable to people? @mxinden @brancz @brian-brazil @fabxc @grobie @beorn7 ?

brian-brazil commented 6 years ago

I'd suggest making it clearer that the groups are groups.

mxinden commented 6 years ago

@sgissi Great job! I like the clean structure.

How about instead of Alert Generators simply Prometheus? That should cover most Alertmanager use-cases.

sgissi commented 6 years ago

> How about instead of Alert Generators simply Prometheus? That should cover most Alertmanager use-cases.

I know the interface is generic but I’m unaware of other tools using it. I’m Ok to change that. Will send a revised diagram today also to make groups more explicit somehow.

sgissi commented 6 years ago

Ok, added below (didn't align all the details yet obviously). I kept Alert Generators and subtitled Prometheus just being a purist :) I can drop it and keep only Prometheus if that is the consensus.

I changed the dispatcher to show the Aggregated Groups instead, hope it is clearer. Otherwise the only way to convey the actual flow is to add a "fan out" stage inside the dispatcher towards the groups and each group flowing independently to the first pipeline stage. Hard to put all that in a small space... I'll give it a try tomorrow moving the "flush periodically" comment around and making the dispatch box taller.

Will also swap the order of the actual notification and the Set Notifies stage; the current order indeed doesn't match what really happens.

[Image: am architecture]

sgissi commented 6 years ago

Ok, next iteration:

Fewer straight lines than I would like, but more complete.

[Image: am architecture]

sgissi commented 6 years ago

@mxinden I believe this issue can be closed as well from #1394.

mxinden commented 6 years ago

@sgissi Thanks a lot for all your work, and especially for bearing with us over all the feedback iterations. Highly appreciated!

I will close here. @mpchadwick and @community let us know if you have further comments.