Closed mpchadwick closed 6 years ago
Is the hand-written architecture diagram still valid? Do you have any preference about a tool to "digitalize" it?
I want to figure out the architecture, but the hand-drawn diagram is not exactly readable to me...
The hand-written diagram looks to still be correct. Is there a preferred tool for making these diagrams? I'm all for a legible diagram of the internals.
You can try OmniGraffle (:
I used the same tool as Prometheus architecture document (https://github.com/prometheus/prometheus/tree/master/documentation/images) and tried to keep the same look and feel.
Few questions: How to name the boxes inside the Dispatcher? How to name the boxes after the Router? Should the arrow from Silencer to Silence Storage be reversed?
This looks great!
How to name the boxes inside the Dispatcher?
The dispatcher's main work is done within the loop inside run(). It is concerned with reading the incoming alerts (as you've indicated) and creating the aggregate groups (alerts that are grouped together as defined by group_by in the config). The aggregate group itself is a loop that executes a notification pipeline every group_interval. The first time an aggregate group is created, it waits group_wait before execution -- after that, it is always group_interval. So, I would probably say something like "group alerts" and try to indicate that each group executes a notification pipeline.
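As a rough illustration of the grouping and timing described above, here is a minimal Python sketch. It is not Alertmanager's Go code: the names group_key, group_alerts, and flush_offsets are hypothetical, alerts are reduced to plain label dictionaries, and the timing assumes nothing else resets the group's timer.

```python
from datetime import timedelta


def group_key(labels, group_by):
    """Build a hashable key from the label values named in group_by."""
    return tuple((name, labels.get(name, "")) for name in sorted(group_by))


def group_alerts(alerts, group_by):
    """Bucket alerts (label dicts) into aggregate groups by their group key."""
    groups = {}
    for labels in alerts:
        groups.setdefault(group_key(labels, group_by), []).append(labels)
    return groups


def flush_offsets(group_wait, group_interval, count):
    """Offsets from group creation of the first `count` pipeline executions.

    The first execution happens after group_wait; every later one follows
    the previous execution by group_interval.
    """
    return [group_wait + i * group_interval for i in range(count)]


alerts = [
    {"alertname": "HighLatency", "cluster": "eu", "instance": "a"},
    {"alertname": "HighLatency", "cluster": "eu", "instance": "b"},
    {"alertname": "DiskFull", "cluster": "us", "instance": "c"},
]
groups = group_alerts(alerts, ["alertname", "cluster"])
offsets = flush_offsets(timedelta(seconds=30), timedelta(minutes=5), 3)
```

With these example values, the two HighLatency alerts land in one group and DiskFull in another, and each group would run its pipeline 30 seconds after creation and every 5 minutes after that.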
How to name the boxes after the Router?
The Router is actually the Fanout stage, where the set of alerts is dispatched concurrently to a pipeline for every notification type for that receiver. After that come the WaitStage, where a timer is set based on the server's position in the HA mesh (--cluster.peer-timeout * position), then Dedup, Retry, and SetNotifies. SetNotifies sends a message to peers on the mesh saying what alerts it has sent out. So there should be an arrow from SetNotifies to the NotifyProvider, and the DedupStage queries the NotifyProvider. It's hard to convey this in a single diagram since the DedupStage is querying the NotifyProvider for state that the NotifyProvider received from other alertmanagers :)
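To make the wait and dedup steps concrete, here is a small Python sketch under stated assumptions. The function names, the dictionary standing in for the NotifyProvider, and the (group, receiver) key are all hypothetical, and the real DedupStage considers more than this (for example resolved alerts and repeat intervals).

```python
from datetime import timedelta


def wait_delay(peer_timeout, position):
    """Delay before notifying: the peer timeout times the mesh position."""
    return peer_timeout * position


def needs_notification(notify_log, key, firing):
    """True if any firing alert is missing from the last recorded send.

    notify_log stands in for the NotifyProvider: it maps a key to the
    frozenset of alerts some instance has already notified for.
    """
    already_sent = notify_log.get(key, frozenset())
    return not (set(firing) <= already_sent)


# Position 0 notifies immediately; later positions wait longer.
delays = [wait_delay(timedelta(seconds=15), position) for position in range(3)]

notify_log = {}
key = ("cluster=eu", "team-email")
first = needs_notification(notify_log, key, ["alert-1"])
notify_log[key] = frozenset(["alert-1"])  # a peer reports it has sent alert-1
second = needs_notification(notify_log, key, ["alert-1"])
third = needs_notification(notify_log, key, ["alert-1", "alert-2"])
```

The point of the staggered delay is that an instance further back in the mesh has time to learn, through the shared log, that an earlier instance already sent the notification.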
Should the arrow from Silencer to Silence Storage be reversed?
The SilenceStage queries the SilenceProvider for data, but it doesn't actually change any data in the provider. I think the arrow should stay how it is.
There's one stage in the pipeline that's missing, the GossipSettleStage. It waits for the gossip's initial settling period, so that a rebooted alertmanager can receive recent notification messages from its peers before potentially firing duplicate alerts.
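The settling behaviour can be sketched as a stage that blocks until a "settled" signal arrives. This is only an illustration in Python: gossip_settle_stage and its parameters are hypothetical names, and a threading.Event plus a short timer stand in for whatever readiness signal the cluster layer actually provides.

```python
import threading


def gossip_settle_stage(settled, next_stage, alerts, timeout=None):
    """Block until `settled` is set (or `timeout` seconds pass), then continue."""
    settled.wait(timeout)
    return next_stage(alerts)


settled = threading.Event()
# Simulate gossip settling shortly after startup.
threading.Timer(0.05, settled.set).start()
result = gossip_settle_stage(settled, lambda alerts: list(alerts), ["HighLatency"])
```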
Thanks for the feedback @stuartnelson3. Below is the updated diagram. I switched the orientation to make it align better and to leave more space to grow the stages and add the clustering. Happy to adapt as needed; I followed the code but didn't delve too deep.
Looks good! The only thing is, the individual notification endpoints should receive their message during the Retry stage, and after a successful send, then the SetNotifies stage fires.
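The ordering described above (send during Retry, record only after success) could be sketched like this in Python. The function name, the fixed attempt count, and the use of ConnectionError are assumptions for illustration. The sketch makes no claim about how the real Retry stage decides when to stop or how long to wait between attempts.

```python
def retry_then_set_notifies(send, notify_log, key, alerts, max_attempts=3):
    """Try to deliver; record the send in notify_log only after it succeeds."""
    for _ in range(max_attempts):
        try:
            send(alerts)  # Retry stage: the endpoint receives the message here
        except ConnectionError:
            continue  # delivery failed, try again
        notify_log[key] = frozenset(alerts)  # SetNotifies: fires after success
        return True
    return False


attempts = []


def flaky_send(alerts):
    """Hypothetical endpoint that fails twice before accepting the message."""
    attempts.append(list(alerts))
    if len(attempts) < 3:
        raise ConnectionError("endpoint unavailable")


def failing_send(alerts):
    """Hypothetical endpoint that never accepts the message."""
    raise ConnectionError("endpoint unavailable")


log = {}
delivered = retry_then_set_notifies(flaky_send, log, "group-a", ["alert-1"])
failed_log = {}
not_delivered = retry_then_set_notifies(failing_send, failed_log, "group-a", ["alert-1"])
```

If every attempt fails, nothing is written to the log, so peers are never told about a notification that did not go out.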
Except for that, does this look understandable to people? @mxinden @brancz @brian-brazil @fabxc @grobie @beorn7 ?
I'd suggest making it clearer that the groups are groups.
@sgissi Great job! I like the clean structure.
How about instead of Alert Generators simply Prometheus? That should cover most Alertmanager use-cases.
How about instead of Alert Generators simply Prometheus? That should cover most Alertmanager use-cases.
I know the interface is generic but I’m unaware of other tools using it. I’m Ok to change that. Will send a revised diagram today also to make groups more explicit somehow.
Ok, added below (didn't align all the details yet obviously). I kept Alert Generators and subtitled Prometheus just being a purist :) I can drop it and keep only Prometheus if that is the consensus.
I changed the dispatcher to show the Aggregated Groups instead, hope it is clearer. Otherwise the only way to convey the actual flow is to add a "fan out" stage inside the dispatcher towards the groups and each group flowing independently to the first pipeline stage. Hard to put all that in a small space... I'll give it a try tomorrow moving the "flush periodically" comment around and making the dispatch box taller.
I will also swap the order of the actual notification and the SetNotifies stage; the current order indeed doesn't match what really happens.
Ok, next iteration:
Fewer straight lines than I would like, but more complete.
@mxinden I believe this issue can be closed as well from #1394.
@sgissi Thanks a lot for all your work. Especially bearing with us over all the feedback iterations. Highly appreciated!
I will close here. @mpchadwick and @community let us know if you have further comments.
The hand drawn one in the README doesn't exactly look great...