chrisbecke opened this issue 2 years ago.
Thanks for this proposal. Here are some thoughts:
In terms of storage, each Prometheus instance manages its own storage. Once federated, the stack-local instances only need enough retention for their own private rules, if they have any. If the main Prometheus instance (which serves a main Grafana instance for visualization and Grafana-based alerting) is the only important data store, you can give the stack-local instances very short retention periods and not mount a volume for their database at all, so the data is lost if/when the stack-local Prometheus is restarted.
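A stack-local instance configured along those lines might look roughly like the sketch below. The `--storage.tsdb.retention.time` and `--config.file` flags are standard Prometheus flags; the service name and the `2h` value are arbitrary examples, and delivering the config file into the container is left out.

```yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus
    command:
      # Overriding the command drops the image defaults, so restate them.
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      # Keep only a short local window; long-term data lives in the main instance.
      - "--storage.tsdb.retention.time=2h"
    # No named volume is mounted for /prometheus, so a replaced task starts
    # with an empty TSDB instead of reusing the old data.
```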
The `params: match[]:` setting specifies which series (selected by label matchers) the main Prometheus instance pulls from each `/federate` endpoint. In a bigger setup we may need to agree on a convention for filtering which metrics the main Prometheus scrapes. If the stack-local Prometheus instances also collect their own node-exporter metrics etc., there could be a massive bloom of metrics; in that case, only federating metrics that are explicitly labeled for scraping would be necessary.
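One possible opt-in convention is sketched below, assuming a hypothetical `federate="true"` label that the stack-local instance attaches to the jobs it wants exported. The label name, job names and targets are all made up for illustration. The matcher has to target a label stored on the series (here a target label set in `static_configs`), since series selectors are evaluated against the stored data.

```yaml
# Stack-local prometheus.yml (hypothetical names throughout).
scrape_configs:
  - job_name: "my-service"
    static_configs:
      - targets: ["my-service:8080"]
        labels:
          federate: "true"        # opt this job's series into federation
  - job_name: "node-exporter"     # collected locally, never federated
    static_configs:
      - targets: ["node-exporter:9100"]
---
# Main prometheus.yml: the federation job only requests opted-in series.
# (Target discovery is omitted here.)
scrape_configs:
  - job_name: "federate-stack-local"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{federate="true"}'
```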
Note: If we do this, we should do the implementation over at https://github.com/neuroforgede/swarmsible-stacks
Description
As a consumer of a Swarm, I want to deploy a stack that contains its own Prometheus instance. This Prometheus instance already knows how to scrape all the services in the stack. However, all of its metrics also need to be scraped by the swarm's main Prometheus instance so that they arrive in the central Grafana dashboard.
Proposal
The main Prometheus instance can contain a federation job that scrapes the `/federate` endpoint of every stack-local Prometheus instance.
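Such a job could look roughly like the following sketch. It assumes that every stack-local instance is reachable on the shared network under the DNS alias `scrape.target` (the name probed in the Result section), that each listens on the default port 9090, and that the main Prometheus is attached to the same network; the job name and the catch-all `match[]` selector are illustrative only.

```yaml
scrape_configs:
  - job_name: "federate-stack-local"
    honor_labels: true            # keep job/instance labels from the child instances
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~".+"}'           # everything; narrow this down in larger setups
    dns_sd_configs:
      - names:
          - "scrape.target"       # re-resolved periodically; one target per A record
        type: "A"
        port: 9090
        refresh_interval: 30s
```

`honor_labels: true` is the usual choice for federation, so that the original `job` and `instance` labels survive instead of being overwritten by the federating job.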
This has two additional requirements: a common Prometheus network and a convention-based naming scheme. Each child Prometheus instance must join the common Prometheus network and declare an alias on it that lets the main Prometheus instance discover it.
With that convention in place, a stack-local Prometheus instance only needs a minimal declaration (joining the common network under the agreed alias) to become eligible for scraping.
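A minimal stack-file declaration might look like this sketch. It assumes the common network is an overlay network named `prometheus`, created once outside the stacks (e.g. `docker network create --driver overlay prometheus`), and reuses `scrape.target` as the shared alias; both names are assumptions, and it relies on Docker's embedded DNS returning one address per service that shares the alias.

```yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus
    networks:
      default: {}                 # the stack's own network, for scraping its services
      prometheus:
        aliases:
          - scrape.target         # same alias in every stack; resolved by the main instance

networks:
  prometheus:
    external: true                # shared network, not managed by this stack
```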
Result
By probing `scrape.target` via `dns_sd_configs`, the main instance gets a dynamic list of the IPs of all active stack-local Prometheus instances, and grabs all their metrics via the `/federate` endpoint.