Prometheus Persistent Storage

Description: Prometheus servers normally store metrics in a local time series database (TSDB) so they can be queried later. This is a problem on cloud.gov because apps there do not have persistent file systems: whenever an app is restaged or redeployed, the file system starts anew and Prometheus loses all of its state. Traditional cloud deployments do not usually have this problem, but cloud.gov restages applications regularly (at least once per week), so Prometheus will retain at most one week's worth of data. This is unacceptable, and the only remedy in cloud.gov is to introduce a remote read/write server that stores and persists Prometheus data to S3. Grafana Labs has already solved this problem with Mimir, so we want to integrate Prometheus with Mimir to ensure the persistence of our time series data.

Acceptance Criteria: Create a list of functional outcomes that must be achieved to complete this issue

[ ] Mimir deployed locally and in cloud.gov
[ ] Mimir integrated with cloud.gov S3 bucket
[ ] Prometheus configured to read and write data to Mimir
[ ] Testing Checklist has been run and all tests pass
[ ] README is updated, if necessary

Tasks: Create a list of granular, specific work items that must be completed to deliver the desired outcomes of this issue

[ ] Mimir added to local docker compose
[ ] Local Mimir configured to write data to localstack
[ ] Prometheus remote read/write config updated for local and deployed configs to send/receive data to/from Mimir
[ ] Mimir manifest added and deployed to cloud.gov
[ ] Prometheus redeployed with new config to connect to Mimir
[ ] PLG deploy script updated to deploy Mimir
[ ] Run Testing Checklist and confirm all tests pass
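
For the remote read/write task, here is a minimal sketch of what the Prometheus stanza could look like, rendered by a small Python helper so the same template serves both the local and deployed configs. The base URLs (`http://mimir:9009` and `http://mimir.apps.internal:8080`) and the helper name are hypothetical placeholders, not values from this repo. The paths assume Mimir's default push endpoint (`/api/v1/push`) and default Prometheus HTTP prefix (`/prometheus`).

```python
def render_remote_storage(mimir_base_url: str) -> str:
    """Render the remote_write/remote_read stanza of prometheus.yml.

    Assumes Mimir's default API paths; adjust if the deployment
    overrides the Prometheus HTTP prefix.
    """
    base = mimir_base_url.rstrip("/")
    return (
        "remote_write:\n"
        f"  - url: {base}/api/v1/push\n"
        "remote_read:\n"
        f"  - url: {base}/prometheus/api/v1/read\n"
        # Also query Mimir for recent samples, since the local TSDB
        # is wiped on every restage.
        "    read_recent: true\n"
    )


# Hypothetical base URLs, for illustration only.
LOCAL_MIMIR = "http://mimir:9009"
DEPLOYED_MIMIR = "http://mimir.apps.internal:8080"

if __name__ == "__main__":
    print(render_remote_storage(LOCAL_MIMIR))
    print(render_remote_storage(DEPLOYED_MIMIR))
```

If Mimir runs with multi-tenancy enabled, requests also need an `X-Scope-OrgID` header, which this sketch leaves out.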

Notes: Add additional useful information, such as related issues and functionality that isn't covered by this specific issue, and other considerations that will be helpful for anyone reading this

Note 1
Note 2
Note 3

Supporting Documentation: Please include any relevant log snippets/files/screen shots

Doc 1
Doc 2

Open Questions: Please include any questions or decisions that must be made before beginning work or to confidently call this issue complete

Open Question 1
Open Question 2

@ADPennington, @lfrohlich, @vlasse86, and @andrew-jameson I've given my RAM estimates for Mimir below. Again, we can always start lower, check the impact, and scale from there. I used this resource to calculate the RAM estimate based on the Distributor, Ingester, and Alertmanager sections. The remaining sections don't have a strong effect for our purposes. Note that I assumed 30,000 series total for these calculations. In a full-fledged deployment (monitoring all envs) I expect ~20,000 series; I inflated that to 30,000 series for headroom.

Total RAM with Ingester replication of 3: 80MB (Distributor) + 750MB (Ingester) + 10MB (Alertmanager) + overhead ~= 1GB RAM

Total RAM with no Ingester replication: 80MB (Distributor) + 250MB (Ingester) + 10MB (Alertmanager) + overhead ~= 512MB RAM
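
To make the arithmetic easy to re-run if the series count changes, here is a rough sketch of the sums above. It reads the 750MB Ingester figure as 250MB per replica times a replication factor of 3, and treats 1GB as 1024MB; both are my assumptions. The "headroom" number is simply what is left of the quota for overhead.

```python
# Per-component estimates from the comment above, in MB.
DISTRIBUTOR_MB = 80
INGESTER_MB_PER_REPLICA = 250  # assumption: 750MB = 250MB x 3 replicas
ALERTMANAGER_MB = 10


def component_subtotal_mb(replication_factor: int) -> int:
    """Sum the component estimates for a given Ingester replication factor."""
    return (
        DISTRIBUTOR_MB
        + INGESTER_MB_PER_REPLICA * replication_factor
        + ALERTMANAGER_MB
    )


# (label, replication factor, proposed memory quota in MB)
scenarios = (
    ("replication of 3", 3, 1024),
    ("no replication", 1, 512),
)

for label, replication_factor, quota_mb in scenarios:
    subtotal = component_subtotal_mb(replication_factor)
    headroom = quota_mb - subtotal
    print(f"{label}: {subtotal}MB components, {headroom}MB left for overhead")
```

The component subtotals come to 840MB and 340MB, which leaves 184MB and 172MB of overhead room within the 1GB and 512MB quotas respectively.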

Feel free to ping me in Mattermost or on this thread if you have any questions.

raft-tech / TANF-app

Prometheus Persistent Storage #3244