raft-tech / TANF-app

Repo for development of a new TANF Data Reporting System

Prometheus Persistent Storage #3244

Open elipe17 opened 1 month ago

elipe17 commented 1 month ago

Description: Prometheus servers generally use a local time series database (TSDB) to store metrics and series so they can be queried later. This is a problem because cloud.gov apps do not have persistent file systems: whenever an app is restaged or redeployed, the file system starts anew and Prometheus loses all of its state. This normally isn't an issue in traditional cloud deployments. However, cloud.gov regularly restages applications (at least once per week), so Prometheus will retain at most one week's worth of data. This is unacceptable, and the only remedy in cloud.gov is to introduce a remote read/write server that stores and persists Prometheus data to S3. Grafana Labs has already solved this problem with Mimir. Thus, we want to integrate Prometheus with Mimir to ensure the persistence of our time series data.
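As a rough illustration of the integration described above, the sketch below builds the `remote_write` fragment that Prometheus would need in order to forward samples to Mimir. The `mimir.apps.internal` hostname, port `8080`, and the `build_remote_write` helper are hypothetical placeholders and not taken from this repo; `/api/v1/push` is Mimir's remote-write push path. The fragment is printed as JSON, which is also valid YAML.

```python
import json

# Hypothetical internal route for the Mimir app; replace with the real one.
MIMIR_BASE_URL = "http://mimir.apps.internal:8080"


def build_remote_write(base_url: str) -> dict:
    """Return a Prometheus config fragment that forwards samples to Mimir."""
    return {
        "remote_write": [
            # Mimir accepts Prometheus remote-write requests on /api/v1/push.
            {"url": f"{base_url.rstrip('/')}/api/v1/push"}
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_remote_write(MIMIR_BASE_URL), indent=2))
```

The printed fragment would be merged into `prometheus.yml`. Authentication and tenant headers are left out here because nothing in this issue specifies them yet.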

Acceptance Criteria: Create a list of functional outcomes that must be achieved to complete this issue

Tasks: Create a list of granular, specific work items that must be completed to deliver the desired outcomes of this issue

Notes: Add additional useful information, such as related issues and functionality that isn't covered by this specific issue, and other considerations that will be helpful for anyone reading this

Supporting Documentation: Please include any relevant log snippets/files/screen shots

Open Questions: Please include any questions or decisions that must be made before beginning work or to confidently call this issue complete

elipe17 commented 3 weeks ago

@ADPennington, @lfrohlich, @vlasse86, and @andrew-jameson I've given my RAM estimates for Mimir below. Again, we can always start lower, check the impact, and scale from there. I used this resource to calculate the RAM estimate based on the Distributor, Ingester, and Alertmanager sections. The remaining sections don't have a strong effect for our purposes. Also note that I assumed 30,000 series total for these calculations. In a full-fledged deployment (monitoring all envs) I expect ~20,000 series; I inflated that to 30,000 series for headroom.

Total RAM with Ingester replication of 3: 80MB (Distributor) + 750MB (Ingester) + 10MB (Alerts) + overhead ~= 1GB RAM

Total RAM with no Ingester replication: 80MB (Distributor) + 250MB (Ingester) + 10MB (Alerts) + overhead ~= 512MB RAM
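The arithmetic behind the two totals can be sketched as follows. This is a minimal sketch rather than a sizing tool: the constant names and the `estimate_ram_mb` helper are made up for illustration, and it assumes ingester memory scales linearly with the replication factor, which matches the 250MB and 750MB figures above. Overhead is left out because no exact figure is given for it.

```python
# Per-component estimates from this comment, for ~30,000 series (in MB).
DISTRIBUTOR_MB = 80
INGESTER_MB_PER_REPLICA = 250  # assumed to scale linearly with replication
ALERTMANAGER_MB = 10


def estimate_ram_mb(replication_factor: int = 1) -> int:
    """Sum the component estimates, excluding overhead."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    ingester_mb = INGESTER_MB_PER_REPLICA * replication_factor
    return DISTRIBUTOR_MB + ingester_mb + ALERTMANAGER_MB


if __name__ == "__main__":
    print(estimate_ram_mb(3))  # 840, rounded up to ~1GB with overhead
    print(estimate_ram_mb(1))  # 340, rounded up to ~512MB with overhead
```

The gap between these sums and the rounded totals (roughly 170-185MB) is the headroom left for overhead.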

Feel free to ping me in Mattermost or on this thread if you have any questions.