Open elipe17 opened 1 month ago
@ADPennington, @lfrohlich, @vlasse86, and @andrew-jameson I've given my RAM estimates for Mimir below. Again, we can always start lower, check the impact, and scale from there. I used this resource to calculate the RAM estimate based on the Distributor, Ingester, and Alertmanager sections. The remaining sections don't have a strong effect for our purposes. Also note that I assumed 30,000 series total for these calculations. In a full-fledged deployment (monitoring all envs) I expect ~20,000 series; I inflated that to 30,000 series for headroom.
Total RAM with Ingester replication of 3: 80MB (Distributor) + 750MB (Ingester) + 10MB (Alertmanager) + overhead ~= 1GB RAM
Total RAM with no Ingester replication: 80MB (Distributor) + 250MB (Ingester) + 10MB (Alertmanager) + overhead ~= 512MB RAM
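For anyone who wants to rerun the arithmetic with a different replication factor, here is a rough sketch of the sums above. The per-component figures are the ones quoted in this comment; the function name is made up for illustration, and the linear scaling of Ingester memory with replication is an assumption inferred from 750MB = 3 x 250MB. Overhead is left out because the comment does not put a number on it.

```python
# Per-component RAM figures in MB, copied from the estimate above
# (all assume 30,000 series total).
DISTRIBUTOR_MB = 80
INGESTER_MB_PER_REPLICA = 250
ALERTMANAGER_MB = 10


def estimate_ram_mb(replication_factor: int = 1) -> int:
    """Return the summed component estimate in MB, before overhead.

    Assumption: Ingester memory grows linearly with the replication
    factor, which matches the 250MB vs. 750MB figures in the comment.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    ingester_mb = INGESTER_MB_PER_REPLICA * replication_factor
    return DISTRIBUTOR_MB + ingester_mb + ALERTMANAGER_MB


if __name__ == "__main__":
    for rf in (3, 1):
        print(f"replication={rf}: {estimate_ram_mb(rf)} MB before overhead")
```

The subtotals come to 840MB and 340MB, so the ~1GB and ~512MB figures leave some room for overhead on top of the component sums.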
Feel free to ping me in Mattermost or on this thread if you have any questions.
Description: Prometheus servers generally use a local time series database (tsdb) to store metrics and series so they can be queried later. This is a problem for us because cloud.gov apps do not have persistent file systems. Whenever an app is restaged or redeployed, the file system starts anew, so Prometheus loses all of its state. Normally this isn't an issue with traditional cloud deployments. However, cloud.gov regularly restages applications (at least once per week), meaning Prometheus will have at most one week's worth of data. This is unacceptable, and the only remedy in cloud.gov is to introduce a remote read/write server that stores and persists Prometheus data to S3. Grafana Labs has already solved this problem with Mimir. Thus, we want to integrate Prometheus with Mimir to ensure the persistence of our time series data.
Acceptance Criteria: Create a list of functional outcomes that must be achieved to complete this issue
Tasks: Create a list of granular, specific work items that must be completed to deliver the desired outcomes of this issue
Notes: Add additional useful information, such as related issues and functionality that isn't covered by this specific issue, and other considerations that will be helpful for anyone reading this
Supporting Documentation: Please include any relevant log snippets/files/screen shots
Open Questions: Please include any questions or decisions that must be made before beginning work or to confidently call this issue complete