grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Docs: improve Planning capacity page #1469

Open KMiller-Grafana opened 2 years ago

KMiller-Grafana commented 2 years ago

1. [screenshot] See the link under the heading "Monolithic mode?" It links to the very next paragraph/section, which is unhelpful for any reader who clicks it, since it only jumps to the next sentence. Just remove it.
2. Rename this section from "Planning capacity" to something more like "Estimating resource usage." The information under the headings "Monolithic mode" and "Microservices mode" doesn't help with planning capacity. It does help a user estimate resource usage.
3. Consider changing "utilization" to "usage."

pracucci commented 2 years ago

We should also recommend using fast disks for ingesters and store-gateways (see https://github.com/grafana/mimir/issues/1722#issuecomment-1112789110).

09jvilla commented 2 years ago

Maybe this will just be taken care of in https://github.com/grafana/mimir/issues/1988, but I was recently looking at the capacity planning page and was a bit confused when I read:


CPU: 1 core for every 300,000 series in memory
Memory: 2.5GB for every 300,000 series in memory
Disk space: 5GB for every 300,000 series in memory

Is the idea that I calculate the total number of active series in my cluster and then figure out the CPU, memory, and disk space requirements for all ingesters in the whole cluster? How do I figure out how many ingesters I need and what resources to allocate to each individual ingester? Do I arbitrarily pick a number of ingesters and then just divide the total resource requirements by the number of ingesters?
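
For illustration only, here is a minimal Python sketch of the reading described in the question: apply the quoted ratios to a cluster-wide count of series in memory, then divide the totals evenly across a chosen number of ingesters. The function names, the even split, and the example inputs (3,000,000 series, 5 ingesters) are assumptions made for this sketch; the thread does not confirm that this is how the page is meant to be used, and the ratios leave scrape interval and retention unstated.

```python
# Ratios quoted from the capacity planning page, per 300,000 series in memory.
SERIES_PER_UNIT = 300_000
CPU_CORES_PER_UNIT = 1.0
MEMORY_GB_PER_UNIT = 2.5
DISK_GB_PER_UNIT = 5.0


def estimate_ingester_totals(series_in_memory):
    """Cluster-wide ingester estimate from the quoted ratios (hypothetical helper)."""
    units = series_in_memory / SERIES_PER_UNIT
    return {
        "cpu_cores": units * CPU_CORES_PER_UNIT,
        "memory_gb": units * MEMORY_GB_PER_UNIT,
        "disk_gb": units * DISK_GB_PER_UNIT,
    }


def split_per_ingester(totals, ingester_count):
    """Divide the totals evenly across ingesters (an assumption, not documented guidance)."""
    if ingester_count <= 0:
        raise ValueError("ingester_count must be positive")
    return {name: value / ingester_count for name, value in totals.items()}


if __name__ == "__main__":
    totals = estimate_ingester_totals(3_000_000)
    print("cluster-wide:", totals)
    print("per ingester:", split_per_ingester(totals, 5))
```

The sketch only makes the arithmetic explicit. It does not answer how many ingesters to run, which is the open question here.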

09jvilla commented 2 years ago

For the ingesters specifically, is the disk space requirement at all impacted by how many hours of data I want to retain on disk?

Logiraptor commented 2 years ago

I wonder if ingester disk usage would be better estimated as a function of DPM rather than active series.

In any case, I think the ingester sizing that @09jvilla points out is using some unstated assumptions about the scrape interval and retention period.

pracucci commented 2 years ago

The capacity planning doc was initially conceived as a simplification, with a single metric per component to use for scaling (for ingesters I picked active series). I understand it was an oversimplification and it's showing its limits. My feeling is that documenting all the proper math would make it quite complicated for the user, which is why I would move forward by replacing it with a tool in which we encapsulate all our logic.

> I wonder if ingester disk usage would be better estimated as a function of DPM rather than active series.

Yes, it would.

osg-grafana commented 2 years ago

Estimated high due to unactionable state of doc issue and necessary research if implemented.

mac133k commented 2 years ago

The guidelines for Alertmanager seem too low:

Perhaps it was meant to say '100 firing alerts per second'? It does not seem right for a single alert to consume 10MB of RAM.

pracucci commented 2 years ago

> The guidelines for Alertmanager seem too low:

@mac133k You're right. See my PR to update it: https://github.com/grafana/mimir/pull/3132