grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Docs: interactive capacity planning tool #1988

Open pracucci opened 2 years ago

pracucci commented 2 years ago

We're hearing feedback from the OSS community (e.g. this Slack thread) that the capacity planning doc apparently shows more resources than are probably required. I think one reason is that proper capacity planning involves multiple factors, while the doc is an oversimplification.

We could provide an interactive capacity planning tool that, given some inputs (e.g. active series, samples/sec, queries/sec, retention, ...), computes a more accurate capacity plan.

One option could be to build a Google Spreadsheet and embed it in the doc.

replay commented 2 years ago

I think that a spreadsheet seems like the easiest solution. OTOH, if we created a CLI tool to do the capacity planning, then all Mimir contributors could contribute to it and possibly improve it based on their own experience. A spreadsheet would likely have to be restricted in some way, because otherwise there would be no review process for changes to it.

rojas-diego commented 2 years ago

Let me know if you're looking for contributions on this!

pracucci commented 2 years ago

a spreadsheet would likely have to be restricted in some way because otherwise there is no review process for changes to the spreadsheet

That's right. And it's also more difficult to test. Writing unit tests in golang is way easier.

if we'd create a CLI tool

In this case we wouldn't have to create a new tool. We already have mimirtool: we could just add a command there.

Let me know if you're looking for contributions on this!

We do! Let's just reach a consensus on how it should work (e.g. spreadsheet vs CLI tool). Let me ping the rest of the Mimir maintainers / squad to get a quick feedback loop.

Logiraptor commented 2 years ago

I would vote for a CLI tool - because it would allow reviews, change history, etc in a more familiar format for Mimir contributors. Having the logic implemented as go code also makes it easier to eventually extend into more sophisticated use-cases (far in the future) like an auto-scaling operator, generating helm values file automatically, etc. We can start to get feedback on the formulas used and build on that knowledge later.

For now, something simple and straightforward, like a new command in mimirtool with simple text output, seems like the best place to start to me.

osg-grafana commented 2 years ago

cc @osg-grafana

pstibrany commented 2 years ago

What would a CLI tool look like? I have a hard time imagining a command-line interface that would beat a spreadsheet (or a simple webpage with some JavaScript to do the calculation) in terms of ease of use.

pracucci commented 2 years ago

What would a CLI tool look like?

Something like:

mimirtool capacity-planning --active-series=100000 --samples-per-second=15000 --queries-per-second=100

I have a hard time imagining a command-line interface that would beat a spreadsheet (or a simple webpage with some JavaScript to do the calculation) in terms of ease of use.

From an ease of use perspective, I agree a web UI would be easier to use. On the other hand, collaborating on a web UI may be more complicated (e.g. no code reviews and no external contributors on a spreadsheet, not much JS experience, not even enough tooling like unit tests, ...).

Given that we publish the mimirtool binary for multiple platforms, and assuming that you can run a CLI tool if you want to operate Mimir, I don't see mimirtool as significant friction.
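To make the CLI idea concrete, here is a minimal sketch of what the core computation behind such a subcommand could look like in Go. The constants and the formula are purely illustrative assumptions (the real numbers would come from measurements, not from this sketch), and `ingesterReplicas` is a hypothetical helper, not an existing mimirtool function:

```go
package main

import "fmt"

// Hypothetical per-component scaling constants; illustrative values only,
// not Mimir's actual measured numbers.
const (
	seriesPerIngester = 1_500_000 // active series one ingester replica handles
	replicationFactor = 3         // each series is written to 3 ingesters
)

// ingesterReplicas estimates how many ingester replicas are needed for the
// given number of active series, accounting for replication.
func ingesterReplicas(activeSeries int) int {
	total := activeSeries * replicationFactor
	return (total + seriesPerIngester - 1) / seriesPerIngester // ceiling division
}

func main() {
	// e.g. what --active-series=10000000 could feed into the planner.
	fmt.Printf("ingesters: %d\n", ingesterReplicas(10_000_000))
}
```

The flag parsing and the remaining components (distributors, queriers, ...) would follow the same pattern: one small, unit-testable function per formula.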

pstibrany commented 2 years ago

On the other hand, collaborating on a web UI may be more complicated (e.g. no code reviews and no external contributors on a spreadsheet, not much JS experience, not even enough tooling like unit tests, ...).

I don't think a single HTML page with some JavaScript would be too difficult to review and collaborate on, but you're right that we don't have the tooling for it prepared. (Maybe writing it in Go and compiling to WebAssembly would work just fine? đŸ˜„ I have zero experience with that, though.)

Your example isn't too bad just yet, but it gets more complex with more parameters very quickly.

replay commented 2 years ago

Your example isn't too bad just yet, but it gets more complex with more parameters very quickly.

If it gets too complex, we could consider providing the tool with a configuration file, where the configuration file defines all the relevant parameters. Then we could deliver the tool together with an example configuration file, so a user could just copy the example configuration file and adjust all the defined parameters there. I think this would be easier to use than looking up lots of CLI args from --help and adding them to the CLI command.

pstibrany commented 2 years ago

Then we could deliver the tool together with an example configuration file, so a user could just copy the example configuration file and adjust all the defined parameters there.

/half-joke: We can distribute jsonnet file with example values and all the math, and let people edit and render that :)

replay commented 2 years ago

We can distribute jsonnet file with example values and all the math, and let people edit and render that :)

Nice idea, but I kind of suspect that most users will stick to Helm and won't know how to use jsonnet.

pracucci commented 2 years ago

One of the requirements is that we need to use a language for which it's not complicated to write unit tests. I think jsonnet doesn't fit it.

pstibrany commented 2 years ago

One of the requirements is that we need to use a language for which it's not complicated to write unit tests.

I don't see a big benefit of unit-testability in this specific case, given that the feature is basically a set of formulas that show some numbers to the user.

As a user of this feature, I want to:

I see these needs covered better by tools like Google Sheets or jsonnet than by a tool with hard-coded formulas in it.

If we wanted to go the jsonnet route, we could embed the jsonnet interpreter library into mimirtool capacity plan and not require it as a separate dependency. We could even parse the jsonnet output and pretty-print it nicely. I suggested jsonnet as a joke, but I don't think it's such a terrible idea.

And we have plenty of tests for our jsonnet config in the Mimir repo already.

pracucci commented 2 years ago

My idea is to build two tools:

  1. A command which runs a bunch of queries against a running Mimir cluster's metrics and generates a file containing "constants" (e.g. 1 core for every 1M series per ingester, etc...). This tool could be run to extract intelligence from all Mimir clusters running at Grafana Cloud and share it with the rest of the world, committed to the Mimir repo.
  2. A capacity planning command in mimirtool, taking as input your estimated usage (e.g. active series, samples per second, queries per second, retention, ...). It reads the constants file and computes the capacity plan based on that.
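A minimal sketch of how step 2 could consume the constants produced by step 1. The `Constants` fields, their values, and the `plan` function are all hypothetical, chosen only to illustrate the separation between measured constants and user inputs:

```go
package main

import (
	"fmt"
	"math"
)

// Constants would be generated by step 1 (measured from running clusters)
// and committed to the repo; these field names and values are illustrative.
type Constants struct {
	CoresPerMillionSeriesPerIngester float64
	GiBPerMillionSeriesPerIngester   float64
}

// Estimate is the capacity plan step 2 would print for the user.
type Estimate struct {
	IngesterCores  float64
	IngesterMemGiB float64
}

// plan combines the user's estimated usage with the measured constants.
func plan(activeSeriesMillions float64, c Constants) Estimate {
	return Estimate{
		IngesterCores:  math.Ceil(activeSeriesMillions * c.CoresPerMillionSeriesPerIngester),
		IngesterMemGiB: math.Ceil(activeSeriesMillions * c.GiBPerMillionSeriesPerIngester),
	}
}

func main() {
	c := Constants{CoresPerMillionSeriesPerIngester: 1, GiBPerMillionSeriesPerIngester: 2.5}
	e := plan(10, c) // 10M active series
	fmt.Printf("ingesters: %.0f cores, %.0f GiB\n", e.IngesterCores, e.IngesterMemGiB)
}
```

Keeping the constants in a data file means step 1 can be re-run and the numbers refreshed without touching the planning code.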
wilfriedroset commented 2 years ago

What would be the output of mimirtool? I reckon that cores/memory/disk per Mimir module should be enough. With that, users should have enough information to decide how many pods/instances to deploy per module. It also factors in the fact that Mimir can be deployed on bare metal as well.

I've been working on a similar tool which addresses this question the other way around. The input is the flavor/count of instances per module, with 3 additional factors:

Here is an example of the output

{
    "performance": {
        "write path": {
            "distributor samples/sec": 120000,
            "ingester active series": 1920000
        },
        "read path": {
            "query-frontend queries/sec": 1200,
            "query-scheduler queries/sec": 2400,
            "querier queries/sec": 48,
            "store-gateway queries/sec": 192,
            "active series": 36923077
        },
        "compaction": {
            "compactable active series": 60000000
        }
    },
    "specs": {
        "write path": {
            "distributor": {
                "count": 3,
                "flavor": "b2-15"
            },
            "ingester": {
                "count": 3,
                "flavor": "b2-60"
            },
            "compactor": {
                "count": 3,
                "flavor": "b2-60"
            }
        },
        "read path": {
            "query-frontend": {
                "count": 3,
                "flavor": "b2-15"
            },
            "query-scheduler": {
                "count": 3,
                "flavor": "b2-15"
            },
            "querier": {
                "count": 3,
                "flavor": "b2-15"
            },
            "store-gateway": {
                "count": 3,
                "flavor": "b2-60"
            }
        }
    }
}

(The flavors are based on OVHcloud public cloud instances.)

pracucci commented 2 years ago

What would be the output of mimirtool? I reckon that cores/memory/disk per Mimir module should be enough.

I would also add the number of replicas per Mimir component. The output format should be configurable, ideally supporting:

maximum capacity: the deployment of Mimir depends on how full you want your cluster to be. Should it be at 50% capacity? 60%?

Right. At Grafana Labs we call it "target capacity" and that should be another input factor too.

osg-grafana commented 2 years ago

Scoping the estimation as high, because this doc ticket is large and unactionable at its current stage in development.

osg-grafana commented 10 months ago

Removing from Docs Squad backlog because @cristiangsp and @osg-grafana agree that it is in Engineering’s hands.