thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Thanos Component Mixin Dashboards should show CPU & Memory limits. #4338

Closed bill3tt closed 3 years ago

bill3tt commented 3 years ago

Is your proposal related to a problem?

While debugging our internal monitoring stack with @bwplotka using the Thanos mixin dashboards, we struggled to tell whether components were exceeding their allowed CPU and memory limits, because those limits are not shown in the dashboards by default.

Describe the solution you'd like

Add 'standard' expected metrics to the component dashboards:

Describe alternatives you've considered

None.

Additional context

There will almost certainly be some really good example of these out in the wild. Two starting points:

bill3tt commented 3 years ago

Suggested labels: good first issue, component: mixins

wiardvanrij commented 3 years ago

As this requires more metrics than Thanos actually exposes (i.e. node / kube-api metrics), would it not make more sense to leave this to a completely separate dashboard?

I personally have dashboards like this:

image

and

image

With a global view like:

image

These dashboards are (correct me if I'm wrong) also included in the Prometheus operator. If we included these stats, we would create a dependency on those metrics and also (I think) on certain rules they require.

fktkrt commented 3 years ago

Yes, these are included in the kubernetes-mixin, which is included in prometheus-operator in a generalised fashion. This means that you have to choose the appropriate dashboard for these stats, then filter for the namespace in question to see these metrics for everything under that namespace.

Some tools have similar metrics on their operational overview dashboards to make troubleshooting performance issues easier. Loki comes to mind, if I am right.

yeya24 commented 3 years ago

IMO these CPU & memory limits are specific to Kubernetes environments. The dashboards from the kubernetes-mixin are sufficient for this use case, and I feel it is not a good idea to add them to the Thanos mixin.
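To make the Kubernetes dependency concrete, here is a minimal sketch of the kind of query a "usage vs. limit" panel would need. It assumes the cluster runs cAdvisor (for `container_memory_working_set_bytes`) and kube-state-metrics v2 (for `kube_pod_container_resource_limits`); neither metric is exported by Thanos itself. The helper name `memory_limit_ratio_query` and the namespace value are hypothetical.

```python
def memory_limit_ratio_query(namespace: str) -> str:
    """Build a PromQL expression for per-pod memory usage as a fraction of its limit.

    Assumption: both metrics come from Kubernetes tooling (cAdvisor and
    kube-state-metrics v2), not from Thanos.
    """
    # Working-set memory per pod; container!="" skips the pod-level aggregate series.
    usage = (
        "sum by (pod) (container_memory_working_set_bytes"
        f'{{namespace="{namespace}", container!=""}})'
    )
    # Configured memory limit per pod, as reported by kube-state-metrics.
    limit = (
        "sum by (pod) (kube_pod_container_resource_limits"
        f'{{namespace="{namespace}", resource="memory"}})'
    )
    return f"{usage} / {limit}"


if __name__ == "__main__":
    print(memory_limit_ratio_query("thanos"))
```

A dashboard built on this expression only renders data where both exporters are scraped, which is the environment-specific coupling discussed in this thread.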

bwplotka commented 3 years ago

I think the mixin is fine; the only thing is that we probably should not reuse the same dashboards we have now.

The reason is that there is a difference between monitoring dashboards and deep-dive, quick-debugging dashboards, which can have 100 graphs.

bill3tt commented 3 years ago

@wiardvanrij & @yeya24 you both raise an important point I hadn't considered. For the core Thanos project to remain environment neutral, we can't introduce environment-specific dependencies into the mixin; we can only use metrics that are exported by the core project itself.

I still think there is scope to include relevant CPU & memory information from the go_* family of metrics, but I'm not sure exactly what quite yet.
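As a rough sketch of what that could look like: the Prometheus Go client's default collectors usually expose `go_goroutines` and `go_memstats_heap_alloc_bytes`, plus the closely related `process_cpu_seconds_total` and `process_resident_memory_bytes` (strictly `process_*` rather than `go_*`, and dependent on platform support). The helper below is hypothetical, and the `job` selector is an assumption about how the components are labelled.

```python
from typing import Dict


def runtime_resource_queries(job: str, interval: str = "5m") -> Dict[str, str]:
    """Build environment-neutral PromQL expressions for one component.

    Assumption: the target is a Go binary exposing the default Go and
    process collectors; no Kubernetes metrics are involved.
    """
    selector = f'{{job=~"{job}"}}'
    return {
        # CPU seconds consumed per second, i.e. cores in use.
        "cpu_cores_used": f"rate(process_cpu_seconds_total{selector}[{interval}])",
        # Resident set size of the process as seen by the OS.
        "resident_memory_bytes": f"process_resident_memory_bytes{selector}",
        # Bytes of live heap objects tracked by the Go runtime.
        "heap_alloc_bytes": f"go_memstats_heap_alloc_bytes{selector}",
        # Number of goroutines, a cheap proxy for concurrency pressure.
        "goroutines": f"go_goroutines{selector}",
    }


if __name__ == "__main__":
    for name, expr in runtime_resource_queries("thanos-store.*").items():
        print(f"{name}: {expr}")
```

These expressions show usage only. They cannot show limits, since the limits are defined by the runtime environment rather than by the process.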

@bwplotka I think you meant to post that comment in https://github.com/thanos-io/thanos/issues/4401 :) (which I hadn't yet created during the meeting yesterday).

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 3 years ago

Closing for now as promised, let us know if you need this to be reopened! 🤗