Support and expose an SLO-based monitoring interface

sugarraysam commented 2 years ago

(Late comment on the proposal: https://github.com/openshift/enhancements/pull/866#pullrequestreview-771897761)

Hello, I am part of the MTSRE team at Red Hat and we would like to expose an SLO-based monitoring interface to our tenants. The goals behind this request are:

Be able to easily measure SLOs/SLIs and error budgets
Abstract as much of the monitoring stack as possible to our MTO (prometheus/alertmanager/grafana configs) so that they can spend their time doing feature development and bug fixing
Make sure we perform active reconciliation on this interface (operator-pattern)

There exists various tools out there that could help us abstract away prometheus/alertmanager and grafana configs:

Sloth: https://github.com/slok/sloth Keptn: https://github.com/keptn/keptn

I personally like sloth a lot, as it already has a kubernetes operator and an SLO-based CRD: PrometheusServiceLevel. Also, it maintains a set of uniform grafana dashboards, that can be automatically deployed: https://sloth.dev/dashboards/.

Benefits of an SLO-based monitoring approach:

Provide better context on what you monitor, and put your users first
Abstract away the monitoring stack, as prometheus/alertmanager/grafana configs are challenging
Provide a uniform visualization interface by reusing the same grafana dashboards for all addons. This is a big plus for me as developing dashboards is extremely time-consuming, MTSRE engineers will get a huge productivity boost from an homogeneous visualization interface.
Reconcile SLOs and generated monitoring configs, to allow evolution and iteration (operator-pattern)

This is also the approach that was chosen by the AppSRE team, see /schemas/app-sre/slo-document-1.yml in app-interface.

bwplotka commented 2 years ago

This is an amazing idea, thanks for this.

To rephrase - it's essentially an idea to be able to specify SLO and automatically generate Grafana Dashboards and Prometheus Rules.

A couple of thoughts from my side:

We spoke about it with @smarterclayton and our idea was to potentially add SLO like fields to ServiceMonitor directly. Imagine - technically it should service owner responsibility to define SLO AND how to monitor this service. Problem is that we probably don't want to tie Prometheus Operator with Grafana.
We have Prometheus Operator that handles rules and Grafana Operator that allows specifying dashboards. We already plan to depend on those in MST.

I think this should be handled as an extension, similar to what we want to do with Grafana. I don't think there is anything blocking us from an enhancement point of view, so I think it's fine to not mention this for now there.

Let's leave this issue so we can track some PoC to integrate this with Monitoring Stacks going forward. Some controller that has SLO CR and produces PrometheusRule CR and Dashboard CR sounds quite feasible (:

sugarraysam commented 2 years ago

Perfect, thank you for such a positive answer. :)

I think adding SLO fields to a ServiceMonitor sounds great. One thing I realized, is the new solution needs to allow ingesting simple metrics that will be combined in a complex promql expression to build an SLI.

Here is an example from app-interface:

slos:
- name: API availability
  SLIType: availability
  SLISpecification: Percentage of requests that are served successfully (status code != 5xx)
  SLODetails: https://gitlab.cee.redhat.com/service/app-interface/docs/service-registry/slos/srs-fleet-manager-api-availability.md
  SLOParameters:
    window: 28d
  expr: |
    sum(rate(haproxy_backend_http_responses_total{route="srs-fleet-manager",exported_namespace="service-registry-stage",code!="5xx"}[{{window}}]))
    /
    sum(rate(haproxy_backend_http_responses_total{route="srs-fleet-manager",exported_namespace="service-registry-stage"}[{{window}}]))

You can see how the SLI is a complex PromQL query, but requires prometheus to already have ingested a simple metric haproxy_backend_http_responses_total.

I also want to say that we have engineering time on MTSRE side to help develop/review or work on some PoC, as the monitoring story is very important for us.

bwplotka commented 2 years ago

👍🏽 Copying comment from https://github.com/openshift/enhancements/pull/866#issuecomment-937729043

As per your comments @sugarraysam thanks for your idea. It is being discussed in rhobs/monitoring-stack-operator#13 and I think it's a solid thing to implement on top of ServiceMonitors. However, I don't think we can remove everything else which is not SLO based. Monitoring is more than SLO. It's also about troubleshooting, sometimes billing, data analysis of features etc. So I would not trim it all down to SLO usage only, even as MTSRE you have the right to ignore anything else. Does it make sense? (:

rhobs / observability-operator

Support and expose an SLO-based monitoring interface #13