rhobs / observability-operator

Create Monitoring Stacks you need on your Kubernetes clusters!
31 stars 55 forks source link

Support and expose an SLO-based monitoring interface #13

Open sugarraysam opened 2 years ago

sugarraysam commented 2 years ago

(Late comment on the proposal: https://github.com/openshift/enhancements/pull/866#pullrequestreview-771897761)

Hello, I am part of the MTSRE team at Red Hat and we would like to expose an SLO-based monitoring interface to our tenants. The goals behind this request are:

There exists various tools out there that could help us abstract away prometheus/alertmanager and grafana configs:

Sloth: https://github.com/slok/sloth Keptn: https://github.com/keptn/keptn

I personally like sloth a lot, as it already has a kubernetes operator and an SLO-based CRD: PrometheusServiceLevel. Also, it maintains a set of uniform grafana dashboards, that can be automatically deployed: https://sloth.dev/dashboards/.

Benefits of an SLO-based monitoring approach:

This is also the approach that was chosen by the AppSRE team, see /schemas/app-sre/slo-document-1.yml in app-interface.

bwplotka commented 2 years ago

This is an amazing idea, thanks for this.

To rephrase - it's essentially an idea to be able to specify SLO and automatically generate Grafana Dashboards and Prometheus Rules.

A couple of thoughts from my side:

I think this should be handled as an extension, similar to what we want to do with Grafana. I don't think there is anything blocking us from an enhancement point of view, so I think it's fine to not mention this for now there.

Let's leave this issue so we can track some PoC to integrate this with Monitoring Stacks going forward. Some controller that has SLO CR and produces PrometheusRule CR and Dashboard CR sounds quite feasible (:

sugarraysam commented 2 years ago

Perfect, thank you for such a positive answer. :)

I think adding SLO fields to a ServiceMonitor sounds great. One thing I realized, is the new solution needs to allow ingesting simple metrics that will be combined in a complex promql expression to build an SLI.

Here is an example from app-interface:

slos:
- name: API availability
  SLIType: availability
  SLISpecification: Percentage of requests that are served successfully (status code != 5xx)
  SLODetails: https://gitlab.cee.redhat.com/service/app-interface/docs/service-registry/slos/srs-fleet-manager-api-availability.md
  SLOParameters:
    window: 28d
  expr: |
    sum(rate(haproxy_backend_http_responses_total{route="srs-fleet-manager",exported_namespace="service-registry-stage",code!="5xx"}[{{window}}]))
    /
    sum(rate(haproxy_backend_http_responses_total{route="srs-fleet-manager",exported_namespace="service-registry-stage"}[{{window}}]))

You can see how the SLI is a complex PromQL query, but requires prometheus to already have ingested a simple metric haproxy_backend_http_responses_total.

I also want to say that we have engineering time on MTSRE side to help develop/review or work on some PoC, as the monitoring story is very important for us.

bwplotka commented 2 years ago

👍🏽 Copying comment from https://github.com/openshift/enhancements/pull/866#issuecomment-937729043

As per your comments @sugarraysam thanks for your idea. It is being discussed in rhobs/monitoring-stack-operator#13 and I think it's a solid thing to implement on top of ServiceMonitors. However, I don't think we can remove everything else which is not SLO based. Monitoring is more than SLO. It's also about troubleshooting, sometimes billing, data analysis of features etc. So I would not trim it all down to SLO usage only, even as MTSRE you have the right to ignore anything else. Does it make sense? (: