open-services-group / community


[SIG Services][Spike] Collect SLOs from various managed services #206

tumido closed this issue 1 year ago

tumido commented 2 years ago

Before the next SIG Services meeting, pick a Red Hat managed product (or any other managed service) and collect a list of its SLOs. We can then compare them and derive a consolidated list of SLOs that service owners can use as inspiration for their own SLOs.

If you pick a service, post a comment here stating the service name, so we don't end up with multiple people working on the same service.

SamoKopecky commented 2 years ago

Example of SLOs for Google Dataplex.

Availability/Uptime

Example summary of SLOs

  1. Uptime -- >=99.5%
  2. Error rate -- <=5%
  3. Downtime period -- <=10 minutes
  4. Latency -- <=100ms
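
To make the summary above concrete, here is a minimal Python sketch that checks one set of measurements against those four thresholds and derives the downtime budget implied by the uptime target. The class, field, and function names are hypothetical, and reading "Downtime period" as the longest single outage is an assumption, not something the Dataplex SLA is confirmed to say.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloTargets:
    # Thresholds copied from the summary above; field names are illustrative.
    min_uptime_pct: float = 99.5
    max_error_rate_pct: float = 5.0
    max_downtime_minutes: float = 10.0
    max_latency_ms: float = 100.0


@dataclass(frozen=True)
class Measurement:
    uptime_pct: float
    error_rate_pct: float
    # Assumption: "downtime period" is treated as the longest single outage.
    longest_downtime_minutes: float
    latency_ms: float


def violated_slos(m: Measurement, t: SloTargets = SloTargets()) -> list[str]:
    """Return the names of the SLOs that the measurement does not meet."""
    checks = {
        "uptime": m.uptime_pct >= t.min_uptime_pct,
        "error_rate": m.error_rate_pct <= t.max_error_rate_pct,
        "downtime_period": m.longest_downtime_minutes <= t.max_downtime_minutes,
        "latency": m.latency_ms <= t.max_latency_ms,
    }
    return [name for name, ok in checks.items() if not ok]


def allowed_downtime_minutes(min_uptime_pct: float, window_days: int = 30) -> float:
    """Downtime budget implied by an uptime target over a window of days."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - min_uptime_pct) / 100.0


if __name__ == "__main__":
    sample = Measurement(uptime_pct=99.7, error_rate_pct=6.0,
                         longest_downtime_minutes=4.0, latency_ms=120.0)
    print(violated_slos(sample))
    print(allowed_downtime_minutes(99.5))
```

Over a 30-day window, a 99.5% uptime target leaves a budget of 216 minutes of downtime, which is a useful number to have when comparing targets across services.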

There are many more examples of Google SLAs (SLOs) here.

tumido commented 2 years ago

I took a look at AppSRE for their SLOs:

Internal resources:
https://source.redhat.com/groups/public/sre-services/sre_services_wiki/appsre_slos
https://service.pages.redhat.com/dev-guidelines/docs/appsre/onboarding/creating-slos/

Schema: https://github.com/app-sre/qontract-schemas/blob/main/schemas/app-sre/slo-document-1.yml

Example: https://gitlab.cee.redhat.com/service/app-interface/-/blob/a1313ecc522abac8db6612069507e9c4bf934fe7/data/services/app-interface/slo-documents/app-interface.yml

Availability/Usage metrics:

Gkrumbach07 commented 2 years ago

Red Hat OpenShift Streams for Apache Kafka

Service definition: https://access.redhat.com/articles/6473891

General Service Terms and Conditions: https://www.redhat.com/licenses/Appendix_4_Red_Hat_Online_Services_20211021.pdf

Summary

Availability

Support

See this table for details: https://access.redhat.com/support/offerings/production/sla

Performance

Kafka-specific limits

See limits here: https://access.redhat.com/articles/5979061

There are more limits, but they are specific to a Kafka cluster rather than general to a service.

codificat commented 2 years ago

The Thoth team has WIP/draft documentation on defining SLOs for the service.

The details are not final yet, but these are the SLIs currently listed as the focus:

In short, the overall focus is on two aspects: response time (latency) and service coverage (quality).
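
As a rough illustration of those two aspects, the sketch below computes a latency percentile and a coverage ratio from raw request data. It is not taken from the Thoth draft: the function names are hypothetical, the nearest-rank percentile is just one common choice, and treating "coverage" as the share of requests the service could answer is an assumption.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def coverage_ratio(answered: int, total: int) -> float:
    """Share of requests the service could answer (assumed meaning of 'coverage')."""
    if total <= 0:
        raise ValueError("total must be positive")
    return answered / total


if __name__ == "__main__":
    latencies_ms = [120, 80, 100, 300, 90]  # made-up sample data
    print(percentile_nearest_rank(latencies_ms, 95))
    print(coverage_ratio(answered=45, total=50))
```

An SLO would then pair each of these indicators with a target and a measurement window, which the draft has yet to define.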

sesheta commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale