[SIG Services][Spike] Collect SLOs from various managed services

tumido commented 2 years ago

Before the next SIG Services meeting, pick a Red Hat managed product (or any other managed service) and collect a list of its SLOs, so we can compare them and derive a master list of SLOs that service owners can use as inspiration for their own.

If you pick a service, leave a comment here stating the service name, so we don't end up with multiple people working on the same service.

SamoKopecky commented 2 years ago

Example of SLOs for Google Dataplex.

Availability/Uptime

Uptime is specified as a monthly uptime percentage (e.g. 99.5%).
Uptime is the state of the service when no downtime is present. Downtime is a state where the error rate stays above some percentage for longer than a set length of time (the downtime period). For example, once the error rate exceeds 5%, the service is considered to be in downtime.
The downtime period is the minimum length of time the service must be failing before the outage counts. For example, only downtime that lasts more than 10 minutes counts towards the uptime calculation.
Error rate is the number of requests that result in a response with HTTP status code 500, divided by the number of valid requests.

Latency/Response Time

The proportion of valid requests served faster than a threshold, for example 100ms.
Requests are valid if they are not answered with an HTTP status code of 500.
Latency is not part of the Google Dataplex SLOs, but it was common among the other SLOs I found.

Example summary of SLOs

Uptime -- >=99.5%
Error rate -- <=5%
Downtime period -- <=10 minutes
Latency -- <=100ms
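
To make the definitions above concrete, here is a minimal sketch of how the uptime and latency SLIs could be computed with the example thresholds from the summary. It assumes per-minute error-rate samples and a plain list of request latencies as input; the function names and the input format are hypothetical and do not come from Google's SLA text or any monitoring tool.

```python
from typing import Sequence

# Example thresholds taken from the summary above.
ERROR_RATE_THRESHOLD = 0.05   # an error rate above 5% means the minute is "down"
MIN_DOWNTIME_MINUTES = 10     # only runs longer than 10 minutes count as downtime
LATENCY_THRESHOLD_MS = 100.0  # requests faster than this count as "fast"


def monthly_uptime(error_rates: Sequence[float]) -> float:
    """Return the uptime percentage for a series of per-minute error rates.

    Assumption: one sample per minute, each a fraction between 0 and 1.
    """
    if not error_rates:
        raise ValueError("at least one sample is required")

    downtime = 0  # minutes that count against uptime
    run = 0       # length of the current run of bad minutes
    for rate in error_rates:
        if rate > ERROR_RATE_THRESHOLD:
            run += 1
        else:
            # The run ended; it only counts if it lasted long enough.
            if run > MIN_DOWNTIME_MINUTES:
                downtime += run
            run = 0
    # Account for a run that lasts until the end of the series.
    if run > MIN_DOWNTIME_MINUTES:
        downtime += run

    total = len(error_rates)
    return 100.0 * (total - downtime) / total


def latency_sli(latencies_ms: Sequence[float]) -> float:
    """Return the proportion of valid requests served faster than the threshold.

    Assumption: the caller has already filtered out invalid requests.
    """
    if not latencies_ms:
        raise ValueError("at least one request is required")
    fast = sum(1 for latency in latencies_ms if latency < LATENCY_THRESHOLD_MS)
    return fast / len(latencies_ms)
```

With these thresholds, a 5-minute error spike does not reduce uptime at all, while a 30-minute outage counts in full. Comparing the returned percentage against 99.5% then tells you whether the monthly SLO was met.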

There are many more examples of Google SLAs (SLOs) here.

tumido commented 2 years ago

I took a look at AppSRE for their SLOs:

Internal resources: https://source.redhat.com/groups/public/sre-services/sre_services_wiki/appsre_slos https://service.pages.redhat.com/dev-guidelines/docs/appsre/onboarding/creating-slos/

Schema: https://github.com/app-sre/qontract-schemas/blob/main/schemas/app-sre/slo-document-1.yml

Example: https://gitlab.cee.redhat.com/service/app-interface/-/blob/a1313ecc522abac8db6612069507e9c4bf934fe7/data/services/app-interface/slo-documents/app-interface.yml

Availability/Usage metrics:

Reconciliation times of app-interface integration
Time to onboard/time to merge to app-interface when self service
Time to onboard/time to merge to app-interface when it requires approval
Ticket response time
Operation toil time
Slack support time to respond
Service degradation time limit/error budget
Tenant service degradation time limit/error budget in each environment (prod/stage/beta/else)

Gkrumbach07 commented 2 years ago

Red Hat OpenShift Streams for Apache Kafka
Service definition: https://access.redhat.com/articles/6473891
General Service Terms and Conditions: https://www.redhat.com/licenses/Appendix_4_Red_Hat_Online_Services_20211021.pdf

Summary

Availability

Service availability: Red Hat maintains 99.95% availability for its General Availability cloud services.
Maintenance: Red Hat may perform periodic maintenance on the Online Services and on the systems supporting them.
Credits for lack of availability: see Table 2(b) of the Terms and Conditions.
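
For a sense of scale, the 99.95% availability figure above can be turned into an allowed-downtime budget. The sketch below assumes a 30-day measurement window and does not model maintenance windows or any exclusions in the Terms and Conditions; the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Return how many minutes of downtime fit within an availability target.

    Assumption: the window is `days` full days, with no excluded periods.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100
```

Under these assumptions, `allowed_downtime_minutes(99.95)` gives roughly 21.6 minutes per 30 days, compared with roughly 216 minutes for a 99.5% target.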

Support

See this table for details: https://access.redhat.com/support/offerings/production/sla

Support coverage
Support contacts
Support response time

Performance

Capable of scaling to their defined service limits.

Kafka-specific limits

See limits here: https://access.redhat.com/articles/5979061

There are more limits, but they are specific to a Kafka cluster rather than general to a service.

codificat commented 2 years ago

The Thoth team has WIP/draft documentation on defining SLOs for the service.

Nothing is final yet, but here are the SLIs listed as the focus:

thamos advise (quality and latency)
learning rate on unknown application stacks (latency, quality)
Kebechet dependency update (quality, latency, cost)
Amun Inspection Runs (quality, latency) quantity?
Khebhut app workflows (quality, latency) (higher level as they include different tasks)

In short, the overall focus is on two aspects: response time (latency) and service coverage (quality).

