skonto commented 3 years ago

Describe the change you'd like to see Organizations are rapidly adopting SLOs (service level objectives) as the foundations of their SRE engineering effort. The idea is to view services from a user perspective and follow these steps:

Select metrics that make good SLIs
Use SLIs to create proper SLOs
Use the error budget implicitly defined by your SLO to mitigate risks (out of scope).

Knative Serving provides a number of metrics on which we can create a number of SLIs/SLOs for the end user. Users should know what the available metrics are and should have meaningful SLIs/SLOs ready to use. As a first step we should document the information needed in this repo and then create and/or suggest the right tooling to help users monitor/enforce SLOs.

Additional context

Service Level Indicators (SLIs) are metrics that measure a property of a service. The metrics used to measure the level of service provided to end users (e.g., availability, latency, throughput).
Service Level Objectives (SLOs) are the targeted levels of service, a statement of performance, measured by SLIs. They are typically expressed as a percentage over a period of time. SLOs can be either time-based which means how much time of the measured period we need to achieve our performance target or events-based which means what percentage of the events need to be successful.

A nice blog post on the topic with many details can be found here.

Some examples:

SLI 1: “A service submitted/updated by the user should have its revision in ready state within N ms” This will help us detect issues like revisions not coming up in time. We can report times with a new metric when we reconcile the revision at the controller component.
SLI 2: “A service deleted by the user should have its resources removed within M ms” Same as above.
SLI 3: “Services creating a lot of blocked connections at the activator side should be automatically re-configured within X ms” This SLI requires some auto-tuning capabilities that don't currently exist. The idea is that when CC (current concurrency) is not 0 (infinite) then requests might get blocked and queued at the [activator side] (https://github.com/knative/serving/blob/edf5ae036c246b98b4392017fb1d94b7ced066b0/pkg/activator/net/throttler.go#L217) as they are going to be throttled. At some point if queued requests exceed a [limit] (https://github.com/knative/serving/blob/edf5ae036c246b98b4392017fb1d94b7ced066b0/pkg/activator/net/throttler.go#L57) errors will be returned. The idea is to be able to detect when the activator is under pressure and proactively configure services to avoid request delays.
SL4: “When switching from proxy to serve mode 95% percentile of latency for the proxied requests (via activator) should be statistically similar to the ones served directly”

It is desirable to have the same behavior for proxied and non-proxied requests. When a request goes through our system, the user shouldnt see a difference compared to the requests that hit his service directly for a high number of requests issued.
SLI 5: “Number of auto-scaling actions that reached the desired pod number”
SLI 6: “Cold start times are below a threshold of N ms” This requires some metric that calculates time spent at pod bootstrapping and user’s app initialization
SLO 1: “99.9% of the services submitted/updated should become ready within N ms and removed with M ms”
SLO 2: “99.9% of the time Serving is working in stable mode for all services”

Resources

Calling out several people who might be interest in this: @evankanderson @markusthoemmes @yuzisun @csantanapr @aslom @mattmoor @grantr @tcnghia @abrennan89

csantanapr commented 3 years ago

I love this topic @skonto and also the resources linked are very useful

jchesterpivotal commented 3 years ago

A nitpick: the SLI examples are SLOs. For example:

A service submitted/updated by the user should have its revision in ready state within N ms

Could be broken into an SLI:

Time between Service applied and Ready=True (SLI-1)

and an SLO:

SLI-1, less than 10ms at 99.9th percentile.

More generally, are we intended to publish suggested SLIs and SLOs that users will monitor themselves, or are we going to measure them ourselves? For the latter, we will need to define the experimental apparatus pretty closely, because a lot of variability can come from sources we'd like to exclude (eg. landing on node with noisy neighbours, running the test during peak hours vs off-hours, etc).

skonto commented 3 years ago

@jchesterpivotal technically it depends imho, I followed what the authors describe in my first link (pls check their chapter on SLIs):

An SLI is useful if it can result in a binary “good” or “bad” outcome for each event, either the service did what the users expected or it did not.

That means given a threshold of N ms I can say if the outcome is my revision to be up (good) or down (bad). How I set that threshold and what metric I use to make a binary decision is the next step. SLOs are the percentage of good events that you want to happen. If you get less out of the total ones you are in a bad state.

In general the line between SLIs/SLOs is not that clear it depends on your metrics and what your organization thinks it should be. Even for the SLI definition there is a difference between what is defined in here Site Reliability Engineering, The Site Reliability Workbook and in the first link above. I have chosen the above because I thought it was the most useful in this context.

More generally, are we intended to publish suggested SLIs and SLOs that users will monitor themselves, or are we going to measure them ourselves?

I would like to see SLIs/SLOs: a) documented with examples and described semantics b) measured eg. have a sample app where we can demonstrate their use and suggest some values for thresholds etc c) provide a mechanism to monitor SLOs and make sure for example Serving control plane follows them by auto-tuning Knative (long term).

Yes we need a setup where we can evaluate our assumptions and can be used as a baseline to demonstrate SLIs/SLOs. The actual values of thresholds for example are of less of interest and could be configurable (SLIs do evolve in time), the important part is to define what we recommend as important SLIs/SLOs to help users understand their workloads, do proper alerting on top of SLOs etc.

evankanderson commented 3 years ago

I'd generally agree with Jacques about SLIs being the measurement and then SLOs being the threshold on the measurement. The nice thing about this is that we can define the SLIs and suggest SLOs, but each installation can make their own determination about what SLOs (reliability) they want, and can document it against common SLIs. Usually the measurement is the hard part until you want a better SLO than you get with your current environment -- raising your SLO means figuring out why things are bad/slow.

I'm trying to figure out metrics to support the above (assuming that you don't just have a stream of all the kubernetes event transitions, which would be another way to do it for control plane events). Some of the goals are easy to define but sometimes hard to measure (like time-to-Ready, which can be thrown off by things like submitting a Revision whose spec.template.container[0].image points to a bad image).

I'd also break up the SLIs into Control Plane and Data Plane SLIs. I think platform administrators are probably interested in both control plane and data plane SLIs, but developers and application operators are probably most concerned about the data plane SLIs, which may actually measure what the application's users are experiencing (so they can use these SLIs to craft application SLOs in addition to platform SLOs).

A few notes on the proposed SLIs.

SLI 1: “A service submitted/updated by the user should have its revision in ready state within N ms” This will help us detect issues like revisions not coming up in time. We can report times with a new metric when we reconcile the revision at the controller component.

I think you'll need a way to distinguish user/system error here. Reporting this is probably also somewhat tricky -- you may be able to measure this by exporting k8s.object.update_age with one or more Condition labels, which would let you query for count(update_age{Ready=False} > X) / (update_age > X), assuming that update_age is NOW() - metadata.$UPDATE_TIME (not sure which field will give you that; you might need to dig in managedFields[].time).

SLI 2: “A service deleted by the user should have its resources removed within M ms” Same as above.

I'm trying to figure out how to compute this given the visibility that we have; we might be able to report on NOW() - metadata.deletionTimestamp as a metric, assuming that there's a finalizer in place. Alternatively, if we're simply relying on kubernetes object GC, this would probably be a Kubernetes cluster health problem.

SLI 3: “Services creating a lot of blocked connections at the activator side should be automatically re-configured within X ms” This SLI requires some auto-tuning capabilities that don't currently exist. The idea is that when CC (current concurrency) is not 0 (infinite) then requests might get blocked and queued at the [activator side] (https://github.com/knative/serving/blob/edf5ae036c246b98b4392017fb1d94b7ced066b0/pkg/activator/net/throttler.go#L217) as they are going to be throttled. At some point if queued requests exceed a [limit] (https://github.com/knative/serving/blob/edf5ae036c246b98b4392017fb1d94b7ced066b0/pkg/activator/net/throttler.go#L57) errors will be returned. The idea is to be able to detect when the activator is under pressure and proactively configure services to avoid request delays.

As you point out, we don't have these capabilities today; furthermore, it's not clear what "proactively configure services" means here -- change containerConcurrency or some other item in spec?

SL4: “When switching from proxy to serve mode 95% percentile of latency for the proxied requests (via activator) should be statistically similar to the ones served directly”

It is desirable to have the same behavior for proxied and non-proxied requests. When a request goes through our system, the user shouldnt see a difference compared to the requests that hit his service directly for a high number of requests issued.

Would you be comfortable rewriting this as "The 95th percentile infrastructure latency contribution is < Xms", or is there some other behavior you're trying to capture here (e.g. equivalent request distribution)?

Measuring this from inside Knative will be somewhat tricky, particularly since not all the HTTP ingress mechanisms expose a latency number, so Knative code may only be able to expose latency from the activator/queue-proxy onwards.

SLI 5: “Number of auto-scaling actions that reached the desired pod number”

I'm not sure what this is trying to count, because the desired number of replicas is a function of supplied traffic -- is this simply "number of adjustments to requested replicas" or "difference between autoscaled desired and current"?

SLI 6: “Cold start times are below a threshold of N ms” This requires some metric that calculates time spent at pod bootstrapping and user’s app initialization

I think this could be measured as "latency of activator requests that caused a kubernetes action" or even "latency of activator requests which had to be queued for some time".

SLO 1: “99.9% of the services submitted/updated should become ready within N ms and removed with M ms”

There are two different SLOs here you might want:

Latency of the control loop on moving to the next step (how responsive are the controllers to changes)
Latency between "new desired state" and a final state as observed by Conditions.

SLO 2: “99.9% of the time Serving is working in stable mode for all services”

I'm not sure what "stable mode" is, but I could see per-Service or per-Revision targets like:

Serving introduces no more than Xms latency at 50/95/99
Serving introduces fewer than X% errors

Note that this is separate from the number of errors served by the container -- and some containers could (for example) never actually listen on the requested $PORT or never actually send an HTTP response (for example, a container running nc might write all the bytes received to a log, but never send back a response).

skonto commented 3 years ago

I'd generally agree with Jacques about SLIs being the measurement and then SLOs being the threshold on the measurement.

So it depends on what you define is a measurement. Service level objective is the target you want to set for how long you want to have a service in good state that means for our case we have two thresholds, one for the SLI and one for the SLO which is the percentage we want to achieve during a time period where the service is used and it is in good state. So technically there are two thresholds in this specific case. Personally I like better this definition for having a binary measurement and then apply a target. The alternative way is to define SLOs as follows (Jacques suggestion):

"Target: The target defines the planned goal to achieve in terms of service delivery. A target could be, for example, that 99.99% of all service calls must return without an error, or that 95% of all service requests must be fulfilled in under 2 seconds’ response time."

But some other vendors define it differently as I suggested: "SLI: X should be true." If you measure latency for example according to this definition the pure metric value is not enough.

So it depends, I liked what the authors suggested in that book because I considered it clear, also the authors underline the fact that SLIs are defined differently elsewhere. Now defining the first threshold might be hard because it may depend on the env, but we could have a generous value, eg, if reconciliation happens after 10 minutes probably something is really wrong. My idea is that as developers of Knative we should know what to configure/suggest for some of these thresholds to pre-configure them as a baseline, otherwise people will have to do the hard work to go figure out all the numbers. Btw for maximum flexibility everything should be configurable because you may have a slow env but this might acceptable from a user perspective.

I'd also break up the SLIs into Control Plane and Data Plane SLIs. I think platform administrators are probably interested in both control plane and data plane SLIs,

Yes I am planning to create a spreadsheet to capture this.

As you point out, we don't have these capabilities today; furthermore, it's not clear what "proactively configure services" means here -- change containerConcurrency or some other item in spec

We could have a component to enforce SLOs when possible. Yes changing that parameter is an option, here is some related work.

I'm not sure what this is trying to count, because the desired number of replicas is a function of supplied traffic -- is this simply "number of adjustments to requested replicas" or "difference between autoscaled desired and current"?

My thinking was to capture to what extend we fulfilled the user's autoscaling needs, so it should be the latter, eg if resources dont exist to scale the replicas this should be expressed in such a SLI.

I'm not sure what "stable mode" is, but I could see per-Service or per-Revision targets like:

Stable mode is the opposite of panic mode.

evankanderson commented 3 years ago

I'm a big fan of SLIs measured from the user perspective. "Latency" isn't an SLI, because it's multi-dimensional (each event has a latency, but it's not clear how to combine them into a measure that you can threshold. Some example SLIs you could extract from latency data are:

99th percentile latency (measure in time units, the SLO would be "is less than X seconds")
number of requests with less than 50ms infrastructure-added latency (a percentage of requests)

My experience is that the second SLI is a better choice for SLOs, because it makes it easier to measure "how much did we miss our SLO by". (I learned the "convert to percentage" trick from someone else during a discussion of error budgets after many years of measuring Nth percentile for performance graphs.)

The nice thing about the second SLI is that you can measure it over different periods of time in a fairly natural way if you want both daily and weekly error budgets, and you can say things like "we spent 5 days of error budget on that 3 hour incident". Working the math with 99th percentile latency gets much trickier. 😁

In terms of autoscaler being in/out of panic mode, I wouldn't view that as something that we'd want to suggest an SLO on. Panic mode is a solution to certain external situations (despite the name), so I'd tend to instead focus on things like "added latency" (which would include time to start pods) or "overhead (additional pods) below threshold" without talking about the internals of the current autoscaling system.

There's a second "cut" here as well -- both our developer persona (and by extension, the business that's developing the application) and the cluster administrator persona (which might be a vendor or might be self-hosted) care about SLIs and SLOs, but probably different SLOs.

My own first pass based on what I think we can measure:

Data Plane

Serving

Platform admin

Infrastructure-added latency < 20ms
Infrastructure-added latency < 2s (container start)
Efficiency (requests / pods * concurrency)

Developer

Total application latency < X (chosen by app developer)
Successful (2xx+3xx) requests vs total requests

Eventing

Platform Admin

Delivery latency < 100ms
Queue length / current pending-delivery events

Developer

Delivery latency < X (app dependent)
Successful (non-dead-letter) delivery / all deliveries

Control Plane

By and large, I suspect that control plane availability as an SLO only matters to platform administrators; individual developers may be upset if the control plane isn't working, but they (or the business) don't have direct levers on that. I'm also assuming that since the control plane is largely all Kubernetes controllers, that the same SLIs will generally apply, possibly with different latency thresholds for different controllers.

It might also be interesting here to coordinate with upstream K8s work in terms of exposing common metrics in the long-term, but we shouldn't let that block getting something into controllers now.

Latency between generation updated and reaching a "settled" state (observedGeneration=generation and Ready=True or Ready=False) < 1m
Number of in-progress (non-settled) reconciliations (used to measure amount of churn in cluster)
Total number of resources (used to indicate max capacity)
Latency for an individual resource reconciliation < 1s

maximilien commented 3 years ago

See: https://github.com/knative/operator/issues/452

skonto commented 3 years ago

@evankanderson let's start with some of the above and try to measure them. Assuming we have the metrics in place how can we measure all these based on some reference infra? Should we have a dedicated cluster with some sample app?

evankanderson commented 3 years ago

@maximilien

evankanderson commented 3 years ago

/assign @maximilien

evankanderson commented 3 years ago

/remove-triage needs-eng-input

skonto commented 3 years ago

The related feature track doc is here. Operations WG is working on some similar stuff, after the doc is reviewed we need to discuss how to validate SLIs/SLOs. @maximilien wdyth? Has the group made any progress?

abrennan89 commented 2 years ago

Any updates on this issue? Is there any actual documentation work still to be completed, or can this move to a different repo until there is?

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

skonto commented 2 years ago

/remove-lifecycle stale

abrennan89 commented 2 years ago

@skonto bump, any update on this?

skonto commented 2 years ago

@abrennan89 no I will get back to this next year, there is some work to be merged in the upstream feature track doc for the developer persona based on what I have worked on downstream. Then we can create documentation for reference.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

skonto commented 2 years ago

/remove-lifecycle stale

abrennan89 commented 2 years ago

@skonto can you close this issue or move it to serving and instead open specific docs issues for this once there's something to document?

evankanderson commented 2 years ago

/transfer-issue serving

abrennan89 commented 2 years ago

Issue moved to knative/serving #13056 via ZenHub

knative / docs

Define Serving SLIs/SLOs #3140

Data Plane

Serving

Platform admin

Developer

Eventing

Platform Admin

Developer

Control Plane