knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Initial scale of a new revision should be able to handle incoming load #4665

Closed greghaynes closed 3 years ago

greghaynes commented 5 years ago

In what area(s)?

/area autoscale /kind proposal

Describe the feature

Assume I have a configuration with revision1 which is serving some QPS via a runlatest route, requiring replicaCount > 1 of its deployment. I then create revision2 for this configuration. All of the traffic for this route will be redirected to revision2 once 1 instance is ready, and the autoscaler will not begin to scale up replicaCount until that point. This will cause a disruption of service while the additional replicas are scaled up, especially when replicaCount of revision1 >> 1.

Our autoscaler should be smarter about initially scaling up a revision to match the incoming load if we have that information.

duglin commented 5 years ago

If all requests to revision1 were due to a revision-specific URL and not to the KnService's URL, will the autoscaler be able to tell? In that case revision2 won't get any traffic at all, so keeping its initial scale small would be more appropriate.

markusthoemmes commented 5 years ago

I like the idea in general. I wonder, however, what our preferred model is in this case. Wouldn't we advocate a gradual cutover via traffic splitting (which is one of the reasons we require traffic splitting in the first place)?

Scaling both deployments to match traffic first is wasteful in terms of the resources needed.

greghaynes commented 5 years ago

Yes, I am intentionally trying to avoid saying "copy the scale of the previous revision" for this reason - it's actually a calculation that would need to take routes into account, and TBH I don't have a great answer.

greghaynes commented 5 years ago

I think a good way to look at it is - I'd expect us to have a story for "I want to run my app in knative without doing advanced routing and be able to update it without disruption". I'd expect runLatest to exhibit this behavior given that it is our "just DTRT" config option for routing. Maybe we can special-case when a configuration is only pointed to by runlatest routes? For more advanced cases I think it's fine if we say that users have to be more careful to shed load between revisions gracefully, but I think the out-of-the-box experience for the simple case should work for updates.

vagababov commented 5 years ago

It was part of my explorations document. The blunt solution, scaling the new revision up and then back down, is a bit expensive if all the traffic was arriving at the revision-specific URL, but it is the least likely to cause any disruptions due to traffic migrations.

If the deployment is run-latest, that should be the way to go at least initially (and possibly guarded via a feature flag). If the deployment is green-blue (or whatever traffic split there is), we can compute the expected traffic volume by aggregating over the existing traffic volume, taking the appropriate percentage, and scaling to that amount of traffic.
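
To make that concrete with purely illustrative numbers: if revision1 currently serves ~900 QPS aggregated across all of its routes and the new split will send 30% of that traffic to revision2, we would pre-scale revision2 for roughly 0.3 * 900 = 270 QPS worth of capacity before shifting.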

In any case, this feels like a very non-trivial change, so we'll probably do it in a post-1.0 world.

/cc @mattmoor

vagababov commented 5 years ago

/cc @mattmoor actually make prow do it:)

mattmoor commented 5 years ago

There are several motivations for wanting to establish this sort of association across Revisions in the same linear history; another that immediately comes to mind is applying VPA recommendations.

markusthoemmes commented 5 years ago

@greghaynes your last comment mostly sounds like you're after rollout orchestration here? As in: Slowly cutting over from one revision to the other without making the user do it?

greghaynes commented 5 years ago

I think our out-of-the-box rollout orchestration (route all traffic to the latest revision) should work for the helloworld example without major disruption of requests on an update. IMO for more advanced usages it's fine to tell the user they need to be aware of these kinds of issues and account for them, but we should have a (very) good story for "I have a stateless app I want to run on knative and always route traffic to the latest revision".

markusthoemmes commented 5 years ago

Yes, that makes sense to me. I was mostly voicing what wasn't apparent to me from the title: you actually want to be able to cut over between two revisions automatically without interruptions for the simple use-cases. The title (initial scale should be able to...) implied that that's somehow bound to the scale carrying over, even though you tried to avoid that 😅

knative-housekeeping-robot commented 4 years ago

Issues go stale after 90 days of inactivity. Mark the issue as fresh by adding the comment /remove-lifecycle stale. Stale issues rot after an additional 30 days of inactivity and eventually close. If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

knative-housekeeping-robot commented 4 years ago

Stale issues rot after 30 days of inactivity. Mark the issue as fresh by adding the comment /remove-lifecycle rotten. Rotten issues close after an additional 30 days of inactivity. If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle rotten

mattmoor commented 4 years ago

/remove-lifecycle rotten

mattmoor commented 4 years ago

So today we label the Revisions that are routable with the Route's label. We key off of this label to establish whether or not to apply minScale.

What if we used this label as a second virtual key for aggregating metrics? We would start to get a sense for the traffic flowing to a Route across the lifetime of many Revisions.

Now suppose that the Route also plumbed through the fraction of that traffic that was aimed at a particular Revision? Modulo latency differences we would know immediately the volume of traffic we're about to aim at the new Revision, and can make a scaling decision accordingly. [1]

This ofc has an ordering problem: It would be better to not shift the traffic until the new Revision was adequately prepared to handle it.

One thought (that I haven't fully thought thru) is: what if we treated {virtual key total}*{fraction of traffic} as a sort of virtual minScale annotation? If we prime the Revision this way (e.g. annotate before actually shifting traffic), and this causes the Revision to become non-Ready (as unsatisfied minScale does) then the Route's traffic shifting logic will hold until the Revision becomes Ready... (I don't like the lack of positive hand-off w/ autoscaler, but 🤷‍♂ )
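
A minimal sketch of that computation (the name, signature, and numbers are mine, not an existing autoscaler API): turn the Route-level aggregate concurrency (the "virtual key total") and the fraction of that traffic about to be pointed at the new Revision into an initial pod count, using the Revision's per-pod concurrency target.

package main

import (
	"fmt"
	"math"
)

// virtualMinScale is a hypothetical helper (not existing Knative code). It
// converts the Route-wide observed concurrency and the fraction of traffic
// about to shift to the new Revision into an initial pod count, given the
// Revision's per-pod concurrency target.
func virtualMinScale(routeConcurrency, fraction, targetPerPod float64) int {
	if targetPerPod <= 0 || fraction <= 0 {
		return 0
	}
	return int(math.Ceil(routeConcurrency * fraction / targetPerPod))
}

func main() {
	// e.g. 600 in-flight requests across the Route, 100% about to shift,
	// a target of 100 concurrent requests per pod -> prime the Revision at 6.
	fmt.Println(virtualMinScale(600, 1.0, 100))
}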


[1] - A corollary here is that this information would also provide us with exactly what we'd need to fractionally scale Revisions in a traffic split with KEDA. e.g. a 70/30 split could add a 0.7 multiplier on the queue-depth for that Revision, and 0.3 on the queue-depth of the other.

cc @vagababov @markusthoemmes to tell me how crazy I am 😂

vagababov commented 4 years ago

Basically, as I mentioned before, I want us to do the following:

pc := |current rev|    // pod count of the current revision
tc := current_target   // concurrency target of the current revision
nc := new_target       // concurrency target of the new revision
|new_rev| := tc*pc/nc  // initial pod count for the new revision
|new_panic| := true    // -> to make sure we don't immediately scale down, while traffic is being shifted
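
To put illustrative numbers on that formula: if the current revision runs pc = 10 pods at a target of tc = 100 concurrent requests per pod and the new revision's target is nc = 50, the new revision would start at 100 * 10 / 50 = 20 pods, and panic mode keeps it there while traffic shifts.
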
markusthoemmes commented 4 years ago

What makes this super complex is the N:M connection of routes and revisions. Any revision can be subject to N routes at any time, and it's non-trivial (in our current system) to find out how much traffic will be pointing to the new revision.

This decision is only fairly simple if we have a revision stamped out by a Service that has just a latestRevision: true route. In that case, we could simply infer the new scale from the revision right before the new one. In all other cases, all bets are kinda off and we'd need to construct a graph to collect data from all revisions that are subject to a route and then accumulate that into what the new revision would receive after a change.

Possible? Probably. But... I dunno.

It feels like the right mechanism for us to do this is to add orchestration to our traffic splits, so that they roll over smoothly rather than hammering the new revision immediately.

mattmoor commented 4 years ago

Another option: It would be possible for us to implement a GradualRollout CRD that mimics the rollout behavior of latestRevision: true. Effectively:

apiVersion: serving.knative.dev/v1alpha1
kind: GradualRollout
spec:
  foo: # Duck for extracting LatestReadyRevision from status
    apiVersion: serving.knative.dev/v1
    kind: Configuration
    name: asdf

  bar: # Duck for manipulating spec.traffic
    apiVersion: serving.knative.dev/v1
    kind: Route
    name: asdf

  params:  # totally random knobs, don't attack me :)
    increment: 3
    frequency: 30s
    ...

Either foo or bar should default to the other.

mattmoor commented 4 years ago

So I was thinking about GradualRollout last night, starting to throw together a PoC, and thinking about a couple of things which aren't easy as specified above:

  1. How do you stop a rollout or rollback?
  2. How do you traffic split over two configurations?

Honestly, the solution I liked best was to adopt the Route's traffic: block syntax directly:

traffic:
- configurationName: foo
  percent: 30
- configurationName: bar
  percent: 30
- revisionName: bar
  percent: 40

The basic idea is that revisionName is respected as-is, and configurationName rollouts are done gradually as specified in TBD knobs. How tag: translates would be TBD.

The more I thought about this approach and played with prototyping it, the more I became convinced that this should just be a part of Route, where omitting things gets you what we have today. So suppose we add to Route:

spec:
  rollout:
    increment: 1
    frequencySeconds: 5
  traffic:
  - configurationName: foo
    percent: 100

This would progress something like this. Original:

spec:
  rollout:
    increment: 1
    frequencySeconds: 5
  traffic:
  - configurationName: foo
    percent: 100
status:
  traffic:
  - revisionName: foo-old
    percent: 100

We create a new revision:

spec:
  rollout:
    increment: 1
    frequencySeconds: 5
  traffic:
  - configurationName: foo
    percent: 100
status:
  traffic:
  - revisionName: foo-old
    percent: 100
  - revisionName: foo-new
    percent: 0

5 seconds after above:

spec:
  rollout:
    increment: 1
    frequencySeconds: 5
  traffic:
  - configurationName: foo
    percent: 100
status:
  traffic:
  - revisionName: foo-old
    percent: 99
  - revisionName: foo-new
    percent: 1

The default rollout: would be an increment of 100 and a frequencySeconds of 0.
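
As a worked example with these (totally illustrative) knobs: increment: 1 with frequencySeconds: 5 needs 100 increments for a full cutover, i.e. on the order of 500 seconds, while the default (increment 100, frequencySeconds 0) reproduces today's immediate cutover.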

cc @tcnghia @dgerd

vagababov commented 4 years ago

This has the same question as you noted above. What happens if I roll out one more revision while this one is in progress? Do we start cascading or jumping?

dgerd commented 4 years ago

The more I thought about this approach and played with prototyping it, the more I became convinced that this should just be a part of Route, where omitting things gets you what we have today.

This doesn't seem to solve your original question of "How do you stop a rollout or rollback?"

Having spent very limited time thinking about this:

  1. I like the GradualRollout object. It feels very much like the K8s HPA and having a separate object acting upon an existing object is an easier mental model for me than an object mutating state within itself.
  2. Acting upon metrics seems like the best option to stop a rollout or rollback.
  3. I am not convinced that I like Route's traffic syntax the best. It is good at showing desired state, but not desired movement. Since we take n different routes I think I would prefer something that gave more of a from (where are we decreasing traffic from) and to (where are we sending traffic to now) syntax.

A stab at something different for the sake of discussion

spec:
  route: my-route
  rollout:
    incrementPercent: 1
    frequencySeconds: 30
  metrics:
     .... #TBD
  from:
   - revisionName: foo
  to:
   - configurationName: bar

Limitations:

Pros:

mattmoor commented 4 years ago

This doesn't seem to solve your original question of "How do you stop a rollout or rollback?"

Sure it does, same as I would rollback today. I post a traffic block with the revisionNames I want serving traffic while I fix things. 😉

vagababov commented 4 years ago

Is anyone interested in this for 0.14 work? :)

evankanderson commented 4 years ago

I’m wondering if this is actually two enhancements: one KPA enhancement, and one Route enhancement.

  1. KPA should be able to figure out a [requests -> instances ratio] and [the current sum of traffic for a Route], and use that as a scaling target for the new Revision.
  2. Route should be able to get a “capacity” number from a Revision, and apply the desired state only up to the capacity limit (while the KPA is driving capacity towards the desired Route state).

The way I imagine this would work is KPA computes

target = total instances * max-assignment(spec, status)

and Route computes

status.assignment = requests / current-capacity

This would require KPA to report the estimated current capacity in some way.

Result is that you point the Knative spec where you want to go, but the system doesn’t steer outside its performance envelope.
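
A minimal sketch of the first computation, assuming assignments are expressed as fractions of the Route's traffic (names and numbers are mine, not existing Knative code; the capacity-reporting half is left out):

package main

import (
	"fmt"
	"math"
)

// kpaTarget sketches "target = total instances * max-assignment(spec, status)":
// size the Revision for the largest traffic fraction it is assigned in either
// the Route's spec or its status, so a Revision that is about to receive
// traffic is scaled for it before the cutover actually happens.
func kpaTarget(totalInstances, specAssignment, statusAssignment float64) int {
	return int(math.Ceil(totalInstances * math.Max(specAssignment, statusAssignment)))
}

func main() {
	// New revision: spec assigns it 100% of traffic, status still shows 0%,
	// and six instances currently serve the Route -> ask for 6 pods up front.
	fmt.Println(kpaTarget(6, 1.0, 0.0))
}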

I'm concerned about requiring additional policy objects to do a simple thing safely. It seems like the system should default to "safe", and steering outside that should be the thing that requires extra equipment to effect.

jchesterpivotal commented 4 years ago

As a general observation, safety is harder to achieve with a "push" system than a "pull" system, where replicas ask for requests rather than receiving them.

Estimating capacity is doable and rough estimates can be used to improve behaviour of a push-based system. A good reference is this presentation by @stevegury.

But you're still going to face Kingman's formula, aka the VUT formula, which demonstrates that slowdowns are exponential as you approach 100% utilisation and that the point at which the slowdown curve inflects sharply upwards gets earlier if you have higher demand variability or higher variability of the underlying process.
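
For reference, Kingman's (VUT) approximation for the mean queueing delay in a G/G/1 queue is roughly

E[W_q] \approx \left(\frac{\rho}{1-\rho}\right)\left(\frac{c_a^2 + c_s^2}{2}\right)\tau

where \rho is utilisation, c_a and c_s are the coefficients of variation of inter-arrival and service times, and \tau is the mean service time. The \rho/(1-\rho) term is what makes delay blow up as utilisation approaches 100%, and the variability term is what moves that knee earlier.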

In Skenario I modelled the delay caused by lots of requests landing simultaneously using Sakasegawa's approximation; to avoid my charts blowing up I arbitrarily clamped utilisation to 96%. Even then what you'd observe was that the early requests have a long wait for service (cold start time + slow processing while heavily contended), after which it would level out.

vagababov commented 4 years ago

@evankanderson , that has been my suggestion all along. We already know the capacity via all routes by virtue of knowing how much traffic is going to the existing revision over all possible routes.

Now, if we inject that information into the new autoscaler object on creation, we can provide an initial scale and start in panic mode to ensure no scaledown happens.

The problem is to figure out which traffic is going to be migrated, i.e. from which of the revisions and, in theory, what % of that revision (e.g. a revision is targeted via 2 routes but only one of them is going to be replaced by the new revision, so we need to scale only to a fraction of the old revision).

That is simple in the run-latest model, but harder in others. Basically we need to diff the routes and infer the traffic that way...

So it seems like this is something that should be computed in the route controller first, before being passed to the revision/KPA controllers.

evankanderson commented 4 years ago

Do we need to figure out which traffic goes where, or simply use the max(spec, status) information from the Route(s) contributing traffic to determine how much load could be assigned to a given Revision?

Note that this is not "optimal" from a resource-usage perspective, because you may end up committing resources to both the "source" and "destination" Revision when moving traffic between them, but I'd argue that overcommit while moving traffic to preserve SLO is better than undercommit to minimize resource usage.

Example, assuming capacity for 8 pods in a cluster:

  1. Initial state:
     Revision  spec percent  status percent  num pods  desired pods
     A         100           100             6         6
  2. Request full rollout of B:
     Revision  spec percent  status percent  num pods  desired pods
     A         0             100             6         6
     B         100           0               0         6
  3. Initial pods come up:
     Revision  spec percent  status percent  num pods  desired pods
     A         0             100             6         6
     B         100           0               2         6
  4. Route reacts and drives traffic over:
     Revision  spec percent  status percent  num pods  desired pods
     A         0             66              6         6
     B         100           33              2         6
  5. KPA reacts and scales down A:
     Revision  spec percent  status percent  num pods  desired pods
     A         0             66              6         4
     B         100           33              2         6
  6. Kube tears down pods from A, allowing B to schedule:
     Revision  spec percent  status percent  num pods  desired pods
     A         0             66              4         4
     B         100           33              4         6
  7. Route rebalances based on pods available:
     Revision  spec percent  status percent  num pods  desired pods
     A         0             33              4         4
     B         100           66              4         6
  8. KPA scales A down again:
     Revision  spec percent  status percent  num pods  desired pods
     A         0             33              4         2
     B         100           66              4         6
  9. After the next Kube + route rebalance, plus KPA reaction (I'm combining three steps here because the tables are getting old):
     Revision  spec percent  status percent  num pods  desired pods
     A         0             0               2         0
     B         100           100             6         6
  10. And then Kube cleans up the last two pods from A.

Note that this will have the side effect in a large cluster of temporarily consuming a large amount of capacity during a "hot runLatest" deployment. It should be possible to use Pod limits on the namespace level to mitigate this effect.

vagababov commented 4 years ago

Mostly along those lines, but:

  1. we need to make sure we don't scale down B while the traffic shift is happening, since KPA is pretty aggressive. Hence my suggestion of starting the new revision in panic mode. Or we can invent a new mode of operation, of course.
  2. The new revision should not necessarily be scaled to 6 (in your example), but to RevATotalCapacity/RevBTargetCapacity, since CC/Target/TU can all change from A to B.

Which brings up the question:

How does route controller know how much of traffic of podA can be shifted to an instance of revision B?

evankanderson commented 4 years ago

With respect to 1, if we set the target that KPA requests out of Kubernetes to:

requests * max(spec.assignment, status.assignment) / observed-requests-per-container

I don't think we need to start in panic mode; we can start directly at what we think we need (i.e. 6 pods in this situation). Note that this requires deriving observed-requests-per-container for revision B based on data from revision A until we have enough information about B's behavior.
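
To make the arithmetic concrete (illustrative numbers): with 600 requests across the Route, spec.assignment = 1.0 for B, status.assignment = 0, and observed-requests-per-container ≈ 100 borrowed from A, we get 600 * max(1.0, 0) / 100 = 6 pods requested for B up front.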

Good point that ContainerConcurrency and targets can change from one Revision to another. I'd assumed a code rollout without changing those parameters in my example, but we might need to determine a scaling factor between A and B if any of the autoscaling parameters change between the two revisions.

vagababov commented 4 years ago

If we don't start in panic mode, then within 2s (the autoscaler reconcile period) it'll suggest a scale of less than 6 and immediately apply it, so I think we need it. Obviously, if we trick observed-requests-per-container to be taken from revision A, we can avoid it. But I think starting in panic mode is more natural. But I guess that's debatable.

knative-housekeeping-robot commented 4 years ago

Issues go stale after 90 days of inactivity. Mark the issue as fresh by adding the comment /remove-lifecycle stale. Stale issues rot after an additional 30 days of inactivity and eventually close. If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

duglin commented 4 years ago

@vagababov should we keep this one alive?

/cc @julz

vagababov commented 4 years ago

/remove-lifecycle stale
/lifecycle frozen

Yes, we should keep this.

vagababov commented 4 years ago

Actually, /assign

Let me see what I can do here.

markusthoemmes commented 4 years ago

Moved to 0.18 as I believe @vagababov is actively working on this.

markusthoemmes commented 3 years ago

@vagababov is this done per your recent work?

vagababov commented 3 years ago

95% done. Docs and more tests remain.

dprotaso commented 3 years ago

@vagababov For the remaining work are you still tackling it?

Otherwise are you able to write up some new issues for other folks to pick up?

vagababov commented 3 years ago

/close

Done.

knative-prow-robot commented 3 years ago

@vagababov: Closing this issue.

In response to [this](https://github.com/knative/serving/issues/4665#issuecomment-795816962):

> /close
>
> Done.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.