knative-extensions / serving-progressive-rollout

Knative Serving extension to roll out the revision progressively
Apache License 2.0

With the resourceUtil strategy, traffic transfer does not occur during an update when deployment hits quota limit #203

Open AyushSawant18588 opened 1 month ago

AyushSawant18588 commented 1 month ago

In a K8s cluster with 1 GPU, we initially apply an isvc with replicas set to 1:

NAME                                                  READY   STATUS    RESTARTS   AGE
deploy1-predictor-00001-deployment-868df87c79-6k4sx   2/2     Running   0          79s

Then we update isvc replicas to 2.

With this one update, we notice 2 revisions: Revision 1 in the initial state and Revision 2 after the update.

Transitions noticed:

NAME                                                 READY   STATUS    RESTARTS   AGE
deploy1-predictor-00002-deployment-5b7d9c4f7-4fwz8   1/2     Running   0          21s
deploy1-predictor-00002-deployment-5b7d9c4f7-w9slk   0/2     Pending   0          21s

Traffic:
        Latest Revision:  false
        Percent:          0
        Revision Name:    deploy1-predictor-00001
        Latest Revision:  true
        Percent:          100
        Revision Name:    deploy1-predictor-00001

- Since the GPU is used by the revision 2 pod, revision 1 is stuck in a Pending state.
- At this stage, even though the GPU is being used and one revision 2 pod is in the running state, the inference requests fail, and this state does not correct itself.

This behavior occurs during an update to an isvc that has replicas set to 1 and is deployed in a cluster with no spare resources.

Is this the expected behavior for the scenario mentioned above?
AyushSawant18588 commented 1 month ago

I went through the service, rollout-orchestrator, and stagePodAutoscaler reconciler code and found the following:

In the service reconciler, the traffic percentage and target replicas are calculated for the revisions scaling up and down in each rollout stage. Whenever there is an update to a service whose current revision has 1 min replica, the calculated stageTrafficDelta is 100, which means that 100 percent of the traffic is to be shifted from the current revision to the new revision. So in this rollout stage the current revision's percent is set to nil and its target replicas to 0, while the new revision's percent is set to 100 and its target replicas to its min replicas (Code reference).

Because of the 0 target replicas, the stage pod autoscaler sets min and max replicas to 0 and 1 respectively for the current revision, so the pods of the current revision are terminated before the pods of the new revision can become ready. Since the new revision's actual replicas are less than its desired replicas, the route still directs traffic to the old revision until all pods of the new revision become ready, which is causing the problem mentioned in this issue.
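
For illustration, here is a minimal, self-contained Go sketch of the calculation described above. The names (computeStageTargets, RevisionTarget, ptr) and the exact formula are simplifications, not the actual reconciler code; it only models the reported single-stage behavior for the 1-replica case:

```go
package main

import "fmt"

// RevisionTarget is a simplified stand-in for the per-revision values the
// service reconciler computes for each rollout stage (illustrative only).
type RevisionTarget struct {
	Name           string
	Percent        *int64 // nil means the revision is dropped from the traffic targets
	TargetReplicas int32
}

func ptr(v int64) *int64 { return &v }

// computeStageTargets models the behavior described above: with the old
// revision at 1 min replica, stageTrafficDelta works out to 100, so the whole
// traffic shift happens in a single stage and the old revision ends up with a
// nil percent and 0 target replicas.
func computeStageTargets(oldRev, newRev string, oldMin, newMin int32) (oldT, newT RevisionTarget) {
	stageTrafficDelta := int64(100) / int64(oldMin) // 100 when oldMin is 1

	if stageTrafficDelta >= 100 {
		oldT = RevisionTarget{Name: oldRev, Percent: nil, TargetReplicas: 0}
		newT = RevisionTarget{Name: newRev, Percent: ptr(100), TargetReplicas: newMin}
		return oldT, newT
	}
	// Multi-replica case: shift stageTrafficDelta percent per stage.
	oldT = RevisionTarget{Name: oldRev, Percent: ptr(100 - stageTrafficDelta), TargetReplicas: oldMin - 1}
	newT = RevisionTarget{Name: newRev, Percent: ptr(stageTrafficDelta), TargetReplicas: 1}
	return oldT, newT
}

func main() {
	oldT, newT := computeStageTargets("deploy1-predictor-00001", "deploy1-predictor-00002", 1, 2)
	fmt.Printf("old: targetReplicas=%d percent=%v\n", oldT.TargetReplicas, oldT.Percent) // percent=<nil>
	fmt.Printf("new: targetReplicas=%d percent=%d\n", newT.TargetReplicas, *newT.Percent)
	// The stage pod autoscaler then sets min replicas to 0 for the old revision,
	// so its pod is terminated before the new revision's pods can become ready.
}
```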

To address this problem, we have the following suggestion: when the traffic-shift scenario described above occurs (this case is true), instead of setting the traffic percentage to nil and the target replicas to 0 for the current (last) revision, set the traffic percentage to 0 and the target replicas to 1. For the new revision, set the traffic percentage to 100 and the target replicas to 1. This adjustment ensures a graceful traffic transfer.
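
In terms of the simplified sketch in the previous comment, the suggestion would roughly amount to changing the single-stage branch as follows (again only an illustration of the idea, not a patch against the actual reconciler):

```go
// Hypothetical change to the stageTrafficDelta >= 100 branch of the earlier
// sketch: keep the last revision alive at 0% traffic with one replica instead
// of dropping it, so traffic can be handed over once the new revision is ready.
if stageTrafficDelta >= 100 {
	oldT = RevisionTarget{Name: oldRev, Percent: ptr(0), TargetReplicas: 1}
	newT = RevisionTarget{Name: newRev, Percent: ptr(100), TargetReplicas: 1}
	return oldT, newT
}
```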

houshengbo commented 3 weeks ago

@AyushSawant18588 The suggestion you mentioned is more like the availability strategy, which works by scaling up the new revision first and then scaling down the old revision (in each stage). The purpose of this strategy is to make sure there is no service downtime, and we can tolerate the total resources consumed being higher than what is currently being used.

The resourceUtil strategy works the opposite way, scaling down the old revision first and then scaling up the new revision (in each stage). The purpose of this strategy is to make sure that the rollout will NEVER consume more resources than what is currently being used.

It is better to keep these basic mechanisms as described above when implementing both strategies in each of their stages.

So far, the resourceUtil strategy is more vulnerable compared to availability.
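
For readers comparing the two strategies, here is a rough Go sketch of the per-stage ordering described above; the function names are purely illustrative and not the extension's actual API:

```go
package main

import "fmt"

// Each rollout stage performs the same two steps; only their order differs.
type step func(stage int)

// Availability: bring the new revision up first, then scale the old one down,
// so serving capacity never drops, at the cost of temporarily using more
// resources than the current footprint.
func availabilityStage(scaleUpNew, scaleDownOld step, stage int) {
	scaleUpNew(stage)
	scaleDownOld(stage)
}

// ResourceUtil: scale the old revision down first, then bring the new one up,
// so resource usage never exceeds the current footprint, at the cost of
// capacity temporarily dropping while the new pods become ready.
func resourceUtilStage(scaleUpNew, scaleDownOld step, stage int) {
	scaleDownOld(stage)
	scaleUpNew(stage)
}

func main() {
	up := func(s int) { fmt.Printf("  stage %d: scale up new revision\n", s) }
	down := func(s int) { fmt.Printf("  stage %d: scale down old revision\n", s) }

	fmt.Println("availability:")
	availabilityStage(up, down, 1)
	fmt.Println("resourceUtil:")
	resourceUtilStage(up, down, 1)
}
```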

houshengbo commented 3 weeks ago

I will verify this use case, because I am still not sure what could be the reason for this behavior, or whether it is expected or not.

AyushSawant18588 commented 2 weeks ago

@houshengbo Thanks for your response. However, we are seeing issues with both strategies: when the quota limit is hit, a stuck state is reached that does not resolve on its own. To summarize our testing and the main issues, we used a resource-constrained cluster.

In the case of the availability strategy, we noticed that when the quota limit was hit, both revisions were present and remained stuck even after the timeout. The overall service was in a Pending state (scenario - 200):

NAME                   READY   STATUS    RESTARTS   AGE
predictor-00001-xyz    2/2     Running   0
predictor-00002-abc    0/2     Pending   0
predictor-00002-xdqq   2/2     Running   0

In the case of the resourceUtil strategy, we noticed that when the quota limit was hit, the older revision was deleted and the new revision was only partially up. The service was in an Active state, with traffic pointing to the old revision (issue - 203):

NAME                  READY   STATUS    RESTARTS   AGE
predictor-00002-abc   0/2     Pending   0
predictor-00002-xdq   2/2     Running   0

Traffic:
    Latest Revision: false
    Percent:     0
    Revision Name:  predictor-00001
    Latest Revision: true
    Percent:     100
    Revision Name:  predictor-00001

The rollout process does not recover from these states under either strategy.