AyushSawant18588 opened 1 month ago
I went through the service, rollout-orchestrator, and stagePodAutoscaler reconciler code and found the following. In the service reconciler, the traffic percentage and target replicas are calculated for the revisions scaling up and down in each rollout stage. Whenever the service is updated while the current revision has 1 min replica, the computed stageTrafficDelta is 100, meaning 100 percent of the traffic is to be shifted from the current revision to the new revision in a single rollout stage. In that stage, the current revision's percent is set to nil and its target replicas to 0, while the new revision's percent is set to 100 and its target replicas to its min replicas (Code reference).

Because of the 0 target replicas, the stagePodAutoscaler sets min and max replicas to 0 and 1 respectively for the current revision, so the current revision's pods are terminated before the new revision's pods can become ready. Since the new revision's actual replicas are less than its desired replicas, the route keeps directing traffic to the old revision until all pods of the new revision become ready, which causes the problem mentioned in this issue.
To address this problem, we have the following suggestion: when the traffic-shift scenario described above occurs (i.e., this case is true), instead of setting the traffic percentage to nil and the target replicas to 0 for the current (last) revision, set its traffic percentage to 0 and its target replicas to 1. For the new revision, set the traffic percentage to 100 and the target replicas to 1. This adjustment ensures a graceful traffic transfer, as detailed below:
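The suggested change can be sketched as a small helper. This is an illustrative sketch only, not the actual reconciler API: the names `stageTarget` and `computeStageTargets` are hypothetical, and all stages other than the full 100% shift are elided.

```go
package main

import "fmt"

// stageTarget is a hypothetical, simplified view of what the service
// reconciler computes per revision for one rollout stage.
type stageTarget struct {
	percent        *int64 // nil means the revision is dropped from the route
	targetReplicas int32
}

func ptr(v int64) *int64 { return &v }

// computeStageTargets sketches the suggested behavior: when all 100% of
// traffic must move in a single stage, keep the old revision routable at
// 0% with one replica instead of removing it outright, so its pods are not
// terminated before the new revision becomes ready.
func computeStageTargets(stageTrafficDelta int64) (oldRev, newRev stageTarget) {
	if stageTrafficDelta == 100 {
		// Current behavior would be oldRev = {nil, 0}; suggested behavior:
		oldRev = stageTarget{percent: ptr(0), targetReplicas: 1}
		newRev = stageTarget{percent: ptr(100), targetReplicas: 1}
		return oldRev, newRev
	}
	// Partial stages are unchanged by the suggestion (simplified here).
	oldRev = stageTarget{percent: ptr(100 - stageTrafficDelta), targetReplicas: 1}
	newRev = stageTarget{percent: ptr(stageTrafficDelta), targetReplicas: 1}
	return oldRev, newRev
}

func main() {
	oldRev, newRev := computeStageTargets(100)
	fmt.Printf("old: percent=%d replicas=%d\n", *oldRev.percent, oldRev.targetReplicas)
	fmt.Printf("new: percent=%d replicas=%d\n", *newRev.percent, newRev.targetReplicas)
}
```

With percent 0 (instead of nil) the old revision stays on the route, and with target replicas 1 (instead of 0) its pod is kept alive until the new revision is ready to take the traffic.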
@AyushSawant18588 The suggestion you mentioned is more like the Availability strategy, which works by scaling up the new revision first and then scaling down the old revision (in each stage). The purpose of this strategy is to make sure there is no service downtime, while tolerating that the total resources consumed can temporarily be higher than what is currently being used.
The resourceUtil strategy works the opposite way: it scales down the old revision first and then scales up the new revision (in each stage). The purpose of this strategy is to make sure that the rollout will NEVER consume more resources than what is currently being used.
It is better to keep these basic mechanisms as described above when implementing both strategies in each of their stages.
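The difference between the two strategies' per-stage orderings can be illustrated as follows. This is a rough sketch, not the actual rollout-orchestrator code: `scaleStep` and the two plan functions are hypothetical names.

```go
package main

import "fmt"

// scaleStep is a hypothetical record of one scaling action within a stage.
type scaleStep struct {
	revision string
	replicas int32
}

// stagePlanAvailability scales the new revision up first, then the old one
// down, so serving capacity never dips (at the cost of temporarily
// consuming more resources than are currently in use).
func stagePlanAvailability(oldRev, newRev string, oldTarget, newTarget int32) []scaleStep {
	return []scaleStep{
		{newRev, newTarget}, // scale up new first
		{oldRev, oldTarget}, // then scale down old
	}
}

// stagePlanResourceUtil scales the old revision down first, then the new
// one up, so total consumption never exceeds the current level (at the
// cost of a temporary capacity dip).
func stagePlanResourceUtil(oldRev, newRev string, oldTarget, newTarget int32) []scaleStep {
	return []scaleStep{
		{oldRev, oldTarget}, // scale down old first
		{newRev, newTarget}, // then scale up new
	}
}

func main() {
	fmt.Println("availability:", stagePlanAvailability("rev-00001", "rev-00002", 0, 1))
	fmt.Println("resourceUtil:", stagePlanResourceUtil("rev-00001", "rev-00002", 0, 1))
}
```

The ordering is the whole trade-off: in a resource-constrained cluster, resourceUtil's scale-down-first step is exactly what removes the old pods before the new ones are schedulable.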
So far, the resourceUtil strategy is more vulnerable compared to availability.
I will verify this use case, because I am still not sure what the reason for this behavior could be, or whether it is expected or not.
@houshengbo Thanks for your response. However, we are seeing issues in both strategies: when the quota limit is hit, a stuck state is reached that does not resolve on its own. To summarize our testing and the main issues: we used a resource-constrained cluster.
Availability strategy: on quota limit hit, both revisions were present and stuck even after the timeout, with the overall service in a Pending state (scenario - 200):

```
NAME                   READY   STATUS    RESTARTS   AGE
predictor-00001-xyz    2/2     Running   0
predictor-00002-abc    0/2     Pending   0
predictor-00002-xdqq   2/2     Running   0
```
ResourceUtil strategy: on quota limit hit, the older revision was deleted and the new revision was only partially up. The service was in an Active state, with traffic pointing to the old revision (issue - 203):

```
NAME                  READY   STATUS    RESTARTS   AGE
predictor-00002-abc   0/2     Pending   0
predictor-00002-xdq   2/2     Running   0
```

```
Traffic:
  Latest Revision:  false
  Percent:          0
  Revision Name:    predictor-00001
  Latest Revision:  true
  Percent:          100
  Revision Name:    predictor-00001
```
The rollout process does not recover from these states under either strategy.
Steps to reproduce, in a K8s cluster with 1 GPU:
1. Initially, apply the isvc with replicas set to 1.
2. Update the isvc replicas to 2.

With this one update, we notice 2 revisions. Initial state: revision 1. After the update: revision 2.
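For reference, the reproduction roughly corresponds to a spec like the one below. This is a hypothetical minimal KServe InferenceService: the name `deploy1` matches the revision names in the status output, but the model and resource details are placeholders for our actual spec.

```yaml
# Applied first with minReplicas: 1, then updated to 2,
# which creates the second revision and triggers the rollout.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: deploy1
spec:
  predictor:
    minReplicas: 2   # was 1 in the initial apply
    maxReplicas: 2
    model:
      modelFormat:
        name: sklearn          # placeholder model
      resources:
        limits:
          nvidia.com/gpu: "1"  # the cluster has only 1 GPU in total
```

Because the cluster has a single GPU, the second replica (and any new-revision pod scheduled while the old one still holds the GPU) stays Pending, which is the state shown above.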
Transitions noticed:

```
Traffic:
  Latest Revision:  false
  Percent:          0
  Revision Name:    deploy1-predictor-00001
  Latest Revision:  true
  Percent:          100
  Revision Name:    deploy1-predictor-00001
```