knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Properly Handle Pods with schedulingGates #15308

Open tenzen-y opened 2 months ago

tenzen-y commented 2 months ago

/area autoscale

Describe the feature

I would like to request that Pods with schedulingGates (a.k.a. gated Pods) be handled appropriately. I'm wondering whether the KPA should avoid scaling a Deployment with gated Pods down to zero.

For now, as a workaround, we can avoid the accidental automatic scale-in by setting the maximum value (2147483647) in progressDeadlineSeconds.
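For illustration, a minimal Go sketch of that workaround, assuming the per-revision serving.knative.dev/progress-deadline annotation is the intended knob and accepts a duration string; the exact key, value format, and propagation to the Deployment should be verified against the Knative docs for your version:

    package main

    import "fmt"

    func main() {
        // Hedged sketch: annotation to place on the Service's revision template
        // metadata. 2147483647 is the int32 maximum that the Deployment's
        // progressDeadlineSeconds field can hold; "2147483647s" assumes the
        // annotation value is parsed as a Go duration string.
        annotations := map[string]string{
            "serving.knative.dev/progress-deadline": "2147483647s",
        }
        fmt.Println(annotations)
    }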

background

Currently, the Knative Pod Autoscaler (KPA) automatically scales a Deployment whose Pods are not ready because they are unschedulable down to zero, based on progressDeadlineSeconds, to improve resource consumption, and it does not consider why the Pods are unschedulable. Additionally, we cannot set progressDeadlineSeconds to null, because the field is propagated to the Deployment as well.

https://github.com/knative/serving/blob/3aa9210a0d1368c01ec6a7ccfa9e3db5a564a5a5/pkg/reconciler/autoscaling/kpa/scaler.go#L199-L204

https://github.com/knative/serving/blob/3aa9210a0d1368c01ec6a7ccfa9e3db5a564a5a5/pkg/deployment/config.go#L102
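For readers skimming the linked code, a minimal Go sketch (not the actual Knative implementation) of the decision the scaler effectively makes: once the revision has failed to become ready for longer than the progress deadline, it is scaled to zero, with no check on why the Pods are unschedulable.

    package main

    import (
        "fmt"
        "time"
    )

    // shouldScaleToZero is an illustrative stand-in for the linked scaler.go logic:
    // the only input is how long the revision has not been ready versus the
    // configured progress deadline; the *reason* the Pods are unschedulable
    // (e.g. scheduling gates) is never consulted.
    func shouldScaleToZero(notReadySince time.Time, progressDeadline time.Duration) bool {
        return time.Since(notReadySince) > progressDeadline
    }

    func main() {
        notReadySince := time.Now().Add(-15 * time.Minute)
        fmt.Println(shouldScaleToZero(notReadySince, 10*time.Minute)) // true: deadline exceeded
    }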

However, being unschedulable due to a scheduling gate is intentional behavior: an external controller, such as a quota management/queueing system, is deliberately preventing the Pods from being scheduled onto any Node.

With the current KPA implementation, once the gated Pods are removed because the Deployment is scaled to zero, the quota management/queueing system can never admit those Pods into the cluster.
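To make the scenario concrete, here is an illustrative Go snippet (using the upstream k8s.io/api types) of a Pod held back by an external quota/queueing controller via spec.schedulingGates; the gate name is hypothetical, and only that controller removing the gate makes the Pod schedulable:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        pod := corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "gated-revision-pod"},
            Spec: corev1.PodSpec{
                // The gate name is hypothetical; a real controller (e.g. Kueue)
                // uses its own gate and removes it once quota is granted.
                SchedulingGates: []corev1.PodSchedulingGate{
                    {Name: "example.com/quota-admission"},
                },
                Containers: []corev1.Container{
                    {Name: "user-container", Image: "example.com/app:latest"},
                },
            },
        }
        fmt.Println(pod.Spec.SchedulingGates)
    }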

skonto commented 2 months ago

Hi @tenzen-y, thanks for the feature request. The K8s behavior is to expire the Deployment (progress deadline) and, if the schedulingGates are later removed, the Deployment progresses successfully. If we want to support this feature natively, I think we could delay activation on the scaler side until the schedulingGates are removed and the revision becomes active. We could check whether a revision has schedulingGates in its pod spec and mark it (or ask the user to add an annotation), so that we can later avoid the scale-down on the scaler side. @dprotaso wdyt?
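A rough Go sketch of the "mark it or ask the user to add an annotation" idea; the annotation key and helper below are hypothetical, not an existing Knative API:

    package scaler

    // gatedAnnotation is a hypothetical per-revision annotation a user (or the
    // reconciler) could set when the pod spec carries schedulingGates.
    const gatedAnnotation = "autoscaling.knative.dev/delay-scale-to-zero-while-gated"

    // keepScaledUp reports whether the scaler should skip the scale-to-zero path
    // for a revision marked as gated, until the gates have been removed.
    func keepScaledUp(revisionAnnotations map[string]string) bool {
        return revisionAnnotations[gatedAnnotation] == "true"
    }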

tenzen-y commented 2 months ago

Hi @tenzen-y, thanks for the feature request.

The K8s behavior is to expire the Deployment (progress deadline) and, if the schedulingGates are later removed, the Deployment progresses successfully.

If we want to support this feature natively, I think we could delay activation on the scaler side until the schedulingGates are removed and the revision becomes active. We could check whether a revision has schedulingGates in its pod spec and mark it (or ask the user to add an annotation), so that we can later avoid the scale-down on the scaler side. @dprotaso wdyt?

Hi @skonto, thank you for responding to my request. That sounds reasonable to me.

Additionally, I'm curious whether we should calculate the not-ready duration from the time the Pod was ungated instead of from its creationTimestamp. But I'm not sure whether my suggestion and understanding are correct, since I'm not a Knative expert at the implementation level.

skonto commented 2 months ago

Hi @tenzen-y, Pods will be in the following status when gated:

        "status": {
            "conditions": [
                {
                    "lastProbeTime": null,
                    "lastTransitionTime": null,
                    "message": "Scheduling is blocked due to non-empty scheduling gates",
                    "reason": "SchedulingGated",
                    "status": "False",
                    "type": "PodScheduled"
                }
            ],
            "phase": "Pending",
            "qosClass": "BestEffort"
        }
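An illustrative Go helper (not existing Knative code) that detects exactly this condition from a Pod's status:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
    )

    // isSchedulingGated reports whether a Pod is still blocked by scheduling gates,
    // based on the PodScheduled condition shown in the status above.
    func isSchedulingGated(pod *corev1.Pod) bool {
        for _, cond := range pod.Status.Conditions {
            if cond.Type == corev1.PodScheduled &&
                cond.Status == corev1.ConditionFalse &&
                cond.Reason == "SchedulingGated" {
                return true
            }
        }
        return false
    }

    func main() {
        gated := &corev1.Pod{Status: corev1.PodStatus{Conditions: []corev1.PodCondition{{
            Type: corev1.PodScheduled, Status: corev1.ConditionFalse, Reason: "SchedulingGated",
        }}}}
        fmt.Println(isSchedulingGated(gated)) // true
    }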

In the logic where we handle scaling to zero, we could do the following:

a) if we are in the activating phase and the revision has scheduling gates, then:

a1) check via the pod accessor (we initialize one in kpa.go anyway) that no pod is pending due to scheduling gates and that no gates are still set. If pods still have gates, enqueue again after activationTimeout.

a2) otherwise, use the difference between now and the minimum lastTransitionTime of the "PodScheduled" condition across the pods to decide whether we should scale to zero as usual (don't use how long the PA has been in Unknown status, as we do now)

a3) if that time has passed, scale down to zero as usual; otherwise enqueue again after activationTimeout (a rough sketch of a1-a3 follows below)
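A hedged Go sketch of a1-a3, written against plain k8s.io/api types rather than Knative's pod accessor (whose exact API isn't shown here); names and return values are illustrative only:

    package main

    import (
        "fmt"
        "time"

        corev1 "k8s.io/api/core/v1"
    )

    // decide sketches steps a1-a3: keep waiting while any pod is still gated,
    // otherwise measure the "not scheduled" time from the earliest PodScheduled
    // transition (instead of from how long the PA has been Unknown) and scale to
    // zero only once that exceeds the deadline.
    func decide(pods []corev1.Pod, deadline time.Duration, now time.Time) string {
        var earliest time.Time
        for _, p := range pods {
            if len(p.Spec.SchedulingGates) > 0 {
                return "enqueue again after activationTimeout" // a1: still gated
            }
            for _, c := range p.Status.Conditions {
                if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionFalse {
                    if t := c.LastTransitionTime.Time; earliest.IsZero() || t.Before(earliest) {
                        earliest = t // a2: minimum lastTransitionTime across pods
                    }
                }
            }
        }
        if !earliest.IsZero() && now.Sub(earliest) > deadline {
            return "scale to zero as usual" // a3: deadline exceeded after ungating
        }
        return "enqueue again after activationTimeout"
    }

    func main() {
        gated := corev1.Pod{Spec: corev1.PodSpec{
            SchedulingGates: []corev1.PodSchedulingGate{{Name: "example.com/quota-admission"}},
        }}
        fmt.Println(decide([]corev1.Pod{gated}, 10*time.Minute, time.Now()))
    }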

For now, as a workaround, we can avoid the accidental automatic scale-in by setting the maximum value

Btw right now Serving does not support schedulingGates in its spec, and anything you specify there is ignored.

I guess some external controller gates the pods without Serving knowing about it; is that the use case we want to support?

If you are willing to contribute back, I would recommend capturing a potential solution in a feature doc (see https://github.com/knative/community/blob/17debd2a1727709a47e8763a9e68b56f1a2f5fb1/CONTRIBUTING.md#design-documents for how to create one) so others can review the solution, and then doing a PR as a PoC to test it.

skonto commented 2 months ago

Another "gate" thing that we haven't considered is readinessGates:

    // If specified, all readiness gates will be evaluated for pod readiness.
    // A pod is ready when all its containers are ready AND
    // all conditions specified in the readiness gates have status equal to "True"
    // More info: https://git.k8s.io/enhancements/keps/sig-network/580-pod-readiness-gates
    // +optional
    ReadinessGates []PodReadinessGate `json:"readinessGates,omitempty" protobuf:"bytes,28,opt,name=readinessGates"`

See here for how this feature is used. cc @dprotaso @ReToCode
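For reference, an illustrative Go snippet of how a readiness gate is declared on a Pod; the condition type is hypothetical, and some external controller would have to patch a status condition of that type to "True" before the Pod counts as Ready:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
    )

    func main() {
        pod := corev1.Pod{
            Spec: corev1.PodSpec{
                // Hypothetical condition type; an external controller sets it on
                // the Pod's status once its own readiness check passes.
                ReadinessGates: []corev1.PodReadinessGate{
                    {ConditionType: "example.com/backend-registered"},
                },
                Containers: []corev1.Container{
                    {Name: "user-container", Image: "example.com/app:latest"},
                },
            },
        }
        fmt.Println(pod.Spec.ReadinessGates)
    }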

tenzen-y commented 2 months ago

Thank you for the suggestions and for sharing your investigation. I agree that we should prepare an enhancement proposal, but I don't have enough bandwidth to drive this enhancement, so feel free to take this issue.

I guess some external controller gates the pods without Serving knowing about it, is that the use case we want to support?

I have a use case in a machine learning cluster with mixed-lifecycle workloads. We deploy training Jobs and serving servers into a single cluster, and to improve GPU utilization efficiency, a quota management controller (in my cluster, that is Kueue) decides whether a workload (typically a Job or a Knative Service Pod) should be admitted to the cluster or should preempt other lower-priority workloads.

Although this is beyond my use case, if Knative natively supported external quota management systems like Kueue, it would also enable better cluster autoscaling via the Cluster Autoscaler ProvisioningRequest.

tenzen-y commented 2 months ago

Another "gate" thing that we haven't considered is readinessGates:

  // If specified, all readiness gates will be evaluated for pod readiness.
  // A pod is ready when all its containers are ready AND
  // all conditions specified in the readiness gates have status equal to "True"
  // More info: https://git.k8s.io/enhancements/keps/sig-network/580-pod-readiness-gates
  // +optional
  ReadinessGates []PodReadinessGate `json:"readinessGates,omitempty" protobuf:"bytes,28,opt,name=readinessGates"`

See here for how this feature is used. cc @dprotaso @ReToCode

Yeah, I'm familiar with the Pod ReadinessGate, but that feature doesn't quite match the schedulingGate use case. I would propose handling the Pod ReadinessGate as a separate feature.

skonto commented 2 months ago

I would propose handling the Pod ReadinessGate as a separate feature.

Sure, I was just adding it as a reference for other work, not something to be handled here.