kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

feat: manage `PodDisruptionBudget` for `SparkApplication` driver and executor #2326

Open a7i opened 1 week ago

a7i commented 1 week ago

Purpose of this PR

Provide the ability to create PodDisruptionBudget per Spark Application

Proposed changes:

Change Category

Rationale

Our Spark pipelines cannot be interrupted, so during a node drain we want to prevent eviction of the driver and executor pods. Once the pipeline is complete, the node can be drained. This is natively supported via a `PodDisruptionBudget` with `maxUnavailable: 0`.
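For context, a minimal sketch of such a PDB (the `sparkoperator.k8s.io/app-name` label key is an assumption based on the labels the operator applies to its pods; the application name `my-pipeline` and namespace are hypothetical):

```yaml
# Hypothetical PDB that blocks voluntary eviction of every pod
# belonging to the SparkApplication "my-pipeline".
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pipeline-pdb
  namespace: spark-jobs
spec:
  # No pod matching the selector may be voluntarily evicted.
  maxUnavailable: 0
  selector:
    matchLabels:
      sparkoperator.k8s.io/app-name: my-pipeline
```

With `maxUnavailable: 0`, a `kubectl drain` (which respects PDBs) will refuse to evict the matching pods until the application finishes and its pods terminate on their own.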

Checklist

Additional Notes

google-oss-prow[bot] commented 1 week ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign yuchaoran2011 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/spark-operator/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

missedone commented 1 week ago

@a7i, I'm wondering if we can use the new pod template feature to specify the PDB, per PR https://github.com/kubeflow/spark-operator/pull/2141

a7i commented 1 week ago

> @a7i, I'm wondering if we can use the new pod template feature to specify the PDB, per PR #2141

@missedone looks like a useful PR! How would the pod template control the PDB definition, though? Are you suggesting we implement a single PDB that prevents pods with a common label from being evicted?

missedone commented 1 week ago

Ah right, it's a PDB, which needs its own configuration item. My brain was stuck :(

jacobsalway commented 1 week ago

Happy to take a look at the PR as well, because this may be a useful feature. But if this is specifically for node draining, you could instead add annotations to prevent eviction: `karpenter.sh/do-not-disrupt: "true"` for Karpenter and `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` for cluster-autoscaler.
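A sketch of how those annotations might be attached through the `SparkApplication` spec (assuming the v1beta2 CRD's per-role `annotations` fields; the application name is hypothetical):

```yaml
# Sketch: annotate driver and executor pods so neither Karpenter nor
# cluster-autoscaler voluntarily disrupts them while the job runs.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-pipeline
spec:
  driver:
    annotations:
      karpenter.sh/do-not-disrupt: "true"
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  executor:
    annotations:
      karpenter.sh/do-not-disrupt: "true"
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```

Note these annotations only influence the autoscalers themselves; as the follow-up comments point out, a manual `kubectl drain` ignores them, which is why a PDB is still needed for user-initiated drains.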

a7i commented 1 week ago

"cluster-autoscaler.kubernetes.io/safe-to-evict"

Thanks! The context here is nodegroup upgrades, where we drain the nodes from the old nodegroup ourselves, so Karpenter and cluster-autoscaler don't come into play.

Cian911 commented 1 week ago

This would be a great addition - thanks @a7i

jacobsalway commented 1 week ago

"cluster-autoscaler.kubernetes.io/safe-to-evict"

Thanks! The context here is nodegroup upgrades, so we drain the nodes from the old nodegroup. so karpenter or cluster-autoscaler don't come into play.

Ah, so this is a user-initiated drain, not one done automatically by a node provisioner (e.g. Karpenter drift detection or node consolidation). In that case I definitely understand your need to provision PDBs.

Will review this tonight.