kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Support for Pod Topology Spread Constraints #1848

Closed xuzeng012 closed 6 days ago

xuzeng012 commented 11 months ago

Pod Topology Spread Constraints are important for enhancing high availability and fault tolerance by ensuring that pods are distributed across different topologies, such as nodes in different availability zones or failure domains. This feature would be highly beneficial for users who want to deploy Spark workloads in multi-zone or multi-cluster environments to improve reliability and resilience. Users should be able to configure constraints such as maxSkew, minDomains, topologyKey, and whenUnsatisfiable for Spark pods to ensure they are spread across different nodes and failure domains.
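For context, these are the standard Kubernetes pod-level fields the issue asks the operator to expose. This is a sketch of a plain pod-spec fragment, not operator syntax; where it would live in a `SparkApplication` is exactly what this issue requests, and `my-spark-app` is a placeholder label value:

```yaml
# Kubernetes pod-spec fragment (sketch); field names are from the core Pod API.
topologySpreadConstraints:
- maxSkew: 1                                # max allowed difference in matching pod count between domains
  minDomains: 2                             # require at least 2 eligible domains before scheduling
  topologyKey: topology.kubernetes.io/zone  # spread across availability zones
  whenUnsatisfiable: DoNotSchedule          # hard constraint; ScheduleAnyway would make it best-effort
  labelSelector:
    matchLabels:
      spark-app-name: my-spark-app          # placeholder: select this application's pods
```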

ruhz3 commented 10 months ago

The Spark Operator does support podAntiAffinity, so something like this is probably the best we can do for now:

```yaml
spec:
  executor:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: spark-app-name
                operator: In
                values:
                - {{ your service name }}
            topologyKey: kubernetes.io/hostname
```

korjek commented 8 months ago

@ruhz3 podAntiAffinity can't be used as a replacement for topology spread constraints (TSC). For example, if you want to run 10 executor pods spread across at least two nodes, this can't be achieved with affinity: anti-affinity only expresses pairwise repulsion between pods, while a spread constraint bounds the skew in pod count between topology domains.
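As a sketch of why spread constraints express this while anti-affinity cannot: the following plain pod-spec fragment (hypothetical here, since the operator did not expose this field at the time) would guarantee that 10 executors never all land on one node, because placing a sixth pod on a node while another eligible node has zero would exceed the allowed skew:

```yaml
# Pod-spec fragment (sketch): with 10 executors and DoNotSchedule, no node may
# hold more than 5 pods above the minimum, so at least two nodes must be used.
topologySpreadConstraints:
- maxSkew: 5
  topologyKey: kubernetes.io/hostname   # each node is a separate domain
  whenUnsatisfiable: DoNotSchedule      # hard guarantee, not best-effort
  labelSelector:
    matchLabels:
      spark-app-name: my-spark-app      # placeholder: select this application's executors
```

A preferred anti-affinity rule, by contrast, only adds a per-pod scoring preference; the scheduler is still free to pack all 10 executors onto a single node.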

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 6 days ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.