kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Zonal Volume Requirements Break Topology Spread Constraints #1239

Open jmdeal opened 5 months ago

jmdeal commented 5 months ago

Description

Background

Over the past few months we've received a couple of issues in the aws-provider repository regarding TopologySpreadConstraints with StatefulSets.

In both cases the observed behavior was that Karpenter provisioned nodes as expected initially, but after scaling down and back up, Karpenter either provisions nodes that violate* the TopologySpreadConstraint or is unable to provision new nodes (allegedly due to TSC violation).

* I use violate loosely here; the TSC may be respected from kube-scheduler's perspective, but Karpenter does not spread pods across all of the domains it is aware of.

Reproduction and Validation

Reproducing this issue is relatively straightforward. The following NodePools and StatefulSet were used:

NodePools:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-np
spec:
  disruption:
    consolidateAfter: 10m
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        # Virtual capacity-spread domain, 2/3 of nodes should launch on on-demand
        - key: capacity-spread
          operator: In
          values: ['1', '2']
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-np
spec:
  disruption:
    consolidateAfter: 10m
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ['c', 'm', 'r']
      - key: karpenter.sh/capacity-type
        operator: In
        values: ['spot']
      # Virtual capacity-spread domain, 1/3 of nodes should launch on spot
      - key: capacity-spread
        operator: In
        values: ['3']

StatefulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: repro
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: repro
  template:
    metadata:
      labels:
        app.kubernetes.io/name: repro
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: repro
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 100m
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: capacity-spread
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: "repro"
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: "repro"
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      labels:
        app.kubernetes.io/name: repro
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      volumeMode: Filesystem

Each NodePool specifies a different capacity type, on-demand and spot. Additionally, each NodePool constrains the virtual capacity-spread label: the on-demand NodePool allows values 1 and 2, and the spot NodePool allows value 3, for three possible domains in total. The StatefulSet defines a TopologySpreadConstraint on this label with maxSkew: 1, so the replicas should spread evenly across the three domains; since two of the domains belong to the on-demand NodePool and one to the spot NodePool, 2/3 of the pods should schedule to on-demand nodes and the remaining 1/3 to spot nodes.

To reproduce, scale the StatefulSet up, scale it down, and then scale it back up.

On the initial scale-up, Karpenter correctly provisions nodes: two on-demand nodes and one spot node, each in a different AZ (due to the zonal TSC on the StatefulSet). When we scale down and then scale back up, Karpenter provisions three spot nodes, each in a different AZ. The pods are able to schedule, since kube-scheduler is only aware of the single domain (3), but this isn't what we expect Karpenter to do.

While in that previous scenario the pods were able to schedule, we can create a scenario where Karpenter gets stuck and pods are left pending. Consider the following node distribution:

Zone    Nodes (capacity-type and capacity-spread value)
A       spot-3, od-2
B       od-1
C       spot-3

We now scale the StatefulSet up to 3 replicas, and pods schedule to the spot node in zone A and the on-demand node in zone B. The third pod is left pending while Karpenter continues to nominate it for the spot node in zone C, even though scheduling it there would violate the TopologySpreadConstraint with a skew of 2 between domains 2 and 3.
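
To make that skew arithmetic explicit, here is a minimal sketch of the TSC skew definition (not Karpenter or kube-scheduler code), applied to the counts above: one pod in domain 1, none in domain 2, and two in domain 3 once the third pod lands on the zone C spot node.

package main

import "fmt"

// skew returns max(count) - min(count) across the given topology domains,
// which is how a TopologySpreadConstraint's skew is defined.
func skew(counts map[string]int) int {
	first := true
	var min, max int
	for _, c := range counts {
		if first {
			min, max, first = c, c, false
			continue
		}
		if c < min {
			min = c
		}
		if c > max {
			max = c
		}
	}
	return max - min
}

func main() {
	// Placement from the example above: the zone A spot node (domain "3") and
	// the zone B on-demand node (domain "1") each hold a pod; nominating the
	// third pod for the zone C spot node would put a second pod in domain "3".
	counts := map[string]int{"1": 1, "2": 0, "3": 2}
	fmt.Println(skew(counts)) // 2, which exceeds maxSkew: 1
}

kube-scheduler sees all three capacity-spread domains, so it refuses to place the third pod on the zone C spot node, which is why it stays pending.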

Diagnosis

During scheduling, Karpenter needs to consider not only the requirements on the pod itself, but also the requirements on the pod's volumes. This typically comes in the form of an additional zonal requirement. Internally, Karpenter models this by adding a zonal node affinity term to the pod (in-memory only). Unfortunately, this causes issues for TopologySpreadConstraints.
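
As a rough illustration of that in-memory transformation (a sketch only, not Karpenter's actual code; the function name and zone value are placeholders), the effect is as if the volume's zone were appended to the pod's required node affinity:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// injectVolumeZoneRequirement illustrates the effect described above: once a
// pod's volume is known to be bound to a particular zone, the pod behaves (in
// Karpenter's scheduling simulation) as if it carried a required node affinity
// term pinning it to that zone.
func injectVolumeZoneRequirement(pod *corev1.Pod, zone string) {
	req := corev1.NodeSelectorRequirement{
		Key:      "topology.kubernetes.io/zone",
		Operator: corev1.NodeSelectorOpIn,
		Values:   []string{zone},
	}
	if pod.Spec.Affinity == nil {
		pod.Spec.Affinity = &corev1.Affinity{}
	}
	if pod.Spec.Affinity.NodeAffinity == nil {
		pod.Spec.Affinity.NodeAffinity = &corev1.NodeAffinity{}
	}
	na := pod.Spec.Affinity.NodeAffinity
	if na.RequiredDuringSchedulingIgnoredDuringExecution == nil {
		na.RequiredDuringSchedulingIgnoredDuringExecution = &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{}},
		}
	}
	// Node selector terms are ORed, so the zonal requirement is added to every term.
	terms := na.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
	for i := range terms {
		terms[i].MatchExpressions = append(terms[i].MatchExpressions, req)
	}
}

func main() {
	pod := &corev1.Pod{}
	injectVolumeZoneRequirement(pod, "us-west-2a") // placeholder zone
	fmt.Printf("%+v\n", pod.Spec.Affinity.NodeAffinity)
}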

When computing skew across domains, only nodes that match a pod's NodeAffinity are counted by default. Since we inject a NodeAffinity term onto the pod after discovering its volume requirements, the nodes included when computing skew are limited to the nodes in that volume's zone. This lines up with the behavior seen in the reproduction. In the first case, where all nodes brought up in the second scale-up were spot, we only counted the single node in each zone for each pod, so we believed there was a skew of 0 for each pod's TSC. There's a similar story in the second example, where we believed there was a skew of 1 for the pod in zone A, 0 in zone B, and 0 in zone C.
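
A minimal sketch of that effect (again illustrative, not the actual implementation): if pods are only tallied on nodes in the pod's volume zone, whole capacity-spread domains drop out of the calculation and the computed skew collapses to 0.

package main

import "fmt"

// node is a simplified view of a candidate node: its zone and the value of its
// capacity-spread label (the TSC's topology domain).
type node struct {
	zone   string
	domain string
}

// domainCounts mirrors the buggy behavior: pods are only counted on nodes that
// match the pod's node affinity, and the injected volume-zone term means only
// nodes in volumeZone qualify.
func domainCounts(podsPerNode map[node]int, volumeZone string) map[string]int {
	counts := map[string]int{}
	for n, pods := range podsPerNode {
		if n.zone != volumeZone { // filtered out by the injected zonal term
			continue
		}
		counts[n.domain] += pods
	}
	return counts
}

func main() {
	// First reproduction, second scale-up: one spot node (domain "3") per zone,
	// each holding one pod of the StatefulSet.
	podsPerNode := map[node]int{
		{zone: "A", domain: "3"}: 1,
		{zone: "B", domain: "3"}: 1,
		{zone: "C", domain: "3"}: 1,
	}
	// For the pod whose volume is bound to zone A, only the zone A node is
	// counted: domains "1" and "2" vanish from the tally, so the computed skew
	// is 0 even though every pod landed in the same capacity-spread domain.
	fmt.Println(domainCounts(podsPerNode, "A")) // map[3:1]
}

With domains 1 and 2 invisible to the calculation, any placement looks compliant to Karpenter, which matches the all-spot result in the first reproduction.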

The issue laid out in https://github.com/aws/karpenter-provider-aws/issues/6141 is a little different in terms of symptoms (unable to provision), but the same zonal-injection behavior is to blame. In that issue there were 5 pods in the StatefulSet, each with zonal and hostname anti-affinity. When the StatefulSet was scaled down to 3 pods and back up to 5, Karpenter failed to provision additional nodes, believing the TSC would be violated. Since Karpenter only counts nodes in the same zone as the volume when computing skew, yet it is aware of all zones, it believes it must create a node in a different zone to fulfill the TSC requirement. Since that is not possible, provisioning fails.

Versions:

jmdeal commented 5 months ago

/assign @jmdeal

jmdeal commented 5 months ago

/triage accepted

vadasambar commented 3 weeks ago

Any idea when this might be fixed? This prevents us from adopting Karpenter.