kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

kubernetes.io/hostname label not working in NodeAffinity for Aerospike Kubernetes Operator #1596

Open abhishekdwivedi3060 opened 3 weeks ago

abhishekdwivedi3060 commented 3 weeks ago

Similar issue in Karpenter: https://github.com/aws/karpenter-provider-aws/issues/4671
Duplicate of https://github.com/aws/karpenter-provider-aws/issues/6844
Related issue in Aerospike: https://github.com/aerospike/aerospike-kubernetes-operator/issues/305

Use-case: Aerospike Kubernetes Operator (AKO) has a feature called k8sNodeBlockList, a list of K8s node names on which Aerospike Cluster pods must not be scheduled. It helps users with K8s cluster maintenance by migrating pods to other K8s nodes. AKO implements it with the kubernetes.io/hostname label and the NotIn operator in the pod's NodeAffinity, which moves pods away from the listed nodes:

        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: NotIn
                  values:
                  - gke-abhisek-test-default-pool-d04arw3-r5ts
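
For context, a minimal sketch of the AerospikeCluster fragment that would lead AKO to generate the affinity above. It assumes k8sNodeBlockList sits directly under spec of the AerospikeCluster resource in the asdb.aerospike.com/v1 API group; the resource name and namespace are hypothetical, and the other required spec fields are left out.

```yaml
apiVersion: asdb.aerospike.com/v1
kind: AerospikeCluster
metadata:
  name: aerocluster        # hypothetical name
  namespace: aerospike     # hypothetical namespace
spec:
  # Other required fields (size, image, storage, ...) omitted from this sketch.
  # Nodes listed here are excluded from scheduling for Aerospike pods.
  k8sNodeBlockList:
  - gke-abhisek-test-default-pool-d04arw3-r5ts
```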

Issue: Karpenter has a sweeping check that blocks the kubernetes.io/hostname label in NodeAffinity. Ref code: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/apis/v1/labels.go#L91 As a result, pending pods with a kubernetes.io/hostname NodeAffinity stay in the Pending state because Karpenter doesn't scale up a K8s node for them.

Questions:

  1. Is there a plan to remove the sweeping check for the kubernetes.io/hostname label?
  2. Is it possible to block only the In operator and allow the NotIn operator for the kubernetes.io/hostname label?
  3. Is there a workaround to bypass that check?

k8s-ci-robot commented 3 weeks ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

jwcesign commented 2 weeks ago

/assign

jonathan-innis commented 4 days ago

Can you use taints to achieve the same thing?

The problem with us allowing hostname affinities is that we can't guarantee that the node that we launch will actually be able to schedule the pods. I'll admit -- hostname affinity with NotIn is going to be much more likely to succeed. We definitely can't allow In affinities.
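
For reference, a minimal sketch of what the taint-based alternative could look like on the node from the example earlier in this thread. The taint key and value are hypothetical. With the NoSchedule effect, pods without a matching toleration are not scheduled onto the node, and pods already running there are left in place; the NoExecute effect would additionally evict running pods that do not tolerate the taint.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: gke-abhisek-test-default-pool-d04arw3-r5ts
spec:
  taints:
  # Hypothetical key/value; any pod lacking a matching toleration
  # will not be scheduled onto this node.
  - key: example.com/maintenance
    value: "true"
    effect: NoSchedule
```

Setting this requires write access to Node resources.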

abhishekdwivedi3060 commented 2 days ago

Hi @jonathan-innis, thanks for responding.

I agree that something similar can be achieved by using Taints. However, the rationale behind doing it using hostname along with NotIn operator is:

  1. AKO doesn’t have permissions on the Node resource, so it can’t add a taint on its own. Many users don’t want to grant AKO that level of permission.
  2. If we ask users to taint the nodes themselves, it will be an extra step for the user. Also, some AKO users don’t have direct access to the infra, so they can’t do that; a separate infra team handles it.
  3. Adding taints with the NoExecute effect will evict all the running pods on that node. This may result in data loss (unless a PodDisruptionBudget is used) if multiple Aerospike DB pods are running on that node. AKO handles this scenario by moving one pod at a time.

High-level use-case: Migrate Aerospike pods off a given list of K8s nodes without AKO touching the infra (for example, by tainting the node) and without asking the user to do it manually (user-friendly).