kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Cluster autoscaler ignore node taints? #4231

Closed anemptyair closed 1 day ago

anemptyair commented 3 years ago

Which component are you using?: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.6

Is your feature request designed to solve a problem? If so, describe the problem this feature should solve.: The cluster autoscaler should be able to ignore node taints.

We have added some taints to our cluster nodes. Autoscaling fails when there are enough nodes, but those nodes carry taints that prevent the pods from being scheduled onto them. Can the cluster autoscaler ignore node taints?

Describe the solution you'd like.:

Describe any alternative solutions you've considered.:

Additional context.:

bpineau commented 3 years ago

If those taint names are known in advance, there's a `--ignore-taint` flag to that effect.
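For illustration, a minimal sketch of how that flag could be wired into the cluster-autoscaler Deployment args (the taint key `dedicated` is only a placeholder, and availability of the flag depends on the cluster-autoscaler version you run):

```yaml
# Hypothetical excerpt of a cluster-autoscaler Deployment spec;
# repeat --ignore-taint once per taint key that should be ignored.
spec:
  containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.6
    command:
    - ./cluster-autoscaler
    - --v=4
    - --ignore-taint=dedicated            # placeholder taint key
    - --ignore-taint=example.com/special  # placeholder taint key
```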

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pierluigilenoci commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pierluigilenoci commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pierluigilenoci commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pierluigilenoci commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pierluigilenoci commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pierluigilenoci commented 1 year ago

/remove-lifecycle stale

vadasambar commented 1 year ago

@pierluigilenoci would these be the right steps to reproduce this issue?

  1. Scale up a node
  2. Add a taint to the node
  3. Deploy a workload (Deployment/Pod) which wants to schedule on the new node (but can't because it doesn't have tolerations for the recently added taint)

cluster-autoscaler will not scale up thinking it can fit the workload on the tainted node.
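For context, what the workload in step 3 is missing is a toleration for the newly added taint. A minimal sketch of such a toleration, assuming the `test=true:NoSchedule` taint used in the reproduction below (key and value are placeholders):

```yaml
# Hypothetical pod spec fragment: with this toleration the pod could land on
# the tainted node; without it, it stays Pending even though the node has room.
spec:
  tolerations:
  - key: "test"        # placeholder taint key
    operator: "Equal"
    value: "true"      # placeholder taint value
    effect: "NoSchedule"
```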

pierluigilenoci commented 1 year ago

@vadasambar exactly.

vadasambar commented 1 year ago

cluster-autoscaler might not be considering the newly added taints, as you say. This might need deeper investigation.

vadasambar commented 1 year ago

I can reproduce the problem.

Steps to reproduce

  1. Deploy your own CA (cluster-autoscaler) (how? check this)
  2. Induce scale up (use workload with anti-affinity to make CA bring up new nodes)
    
    kubectl scale deploy node-scale-up-with-pod-anti-affinity --replicas=4
    deployment.apps/node-scale-up-with-pod-anti-affinity scaled
![image](https://user-images.githubusercontent.com/34534103/226533719-38dec2da-a441-4288-b34a-42c4d143a3b7.png)

  3. Taint the new node

    kubectl taint no gke-cluster-1-default-pool-7c54e36a-h2gg test=true:NoSchedule
    node/gke-cluster-1-default-pool-7c54e36a-h2gg tainted

  4. Create workload to schedule on the tainted node

    kubectl create deploy nginx --image=nginx --replicas=1
    deployment.apps/nginx created

    Use `nodeSelector` to target the tainted node
![image](https://user-images.githubusercontent.com/34534103/226533880-bbe0e867-cb31-4cb6-bf62-c5db7e08bb75.png)

  5. Pod gets stuck in `Pending` forever
![image](https://user-images.githubusercontent.com/34534103/226533079-96bcb5dd-1a21-4623-aec9-efe06b999931.png)

Example scale up workload
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-scale-up-with-pod-anti-affinity
  namespace: default
spec:
  selector:
    matchLabels:
      app: node-scale-up-with-pod-anti-affinity
  replicas: 1
  template:
    metadata:
      labels:
        app: node-scale-up-with-pod-anti-affinity
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - node-scale-up-with-pod-anti-affinity
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: node-scale-up-with-pod-anti-affinity
        image: registry.k8s.io/pause:2.0
```

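For completeness, a sketch of what the step-4 pod template could look like; it pins the pod to the tainted node via `nodeSelector` but carries no toleration for `test=true:NoSchedule`, which is why it stays `Pending` (the hostname label value is the tainted node from step 3):

```yaml
# Hypothetical pod template fragment for the nginx Deployment from step 4.
spec:
  nodeSelector:
    kubernetes.io/hostname: gke-cluster-1-default-pool-7c54e36a-h2gg
  containers:
  - name: nginx
    image: nginx
```
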
pierluigilenoci commented 1 year ago

@vadasambar, your example is not quite 100% accurate. In the real scenario, the new pod should end up on the tainted node because the other nodes don't have enough resources, so pinning it there with a `nodeSelector` forces things too much. Instead, you could cordon all nodes except the tainted one and see what happens, or, even better, put pods on the other nodes that consume enough resources so the new pod doesn't have enough room on them.
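For reference, a rough sketch of the cordon approach (node names other than the tainted one are placeholders):

```shell
# Cordon every node except the tainted one (placeholder names), then
# re-create the test workload and watch whether cluster-autoscaler scales up.
kubectl cordon gke-cluster-1-default-pool-7c54e36a-aaaa
kubectl cordon gke-cluster-1-default-pool-7c54e36a-bbbb
# gke-cluster-1-default-pool-7c54e36a-h2gg (the tainted node) stays schedulable.
```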

vadasambar commented 1 year ago

@pierluigilenoci maybe you can share the cluster-autoscaler log snippet around this problem? (anything that looks fishy)

pierluigilenoci commented 1 year ago

@vadasambar, I had this problem about two years ago, and in the meantime, I've found external workarounds to fix it. So I no longer have a log attached to these events.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pierluigilenoci commented 1 year ago

/remove-lifecycle stale

vadasambar commented 1 year ago

@pierluigilenoci if this is important, please bring it up in the sig-autoscaling meeting. Sorry, I don't have the bandwidth to look deeper into this :pray:

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pierluigilenoci commented 7 months ago

It's still a problem, but I haven't had a chance to do anything about it.

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

pierluigilenoci commented 6 months ago

😢

pierluigilenoci commented 6 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pierluigilenoci commented 2 months ago

😭

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

pierluigilenoci commented 1 month ago

😿

k8s-triage-robot commented 1 day ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 day ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/autoscaler/issues/4231#issuecomment-2302076270):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

pierluigilenoci commented 14 hours ago

/remove-lifecycle rotten

pierluigilenoci commented 14 hours ago

/reopen

k8s-ci-robot commented 14 hours ago

@pierluigilenoci: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/autoscaler/issues/4231#issuecomment-2304055730):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.