kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

K8s Cluster Autoscaler on Self-Managed Kubernetes setup on AWS: no node group config and node not registered #4662

Closed: RafaelMoreira1180778 closed this issue 2 years ago

RafaelMoreira1180778 commented 2 years ago

Which component are you using?: Cluster-Autoscaler

What version of the component are you using?: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.1-2+b7120affd6631a", GitCommit:"b7120affd6631a91e76b96bbc38375a6681ef547", GitTreeState:"clean", BuildDate:"2022-01-11T15:52:38Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/arm64"}

What environment is this in?: AWS EC2 with MicroK8s

What did you expect to happen?:

I expected the Cluster-Autoscaler to fetch the current status of my ASG and adjust the ASG's desired capacity according to the needs of the cluster. I also expected the CA to describe the tags and recognize that the ASG carries the correct tags.

What happened instead?:

Got the error Node should not be processed by cluster autoscaler (no node group config):

I0202 17:05:04.286617       1 static_autoscaler.go:230] Starting main loop
I0202 17:05:04.287206       1 filter_out_schedulable.go:65] Filtering out schedulables
I0202 17:05:04.287217       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0202 17:05:04.287223       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0202 17:05:04.287227       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0202 17:05:04.287234       1 filter_out_schedulable.go:82] No schedulable pods
I0202 17:05:04.287247       1 static_autoscaler.go:419] No unschedulable pods
I0202 17:05:04.287259       1 static_autoscaler.go:466] Calculating unneeded nodes
I0202 17:05:04.287278       1 pre_filtering_processor.go:57] Node ip-10-0-0-251 should not be processed by cluster autoscaler (no node group config)
I0202 17:05:04.287304       1 static_autoscaler.go:520] Scale down status: unneededOnly=false lastScaleUpTime=2022-02-02 16:03:43.860986168 +0000 UTC m=-3572.925062255 lastScaleDownDeleteTime=2022-02-02 16:03:43.860986168 +0000 UTC m=-3572.925062255 lastScaleDownFailTime=2022-02-02 16:03:43.860986168 +0000 UTC m=-3572.925062255 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
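
The "no node group config" message is emitted for any node that the CA cannot map to one of its discovered node groups; on AWS that mapping is done through the node's spec.providerID. As a sanity check (a sketch only; the node name comes from the log above), the registered provider IDs can be listed with:

kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID
# For an autoscaled worker the value should have the form aws:///eu-west-1a/i-xxxxxxxxxxxxxxxxx
# and the instance ID must belong to one of the tag-discovered ASGs.
# If ip-10-0-0-251 is the master node, this particular message is expected,
# since the master is intentionally not part of any ASG.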

Got the error:

1 unregistered nodes present
Removing unregistered node aws:///eu-west-1a/i-05173d9190cca5ebf
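
The "unregistered node" error describes the opposite situation: the CA sees instance i-05173d9190cca5ebf inside a discovered ASG, but no Kubernetes Node object with a matching providerID has registered, so the instance is removed once --max-node-provision-time (15m0s per the flags dump below) expires. One way to confirm which ASG the CA thinks the instance belongs to (a sketch; the region is taken from the deployment manifest):

aws autoscaling describe-auto-scaling-instances \
  --instance-ids i-05173d9190cca5ebf \
  --region eu-west-1
# The output shows AutoScalingGroupName and LifecycleState for the instance;
# compare the instance ID against the providerIDs reported by kubectl above.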

How to reproduce it (as minimally and precisely as possible):

Manifest Deployed:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "namespaces"
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources:
      ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resourceNames: ["cluster-autoscaler"]
    resources: ["leases"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames:
      ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
    verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      priorityClassName: system-cluster-critical
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
      serviceAccountName: cluster-autoscaler
      tolerations:
        - effect: NoSchedule
          operator: "Equal"
          value: "true"
          key: node-role.kubernetes.io/master
      nodeSelector:
        kubernetes.io/role: master
      containers:
        - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0
          name: cluster-autoscaler
          env:
            - name: AWS_DEFAULT_REGION
              value: eu-west-1
            - name: AWS_REGION
              value: eu-west-1
          resources:
            limits:
              cpu: 100m
              memory: 600Mi
            requests:
              cpu: 100m
              memory: 600Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/GP-ARM-ASG
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-certificates.crt"
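The --node-group-auto-discovery value asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/GP-ARM-ASG asks the CA to discover every ASG that carries both tag keys (the tag values are not evaluated for discovery). One way to double-check that the ASG really exposes both keys to the autoscaling API (a sketch, using the region from the manifest above):

aws autoscaling describe-tags --region eu-west-1 \
  --filters "Name=key,Values=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/GP-ARM-ASG"
# Each matching tag is listed with its ResourceId (the ASG name);
# both keys should appear for the same ASG.
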
AWS EC2 Instance IAM permissions:
{
        "Effect": "Allow",
        "Action": [
            "autoscaling:DescribeAutoScalingGroups",
            "autoscaling:DescribeAutoScalingInstances",
            "autoscaling:DescribeLaunchConfigurations",
            "autoscaling:SetDesiredCapacity",
            "autoscaling:TerminateInstanceInAutoScalingGroup",
            "autoscaling:DescribeTags",
            "ec2:DescribeTags",
            "ec2:DescribeInstances",
            "ec2:DescribeLaunchTemplateVersions"
        ],
        "Resource": ["*"]
    }
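A quick way to confirm that the instance profile carrying this policy is actually picked up on the master node where the CA runs (a sketch only; the region is the one used in the manifest):

aws sts get-caller-identity
# should return the role attached to the instance profile
aws autoscaling describe-auto-scaling-groups --region eu-west-1 \
  --query 'AutoScalingGroups[].AutoScalingGroupName'
# should list the worker ASG without an AccessDenied error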
CA Parameters reported by the pod logs:
flags.go:57] FLAG: --add-dir-header="false"
flags.go:57] FLAG: --address=":8085"
flags.go:57] FLAG: --alsologtostderr="false"
flags.go:57] FLAG: --aws-use-static-instance-list="false"
flags.go:57] FLAG: --balance-similar-node-groups="false"
flags.go:57] FLAG: --balancing-ignore-label="[]"
flags.go:57] FLAG: --cloud-config=""
flags.go:57] FLAG: --cloud-provider="aws"
flags.go:57] FLAG: --cloud-provider-gce-l7lb-src-cidrs="130.211.0.0/22,35.191.0.0/16"
flags.go:57] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
flags.go:57] FLAG: --cluster-name=""
flags.go:57] FLAG: --clusterapi-cloud-config-authoritative="false"
flags.go:57] FLAG: --cordon-node-before-terminating="false"
flags.go:57] FLAG: --cores-total="0:320000"
flags.go:57] FLAG: --daemonset-eviction-for-empty-nodes="false"
flags.go:57] FLAG: --daemonset-eviction-for-occupied-nodes="true"
flags.go:57] FLAG: --emit-per-nodegroup-metrics="false"
flags.go:57] FLAG: --estimator="binpacking"
flags.go:57] FLAG: --expander="least-waste"
flags.go:57] FLAG: --expendable-pods-priority-cutoff="-10"
flags.go:57] FLAG: --feature-gates=""
flags.go:57] FLAG: --gce-concurrent-refreshes="1"
flags.go:57] FLAG: --gpu-total="[]"
flags.go:57] FLAG: --ignore-daemonsets-utilization="false"
flags.go:57] FLAG: --ignore-mirror-pods-utilization="false"
flags.go:57] FLAG: --ignore-taint="[]"
flags.go:57] FLAG: --kubeconfig=""
flags.go:57] FLAG: --kubernetes=""
flags.go:57] FLAG: --leader-elect="true"
flags.go:57] FLAG: --leader-elect-lease-duration="15s"
flags.go:57] FLAG: --leader-elect-renew-deadline="10s"
flags.go:57] FLAG: --leader-elect-resource-lock="leases"
flags.go:57] FLAG: --leader-elect-resource-name="cluster-autoscaler"
flags.go:57] FLAG: --leader-elect-resource-namespace=""
flags.go:57] FLAG: --leader-elect-retry-period="2s"
flags.go:57] FLAG: --log-backtrace-at=":0"
flags.go:57] FLAG: --log-dir=""
flags.go:57] FLAG: --log-file=""
flags.go:57] FLAG: --log-file-max-size="1800"
flags.go:57] FLAG: --logtostderr="true"
flags.go:57] FLAG: --max-autoprovisioned-node-group-count="15"
flags.go:57] FLAG: --max-bulk-soft-taint-count="10"
flags.go:57] FLAG: --max-bulk-soft-taint-time="3s"
flags.go:57] FLAG: --max-empty-bulk-delete="10"
flags.go:57] FLAG: --max-failing-time="15m0s"
flags.go:57] FLAG: --max-graceful-termination-sec="600"
flags.go:57] FLAG: --max-inactivity="10m0s"
flags.go:57] FLAG: --max-node-provision-time="15m0s"
flags.go:57] FLAG: --max-nodes-total="0"
flags.go:57] FLAG: --max-total-unready-percentage="45"
flags.go:57] FLAG: --memory-total="0:6400000"
flags.go:57] FLAG: --min-replica-count="0"
flags.go:57] FLAG: --namespace="kube-system"
flags.go:57] FLAG: --new-pod-scale-up-delay="0s"
flags.go:57] FLAG: --node-autoprovisioning-enabled="false"
flags.go:57] FLAG: --node-deletion-delay-timeout="2m0s"
flags.go:57] FLAG: --node-group-auto-discovery="[asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/GP-ARM-ASG]"
flags.go:57] FLAG: --nodes="[]"
flags.go:57] FLAG: --ok-total-unready-count="3"
flags.go:57] FLAG: --one-output="false"
flags.go:57] FLAG: --profiling="false"
flags.go:57] FLAG: --regional="false"
flags.go:57] FLAG: --scale-down-candidates-pool-min-count="50"
flags.go:57] FLAG: --scale-down-candidates-pool-ratio="0.1"
flags.go:57] FLAG: --scale-down-delay-after-add="10m0s"
flags.go:57] FLAG: --scale-down-delay-after-delete="0s"
flags.go:57] FLAG: --scale-down-delay-after-failure="3m0s"
flags.go:57] FLAG: --scale-down-enabled="true"
flags.go:57] FLAG: --scale-down-gpu-utilization-threshold="0.5"
flags.go:57] FLAG: --scale-down-non-empty-candidates-count="30"
flags.go:57] FLAG: --scale-down-unneeded-time="10m0s"
flags.go:57] FLAG: --scale-down-unready-time="20m0s"
flags.go:57] FLAG: --scale-down-utilization-threshold="0.5"
flags.go:57] FLAG: --scale-up-from-zero="true"
flags.go:57] FLAG: --scan-interval="10s"
flags.go:57] FLAG: --skip-headers="false"
flags.go:57] FLAG: --skip-log-headers="false"
flags.go:57] FLAG: --skip-nodes-with-local-storage="false"
flags.go:57] FLAG: --skip-nodes-with-system-pods="true"
flags.go:57] FLAG: --status-config-map-name="cluster-autoscaler-status"
flags.go:57] FLAG: --stderrthreshold="0"
flags.go:57] FLAG: --unremovable-node-recheck-timeout="5m0s"
flags.go:57] FLAG: --user-agent="cluster-autoscaler"
flags.go:57] FLAG: --v="4"
flags.go:57] FLAG: --vmodule=""
flags.go:57] FLAG: --write-status-configmap="true"

Anything else we need to know?:

There is 1 master node that is not part of any ASG; the rest of the cluster is formed by EC2 instances launched by the ASG.

The nodes inside the Kubernetes cluster have the correct providerID set.

The EC2 instance where the CA is running (the master node) can run aws sts get-caller-identity and all the other commands needed to retrieve the required resource tags.

The ASG tags are the following (they are set to propagate on launch, so they are also the tags of every EC2 instance inside the cluster, except for the master node):

k8s.io/cluster-autoscaler/enabled                       true
k8s.io/cluster-autoscaler/GP-ARM-ASG                    owned
k8s.io/cluster-autoscaler/node-template/label/type      gp
kubernetes.io/cluster/GP-ARM-ASG                        owned
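
Note that the k8s.io/cluster-autoscaler/node-template/label/type tag only tells the CA which labels to assume on nodes that do not exist yet (scale-up from zero); nodes that are already running must carry the label themselves. A quick cross-check (a sketch; the label key "type" is taken from the tag above):

kubectl get nodes -L type
# The TYPE column should show "gp" on the ASG-managed workers.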
k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

RafaelMoreira1180778 commented 2 years ago

/close

k8s-ci-robot commented 2 years ago

@RafaelMoreira1180778: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/4662#issuecomment-1137133335):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
rmoreira commented 1 year ago

@RafaelMoreira1180778 were you able to resolve this issue?

rubancar commented 1 year ago

At the end, how did you manage to solve this issue?