kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

AKS CA doesn't work: can't communicate with apiserver #1807

Closed mnothic closed 5 years ago

mnothic commented 5 years ago

Error:

F0319 21:03:36.597392       1 main.go:355] Failed to get nodes from apiserver: Get https://dev-pisclk8s-f30ccea9.hcp.westus2.azmk8s.io:443/api/v1/nodes: dial tcp: i/o timeout
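
The "dial tcp: i/o timeout" means the autoscaler pod never reached the API server at all. A quick way to reproduce the failure from inside the cluster (a sketch; "net-test" is a hypothetical scratch pod name) is:

# DNS resolution of the API server FQDN from inside a pod
kubectl run net-test --rm -it --restart=Never --image=busybox -- \
  nslookup dev-pisclk8s-f30ccea9.hcp.westus2.azmk8s.io
# raw TCP reachability on 443
kubectl run net-test --rm -it --restart=Never --image=busybox -- \
  nc -zv -w 5 dev-pisclk8s-f30ccea9.hcp.westus2.azmk8s.io 443

If either step hangs, the problem is in-cluster networking (CNI/DNS) rather than the autoscaler itself.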

Configuration:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
  resources: ["events","endpoints"]
  verbs: ["create", "patch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods/status"]
  verbs: ["update"]
- apiGroups: [""]
  resources: ["endpoints"]
  resourceNames: ["cluster-autoscaler"]
  verbs: ["get","update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["watch","list","get","update"]
- apiGroups: [""]
  resources: ["pods","services","replicationcontrollers","persistentvolumeclaims","persistentvolumes"]
  verbs: ["watch","list","get"]
- apiGroups: ["extensions"]
  resources: ["replicasets","daemonsets"]
  verbs: ["watch","list","get"]
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["watch","list"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["watch","list","get"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["cluster-autoscaler-status"]
  verbs: ["delete","get","update"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: v1
data:
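  # Secret data values are base64-encoded, e.g. produced with: echo -n '<value>' | base64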
  ClientID: <omitted>
  ClientSecret: <omitted>
  ResourceGroup: cGlzY2xkZXY=
  SubscriptionID: YTYwMThkZWYtNzcwZS00N2NjLThlODAtOTYyYzU4YWQ5ZjVj
  TenantID: M2JlYTQ3OGMtMTY4NC00YThjLThlODUtMDQ1ZWM1NGJhNDMw
  VMType: QUtT
  ClusterName: ZGV2LXBpc2Nsazhz
  NodeResourceGroup: TUNfcGlzY2xkZXZfZGV2LXBpc2NsazhzX3dlc3R1czI=
kind: Secret
metadata:
  name: cluster-autoscaler-azure
  namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      restartPolicy: Always
      serviceAccountName: cluster-autoscaler
      containers:
      - image: k8s.gcr.io/cluster-autoscaler:v1.13.2
        imagePullPolicy: Always
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 300Mi
        command:
        - ./cluster-autoscaler
        - --v=3
        - --logtostderr=true
        - --cloud-provider=azure
        - --skip-nodes-with-local-storage=false
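        # --nodes takes <min>:<max>:<node pool name>; the line below scales the "devminion" pool between 3 and 12 nodes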
        - --nodes=3:12:devminion
        env:
        - name: ARM_SUBSCRIPTION_ID
          valueFrom:
            secretKeyRef:
              key: SubscriptionID
              name: cluster-autoscaler-azure
        - name: ARM_RESOURCE_GROUP
          valueFrom:
            secretKeyRef:
              key: ResourceGroup
              name: cluster-autoscaler-azure
        - name: ARM_TENANT_ID
          valueFrom:
            secretKeyRef:
              key: TenantID
              name: cluster-autoscaler-azure
        - name: ARM_CLIENT_ID
          valueFrom:
            secretKeyRef:
              key: ClientID
              name: cluster-autoscaler-azure
        - name: ARM_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              key: ClientSecret
              name: cluster-autoscaler-azure
        - name: ARM_VM_TYPE
          valueFrom:
            secretKeyRef:
              key: VMType
              name: cluster-autoscaler-azure
        - name: AZURE_CLUSTER_NAME
          valueFrom:
            secretKeyRef:
              key: ClusterName
              name: cluster-autoscaler-azure
        - name: AZURE_NODE_RESOURCE_GROUP
          valueFrom:
            secretKeyRef:
              key: NodeResourceGroup
              name: cluster-autoscaler-azure
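
With the manifests applied, two quick checks rule out RBAC and secret mistakes before suspecting networking (a sketch using standard kubectl; the key name matches the Secret above):

# should print "yes" if the ClusterRole/ClusterRoleBinding took effect
kubectl auth can-i list nodes \
  --as=system:serviceaccount:kube-system:cluster-autoscaler
# should print the plain-text resource group if the base64 value is valid
kubectl -n kube-system get secret cluster-autoscaler-azure \
  -o jsonpath='{.data.ResourceGroup}' | base64 -d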

pod log:

kubectl -n kube-system logs -f deployments/cluster-autoscaler
I0319 21:03:06.595447       1 flags.go:52] FLAG: --address=":8085"
I0319 21:03:06.595465       1 flags.go:52] FLAG: --alsologtostderr="false"
I0319 21:03:06.595469       1 flags.go:52] FLAG: --balance-similar-node-groups="false"
I0319 21:03:06.595473       1 flags.go:52] FLAG: --cloud-config=""
I0319 21:03:06.595476       1 flags.go:52] FLAG: --cloud-provider="azure"
I0319 21:03:06.595480       1 flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I0319 21:03:06.595486       1 flags.go:52] FLAG: --cluster-name=""
I0319 21:03:06.595490       1 flags.go:52] FLAG: --cores-total="0:320000"
I0319 21:03:06.595493       1 flags.go:52] FLAG: --estimator="binpacking"
I0319 21:03:06.595497       1 flags.go:52] FLAG: --expander="random"
I0319 21:03:06.595500       1 flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
I0319 21:03:06.595504       1 flags.go:52] FLAG: --gke-api-endpoint=""
I0319 21:03:06.595507       1 flags.go:52] FLAG: --gpu-total="[]"
I0319 21:03:06.595511       1 flags.go:52] FLAG: --httptest.serve=""
I0319 21:03:06.595514       1 flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
I0319 21:03:06.595518       1 flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
I0319 21:03:06.595521       1 flags.go:52] FLAG: --kubeconfig=""
I0319 21:03:06.595525       1 flags.go:52] FLAG: --kubernetes=""
I0319 21:03:06.595528       1 flags.go:52] FLAG: --leader-elect="true"
I0319 21:03:06.595534       1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I0319 21:03:06.595539       1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I0319 21:03:06.595542       1 flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
I0319 21:03:06.595547       1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I0319 21:03:06.595551       1 flags.go:52] FLAG: --log-backtrace-at=":0"
I0319 21:03:06.595557       1 flags.go:52] FLAG: --log-dir=""
I0319 21:03:06.595560       1 flags.go:52] FLAG: --log-file=""
I0319 21:03:06.595564       1 flags.go:52] FLAG: --logtostderr="true"
I0319 21:03:06.595567       1 flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
I0319 21:03:06.595571       1 flags.go:52] FLAG: --max-empty-bulk-delete="10"
I0319 21:03:06.595574       1 flags.go:52] FLAG: --max-failing-time="15m0s"
I0319 21:03:06.595578       1 flags.go:52] FLAG: --max-graceful-termination-sec="600"
I0319 21:03:06.595582       1 flags.go:52] FLAG: --max-inactivity="10m0s"
I0319 21:03:06.595585       1 flags.go:52] FLAG: --max-node-provision-time="15m0s"
I0319 21:03:06.595589       1 flags.go:52] FLAG: --max-nodes-total="0"
I0319 21:03:06.595592       1 flags.go:52] FLAG: --max-total-unready-percentage="45"
I0319 21:03:06.595597       1 flags.go:52] FLAG: --memory-total="0:6400000"
I0319 21:03:06.595600       1 flags.go:52] FLAG: --min-replica-count="0"
I0319 21:03:06.595604       1 flags.go:52] FLAG: --namespace="kube-system"
I0319 21:03:06.595607       1 flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
I0319 21:03:06.595611       1 flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
I0319 21:03:06.595614       1 flags.go:52] FLAG: --node-group-auto-discovery="[]"
I0319 21:03:06.595618       1 flags.go:52] FLAG: --nodes="[3:12:devminion]"
I0319 21:03:06.595622       1 flags.go:52] FLAG: --ok-total-unready-count="3"
I0319 21:03:06.595625       1 flags.go:52] FLAG: --regional="false"
I0319 21:03:06.595629       1 flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
I0319 21:03:06.595632       1 flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
I0319 21:03:06.595636       1 flags.go:52] FLAG: --scale-down-delay-after-add="10m0s"
I0319 21:03:06.595639       1 flags.go:52] FLAG: --scale-down-delay-after-delete="10s"
I0319 21:03:06.595643       1 flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
I0319 21:03:06.595646       1 flags.go:52] FLAG: --scale-down-enabled="true"
I0319 21:03:06.595650       1 flags.go:52] FLAG: --scale-down-non-empty-candidates-count="30"
I0319 21:03:06.595653       1 flags.go:52] FLAG: --scale-down-unneeded-time="10m0s"
I0319 21:03:06.595657       1 flags.go:52] FLAG: --scale-down-unready-time="20m0s"
I0319 21:03:06.595661       1 flags.go:52] FLAG: --scale-down-utilization-threshold="0.5"
I0319 21:03:06.595665       1 flags.go:52] FLAG: --scan-interval="10s"
I0319 21:03:06.595668       1 flags.go:52] FLAG: --skip-headers="false"
I0319 21:03:06.595672       1 flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
I0319 21:03:06.595675       1 flags.go:52] FLAG: --skip-nodes-with-system-pods="true"
I0319 21:03:06.595678       1 flags.go:52] FLAG: --stderrthreshold="2"
I0319 21:03:06.595682       1 flags.go:52] FLAG: --test.bench=""
I0319 21:03:06.595685       1 flags.go:52] FLAG: --test.benchmem="false"
I0319 21:03:06.595689       1 flags.go:52] FLAG: --test.benchtime="1s"
I0319 21:03:06.595692       1 flags.go:52] FLAG: --test.blockprofile=""
I0319 21:03:06.595695       1 flags.go:52] FLAG: --test.blockprofilerate="1"
I0319 21:03:06.595699       1 flags.go:52] FLAG: --test.count="1"
I0319 21:03:06.595702       1 flags.go:52] FLAG: --test.coverprofile=""
I0319 21:03:06.595705       1 flags.go:52] FLAG: --test.cpu=""
I0319 21:03:06.595709       1 flags.go:52] FLAG: --test.cpuprofile=""
I0319 21:03:06.595712       1 flags.go:52] FLAG: --test.failfast="false"
I0319 21:03:06.595715       1 flags.go:52] FLAG: --test.list=""
I0319 21:03:06.595719       1 flags.go:52] FLAG: --test.memprofile=""
I0319 21:03:06.595722       1 flags.go:52] FLAG: --test.memprofilerate="0"
I0319 21:03:06.595725       1 flags.go:52] FLAG: --test.mutexprofile=""
I0319 21:03:06.595728       1 flags.go:52] FLAG: --test.mutexprofilefraction="1"
I0319 21:03:06.595732       1 flags.go:52] FLAG: --test.outputdir=""
I0319 21:03:06.595735       1 flags.go:52] FLAG: --test.parallel="2"
I0319 21:03:06.595738       1 flags.go:52] FLAG: --test.run=""
I0319 21:03:06.595742       1 flags.go:52] FLAG: --test.short="false"
I0319 21:03:06.595783       1 flags.go:52] FLAG: --test.testlogfile=""
I0319 21:03:06.595799       1 flags.go:52] FLAG: --test.timeout="0s"
I0319 21:03:06.595803       1 flags.go:52] FLAG: --test.trace=""
I0319 21:03:06.595865       1 flags.go:52] FLAG: --test.v="false"
I0319 21:03:06.595872       1 flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
I0319 21:03:06.595876       1 flags.go:52] FLAG: --v="3"
I0319 21:03:06.595880       1 flags.go:52] FLAG: --vmodule=""
I0319 21:03:06.595896       1 flags.go:52] FLAG: --write-status-configmap="true"
I0319 21:03:06.595902       1 main.go:333] Cluster Autoscaler 1.13.2
F0319 21:03:36.597392       1 main.go:355] Failed to get nodes from apiserver: Get https://dev-pisclk8s-f30ccea9.hcp.westus2.azmk8s.io:443/api/v1/nodes: dial tcp: i/o timeout
paulgmiller commented 5 years ago

Are you sure the API server was up? We have actually seen some flakiness from AKS that looks like that, but it's usually back pretty quickly.
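
(A quick way to confirm the control plane is reachable from outside the cluster, assuming working kubeconfig credentials, is to hit the health endpoint directly:

kubectl get --raw '/healthz'
# expected output: ok

If that succeeds from a workstation while the in-cluster pod still times out, the problem is in-cluster networking rather than AKS availability.)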

mnothic commented 5 years ago

> Are you sure the API server was up? We have actually seen some flakiness from AKS that looks like that, but it's usually back pretty quickly.

Yes, our AKS clusters are working, with the HTTP application routing add-on disabled and our custom ingress and ExternalDNS working perfectly.

$ kubectl get pods --all-namespaces
NAME                                        READY   STATUS             RESTARTS   AGE
nginx-ingress-controller-56c5c48c4d-fjsps   1/1     Running            1          14d
nginx-ingress-controller-56c5c48c4d-q5qpg   1/1     Running            1          14d
coredns-754f947b4-pfvch                     1/1     Running            0          47h
coredns-754f947b4-tjtsp                     1/1     Running            0          47h
coredns-autoscaler-6fcdb7d64-ckp6r          1/1     Running            0          47h
external-dns-6c96464564-l5gr7               1/1     Running            0          14d
heapster-5fb7488d97-mpggq                   2/2     Running            0          13d
kube-proxy-976np                            1/1     Running            1          46h
kube-proxy-ffd5c                            1/1     Running            0          13d
kube-proxy-k88zr                            1/1     Running            0          13d
kube-proxy-nhgmg                            1/1     Running            0          13d
kube-proxy-vff6f                            1/1     Running            0          46h
kube-svc-redirect-49bwd                     2/2     Running            0          14d
kube-svc-redirect-bpdrl                     2/2     Running            0          14d
kube-svc-redirect-qsxlx                     2/2     Running            0          46h
kube-svc-redirect-rpktl                     2/2     Running            0          14d
kube-svc-redirect-v29zt                     2/2     Running            2          46h
kubernetes-dashboard-847bb4ddc6-gnxbm       1/1     Running            0          47h
metrics-server-7b97f9cd9-cj9dt              1/1     Running            1          47h
omsagent-4sb86                              0/1     CrashLoopBackOff   3000       13d
omsagent-hwgsk                              0/1     CrashLoopBackOff   431        46h
omsagent-j27x9                              1/1     Running            471        46h
omsagent-jh7lx                              1/1     Running            2559       13d
omsagent-n65gs                              1/1     Running            2553       13d
omsagent-rs-6c9ffdd68-p5zj2                 0/1     CrashLoopBackOff   3261       13d
tunnelfront-ffd8dc4f8-xgnmr                 1/1     Running            0          47h

NOTE: I won't worry about omsagent, that shit is always crashing :D

philipakash commented 5 years ago

Having the same problem with EKS.

mnothic commented 5 years ago

> Having the same problem with EKS.

It was the CNI: I had to change the network plugin from kubenet to Azure CNI, and it works fine now.
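
(Note: at the time, the network plugin could not be changed on an existing AKS cluster, so this fix generally meant recreating the cluster with Azure CNI. A sketch, with placeholder resource group and cluster names:

az aks create \
  --resource-group <my-resource-group> \
  --name <my-cluster> \
  --network-plugin azure

With kubenet, pods get addresses from an overlay range; with Azure CNI they get VNet IPs, which avoids the routing gap that can leave pods unable to reach the API server endpoint.)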

feiskyer commented 5 years ago

/close

k8s-ci-robot commented 5 years ago

@feiskyer: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/1807#issuecomment-497471115):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.