kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

Force running all system pods on the control plane nodes #566

Closed: paalkr closed this issue 5 years ago

paalkr commented 7 years ago

Hi

I have a running Kubernetes cluster in AWS, with:

When I'm testing auto scaling of the worker node pools, I see that some system-critical pods are running on the worker nodes. Once in a while, when a worker node is terminated by AWS, the critical pods are terminated and redeployed to another running worker node. Not a big problem, but heapster statistics, for example, will not be available for the short period of time it takes Kubernetes to restart the pod on a running node.

Any reason in particular that these pods are not run on the control plane nodes? And can I force them to run on the control plane nodes by modifying the userdata-controller-file before running kube-aws up?

NAME                                                                READY  STATUS   RESTARTS  AGE  IP           NODE
heapster-v1.3.0-76786035-9qq4g                                      2/2    Running  0         14m  10.200.50.3  ip-10-1-44-196.eu-west-1.compute.internal
kube-apiserver-ip-10-1-43-150.eu-west-1.compute.internal            1/1    Running  0         12h  10.1.43.150  ip-10-1-43-150.eu-west-1.compute.internal
kube-apiserver-ip-10-1-44-191.eu-west-1.compute.internal            1/1    Running  0         12h  10.1.44.191  ip-10-1-44-191.eu-west-1.compute.internal
kube-controller-manager-ip-10-1-43-150.eu-west-1.compute.internal   1/1    Running  0         12h  10.1.43.150  ip-10-1-43-150.eu-west-1.compute.internal
kube-controller-manager-ip-10-1-44-191.eu-west-1.compute.internal   1/1    Running  0         12h  10.1.44.191  ip-10-1-44-191.eu-west-1.compute.internal
kube-dns-3816048056-5tpmj                                           4/4    Running  0         1h   10.200.93.6  ip-10-1-43-42.eu-west-1.compute.internal
kube-dns-3816048056-bw11s                                           4/4    Running  0         2h   10.200.12.3  ip-10-1-45-239.eu-west-1.compute.internal
kube-dns-autoscaler-1464605019-k3s5k                                1/1    Running  0         14m  10.200.50.4  ip-10-1-44-196.eu-west-1.compute.internal
kube-proxy-ip-10-1-43-150.eu-west-1.compute.internal                1/1    Running  0         12h  10.1.43.150  ip-10-1-43-150.eu-west-1.compute.internal
kube-proxy-ip-10-1-43-42.eu-west-1.compute.internal                 1/1    Running  0         1h   10.1.43.42   ip-10-1-43-42.eu-west-1.compute.internal
kube-proxy-ip-10-1-44-191.eu-west-1.compute.internal                1/1    Running  0         12h  10.1.44.191  ip-10-1-44-191.eu-west-1.compute.internal
kube-proxy-ip-10-1-44-196.eu-west-1.compute.internal                1/1    Running  0         24m  10.1.44.196  ip-10-1-44-196.eu-west-1.compute.internal
kube-proxy-ip-10-1-44-236.eu-west-1.compute.internal                1/1    Running  0         6m   10.1.44.236  ip-10-1-44-236.eu-west-1.compute.internal
kube-proxy-ip-10-1-45-239.eu-west-1.compute.internal                1/1    Running  0         2h   10.1.45.239  ip-10-1-45-239.eu-west-1.compute.internal
kube-rescheduler-3155147949-0rm2p                                   1/1    Running  0         1h   10.1.43.42   ip-10-1-43-42.eu-west-1.compute.internal
kube-scheduler-ip-10-1-43-150.eu-west-1.compute.internal            1/1    Running  0         12h  10.1.43.150  ip-10-1-43-150.eu-west-1.compute.internal
kube-scheduler-ip-10-1-44-191.eu-west-1.compute.internal            1/1    Running  1         12h  10.1.44.191  ip-10-1-44-191.eu-west-1.compute.internal
kubernetes-dashboard-v1.5.1-lpnbb                                   1/1    Running  0         1h   10.200.93.4  ip-10-1-43-42.eu-west-1.compute.internal

mumoshu commented 7 years ago

@paalkr Hi, thanks for trying kube-aws!

AFAIK, these pods are scheduled to worker nodes by default in Kubernetes. I partly agree with you though - actually, I'm running tiller on controller nodes.

However, as of today, I personally believe that running pods like kube-dns on controller nodes wouldn't be a good idea. Controller nodes can't be auto-scaled easily due to the --apiserver-count param required by apiserver. If you've scaled out your worker nodes considerably, kube-dns auto-scaled by kube-dns-autoscaler would easily outgrow controller nodes. #499 is a related issue about the --apiserver-count param.
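
For context, kube-dns-autoscaler sizes kube-dns proportionally to cluster size from a ConfigMap; a minimal sketch with illustrative values (not taken from this cluster) looks like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  # replicas = max(ceil(cores/coresPerReplica), ceil(nodes/nodesPerReplica)), never below min
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"min":2}'

With settings like these, the kube-dns replica count grows with worker capacity, which is why pinning it to a small, fixed set of controller nodes scales poorly.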

mumoshu commented 7 years ago

Also, you can definitely run any "system" pod on controller nodes by adding the appropriate tolerations for the taints associated only with controller nodes.

In k8s 1.6, the toleration looks like: https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/cloud-config-controller#L713-L717

In k8s 1.5, it is in the annotations field instead.
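
For illustration, here is the same toleration in both forms, using the taint kube-aws puts on controller nodes (node.alpha.kubernetes.io/role=master:NoSchedule); this is a sketch, not an excerpt from the linked template:

# k8s 1.6+: tolerations are a first-class field in the pod spec
spec:
  tolerations:
  - key: "node.alpha.kubernetes.io/role"
    operator: "Equal"
    value: "master"
    effect: "NoSchedule"

# k8s 1.5: the same toleration expressed as a pod annotation
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/tolerations: '[{"key":"node.alpha.kubernetes.io/role","operator":"Equal","value":"master","effect":"NoSchedule"}]'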

paalkr commented 7 years ago

Hi

Thanks for the feedback. I see that scaling the DNS service beyond the number of control plane nodes can be necessary when scaling out a cluster to a certain level. My main concern is heapster, the dashboard, and the rescheduler, which by default run as only one pod each and are not scaled horizontally.

I will try to modify the controller user data, adding the toleration for heapster, and see how that works. I guess this will do the trick?

  - path: /srv/kubernetes/manifests/heapster-de.yaml
    content: |
        apiVersion: extensions/v1beta1
        kind: Deployment
        metadata:
          name: heapster-v1.3.0
          namespace: kube-system
          labels:
            k8s-app: heapster
            kubernetes.io/cluster-service: "true"
            version: v1.3.0
        spec:
          replicas: 1
          selector:
            matchLabels:
              k8s-app: heapster
              version: v1.3.0
          template:
            metadata:
              labels:
                k8s-app: heapster
                version: v1.3.0
              annotations:
                scheduler.alpha.kubernetes.io/critical-pod: ''
            spec:
              tolerations:
              - key: "CriticalAddonsOnly"
                operator: "Exists"
              - key: "node.alpha.kubernetes.io/role"
                operator: "Equal"
                value: "master"
                effect: "NoSchedule"
              containers:
                - image: gcr.io/google_containers/heapster:v1.3.0
                  name: heapster
                  livenessProbe:
                    httpGet:
                      path: /healthz
                      port: 8082
                      scheme: HTTP
                    initialDelaySeconds: 180
                    timeoutSeconds: 5
                  resources:
                    limits:
                      cpu: 80m
                      memory: 200Mi
                    requests:
                      cpu: 80m
                      memory: 200Mi
                  command:
                    - /heapster
                    - --source=kubernetes.summary_api:''
                - image: gcr.io/google_containers/addon-resizer:1.6
                  name: heapster-nanny
                  resources:
                    limits:
                      cpu: 50m
                      memory: 90Mi
                    requests:
                      cpu: 50m
                      memory: 90Mi
                  env:
                    - name: MY_POD_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.name
                    - name: MY_POD_NAMESPACE
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.namespace
                  command:
                    - /pod_nanny
                    - --cpu=80m
                    - --extra-cpu=4m
                    - --memory=200Mi
                    - --extra-memory=4Mi
                    - --threshold=5
                    - --deployment=heapster-v1.3.0
                    - --container=heapster
                    - --poll-period=300000
                    - --estimator=exponential
mumoshu commented 7 years ago

@paalkr LGTM 👍

Let me also add that you should ensure the system pods scheduled to controller nodes have:

So that a rolling-update of controller nodes won't take down your system services.

paalkr commented 7 years ago

Thanks, good point. I'll add anti-affinity to the config as well.

Will heapster, dashboard and rescheduler run nicely when scaled to two pods each, or will they make trouble for each other?

cknowles commented 7 years ago

When I originally added the rescheduler it didn't play nicely with more than one replica. That may be fixed now that we've unblocked port 443 between controllers. I can align it with the scheduler as part of the existing rescheduler issue.

paalkr commented 7 years ago

Thanks

@mumoshu , even without anti-affinity and two replicas, a rolling update of the controller nodes won't be any worse for the system than the current situation? Like if the worker node that happens to run heapster, the rescheduler or the dashboard is terminated by AWS, Kubernetes will just move the pods to the controller node that isn't being updated, right?

Will this anti-affinity and scaling setting work for heapster?

EDIT: fixed a typo and added the missing spec.affinity. EDIT2: moved affinity into the correct location, spec.template.spec.affinity.

  - path: /srv/kubernetes/manifests/heapster-de.yaml
    content: |
        apiVersion: extensions/v1beta1
        kind: Deployment
        metadata:
          name: heapster-v1.3.0
          namespace: kube-system
          labels:
            k8s-app: heapster
            kubernetes.io/cluster-service: "true"
            version: v1.3.0
        spec:
          replicas: 2
          selector:
            matchLabels:
              k8s-app: heapster
              version: v1.3.0
          template:
            metadata:
              labels:
                k8s-app: heapster
                version: v1.3.0
              annotations:
                scheduler.alpha.kubernetes.io/critical-pod: ''
            spec:
              tolerations:
              - key: "CriticalAddonsOnly"
                operator: "Exists"
              - key: "node.alpha.kubernetes.io/role"
                operator: "Equal"
                value: "master"
                effect: "NoSchedule"
              affinity:
                podAntiAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchExpressions:
                      - key: k8s-app
                        operator: In
                        values:
                        - heapster
                    topologyKey: kubernetes.io/hostname                 
              containers:
                - image: gcr.io/google_containers/heapster:v1.3.0
                  name: heapster
                  livenessProbe:
                    httpGet:
                      path: /healthz
                      port: 8082
                      scheme: HTTP
                    initialDelaySeconds: 180
                    timeoutSeconds: 5
                  resources:
                    limits:
                      cpu: 80m
                      memory: 200Mi
                    requests:
                      cpu: 80m
                      memory: 200Mi
                  command:
                    - /heapster
                    - --source=kubernetes.summary_api:''
                - image: gcr.io/google_containers/addon-resizer:1.6
                  name: heapster-nanny
                  resources:
                    limits:
                      cpu: 50m
                      memory: 90Mi
                    requests:
                      cpu: 50m
                      memory: 90Mi
                  env:
                    - name: MY_POD_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.name
                    - name: MY_POD_NAMESPACE
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.namespace
                  command:
                    - /pod_nanny
                    - --cpu=80m
                    - --extra-cpu=4m
                    - --memory=200Mi
                    - --extra-memory=4Mi
                    - --threshold=5
                    - --deployment=heapster-v1.3.0
                    - --container=heapster
                    - --poll-period=300000
                    - --estimator=exponential
mumoshu commented 7 years ago

@paalkr Sorry, my explanation wasn't complete.

paalkr commented 7 years ago

I see. I tried to just do an in place update of the heapster deployment by

kubectl replace -n kube-system -f heapster-deployment.yaml

But the heapster pods (now two of them) still run on the worker nodes

content of heapster-deployment.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: heapster-v1.3.0
  namespace: kube-system
  labels:
    k8s-app: heapster
    kubernetes.io/cluster-service: "true"
    version: v1.3.0
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: heapster
      version: v1.3.0
  template:
    metadata:
      labels:
        k8s-app: heapster
        version: v1.3.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - key: "node.alpha.kubernetes.io/role"
        operator: "Equal"
        value: "master"
        effect: "NoSchedule"
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - heapster
            topologyKey: kubernetes.io/hostname            
      containers:
        - image: gcr.io/google_containers/heapster:v1.3.0
          name: heapster
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8082
              scheme: HTTP
            initialDelaySeconds: 180
            timeoutSeconds: 5
          resources:
            limits:
              cpu: 80m
              memory: 200Mi
            requests:
              cpu: 80m
              memory: 200Mi
          command:
            - /heapster
            - --source=kubernetes.summary_api:''
        - image: gcr.io/google_containers/addon-resizer:1.6
          name: heapster-nanny
          resources:
            limits:
              cpu: 50m
              memory: 90Mi
            requests:
              cpu: 50m
              memory: 90Mi
          env:
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: MY_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          command:
            - /pod_nanny
            - --cpu=80m
            - --extra-cpu=4m
            - --memory=200Mi
            - --extra-memory=4Mi
            - --threshold=5
            - --deployment=heapster-v1.3.0
            - --container=heapster
            - --poll-period=300000
            - --estimator=exponential
paalkr commented 7 years ago

Sorry, didn't mean to close the issue ;) I just hit the wrong comment button.

@mumoshu , adding the proper toleration to the heapster pod spec will only allow the pod to run on the controller, but not force the pod to run only on the controller. For this I have to add a nodeAffinity to attract the pod to any of the controller nodes.

paalkr commented 7 years ago

I really can't get this right...

I thought I had created the correct affinity and tolerations now, but the heapster pods always land on the worker nodes. How can I debug this further?
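
A couple of standard kubectl commands that show why the scheduler made its choice (nothing kube-aws specific):

# Scheduling events (FailedScheduling reasons, selected node) for the heapster pods
kubectl -n kube-system describe pods -l k8s-app=heapster

# Confirm which labels and taints the nodes actually carry
kubectl get nodes --show-labels
kubectl describe nodes | grep -A1 Taints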

My heapster deployment definition

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: heapster-v1.3.0
  namespace: kube-system
  labels:
    k8s-app: heapster
    kubernetes.io/cluster-service: "true"
    version: v1.3.0
spec:
  replicas: 2
  # selector:
    # matchLabels:
      # k8s-app: heapster
      # version: v1.3.0
  template:
    metadata:
      labels:
        k8s-app: heapster
        version: v1.3.0
      # annotations:
        # scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - key: "node.alpha.kubernetes.io/role"
        operator: "Equal"
        value: "master"
        effect: "NoSchedule"
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - heapster
            topologyKey: kubernetes.io/hostname 
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kube-aws.coreos.com/role
                operator: NotIn
                values:
                - worker
      containers:
        - image: gcr.io/google_containers/heapster:v1.3.0
          name: heapster
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8082
              scheme: HTTP
            initialDelaySeconds: 180
            timeoutSeconds: 5
          resources:
            limits:
              cpu: 80m
              memory: 200Mi
            requests:
              cpu: 80m
              memory: 200Mi
          command:
            - /heapster
            - --source=kubernetes.summary_api:''
        - image: gcr.io/google_containers/addon-resizer:1.6
          name: heapster-nanny
          resources:
            limits:
              cpu: 50m
              memory: 90Mi
            requests:
              cpu: 50m
              memory: 90Mi
          env:
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: MY_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          command:
            - /pod_nanny
            - --cpu=80m
            - --extra-cpu=4m
            - --memory=200Mi
            - --extra-memory=4Mi
            - --threshold=5
            - --deployment=heapster-v1.3.0
            - --container=heapster
            - --poll-period=300000
            - --estimator=exponential

A controller node spec

- apiVersion: v1
  kind: Node
  metadata:
    annotations:
      node.alpha.kubernetes.io/ttl: "0"
      volumes.kubernetes.io/controller-managed-attach-detach: "true"
    creationTimestamp: 2017-04-19T22:36:35Z
    labels:
      beta.kubernetes.io/arch: amd64
      beta.kubernetes.io/instance-type: t2.medium
      beta.kubernetes.io/os: linux
      failure-domain.beta.kubernetes.io/region: eu-west-1
      failure-domain.beta.kubernetes.io/zone: eu-west-1b
      kubernetes.io/hostname: ip-10-1-44-191.eu-west-1.compute.internal
    name: ip-10-1-44-191.eu-west-1.compute.internal
    namespace: ""
    resourceVersion: "126535"
    selfLink: /api/v1/nodesip-10-1-44-191.eu-west-1.compute.internal
    uid: a4cdc10a-2550-11e7-a0cd-02d5584ffffb
  spec:
    externalID: i-02756bb5346d3299d
    providerID: aws:///eu-west-1b/i-02756bb5346d3299d
    taints:
    - effect: NoSchedule
      key: node.alpha.kubernetes.io/role
      timeAdded: null
      value: master
  status:
    addresses:
    - address: 10.1.44.191
      type: InternalIP
    - address: 10.1.44.191
      type: LegacyHostIP
    - address: ip-10-1-44-191.eu-west-1.compute.internal
      type: InternalDNS
    - address: ip-10-1-44-191.eu-west-1.compute.internal
      type: Hostname
    allocatable:
      cpu: "2"
      memory: 3947136Ki
      pods: "110"
    capacity:
      cpu: "2"
      memory: 4049536Ki
      pods: "110"
    conditions:
    - lastHeartbeatTime: 2017-04-20T18:10:11Z
      lastTransitionTime: 2017-04-19T22:36:35Z
      message: kubelet has sufficient disk space available
      reason: KubeletHasSufficientDisk
      status: "False"
      type: OutOfDisk
    - lastHeartbeatTime: 2017-04-20T18:10:11Z
      lastTransitionTime: 2017-04-19T22:36:35Z
      message: kubelet has sufficient memory available
      reason: KubeletHasSufficientMemory
      status: "False"
      type: MemoryPressure
    - lastHeartbeatTime: 2017-04-20T18:10:11Z
      lastTransitionTime: 2017-04-19T22:36:35Z
      message: kubelet has no disk pressure
      reason: KubeletHasNoDiskPressure
      status: "False"
      type: DiskPressure
    - lastHeartbeatTime: 2017-04-20T18:10:11Z
      lastTransitionTime: 2017-04-19T22:36:35Z
      message: kubelet is posting ready status
      reason: KubeletReady
      status: "True"
      type: Ready
    daemonEndpoints:
      kubeletEndpoint:
        Port: 10250
    images:
    - names:
      - quay.io/coreos/hyperkube@sha256:1c8b4487be52a6df7668135d88b4c375aeeda4d934e34dbf5a8191c96161a8f5
      - quay.io/coreos/hyperkube:v1.6.1_coreos.0
      sizeBytes: 664861472
    - names:
      - gcr.io/google_containers/heapster@sha256:3dff9b2425a196aa51df0cebde0f8b427388425ba84568721acf416fa003cd5c
      - gcr.io/google_containers/heapster:v1.3.0
      sizeBytes: 68105973
    - names:
      - gcr.io/google_containers/addon-resizer@sha256:ba506f5f21356331d92141ee48fc4945fd467ec6010364ae970342de5477272c
      - gcr.io/google_containers/addon-resizer:1.6
      sizeBytes: 48784610
    - names:
      - gcr.io/google_containers/pause-amd64@sha256:163ac025575b775d1c0f9bf0bdd0f086883171eb475b5068e7defa4ca9e76516
      - gcr.io/google_containers/pause-amd64:3.0
      sizeBytes: 746888
    nodeInfo:
      architecture: amd64
      bootID: e66ecfc6-6231-41c0-9f5a-320feda7f400
      containerRuntimeVersion: docker://1.12.6
      kernelVersion: 4.9.16-coreos-r1
      kubeProxyVersion: v1.6.1+coreos.0
      kubeletVersion: v1.6.1+coreos.0
      machineID: 8e025a21a4254e11b028584d9d8b12c4
      operatingSystem: linux
      osImage: Container Linux by CoreOS 1298.7.0 (Ladybug)
      systemUUID: EC238E36-080F-BAFB-608E-8C11B6B2F37E
- apiVersion: v1
  kind: Node
  metadata:
    annotations:
      kube-aws.coreos.com/securitygroups: k8sprod-prerequisites-WorkerSecurityGroup-6O3GXX71Z193,k8sprod-Controlplane-1RJHQ7DSHBPTR-SecurityGroupWorker-FAESBHBU1F3T
      node.alpha.kubernetes.io/ttl: "0"
      volumes.kubernetes.io/controller-managed-attach-detach: "true"
    creationTimestamp: 2017-04-20T10:32:34Z
    labels:
      beta.kubernetes.io/arch: amd64
      beta.kubernetes.io/instance-type: t2.large
      beta.kubernetes.io/os: linux
      failure-domain.beta.kubernetes.io/region: eu-west-1
      failure-domain.beta.kubernetes.io/zone: eu-west-1b
      kube-aws.coreos.com/autoscalinggroup: k8sprod-T2largeB-1ANR4S4CPBPZ7-Workers-11MVW8HICN4JB
      kube-aws.coreos.com/launchconfiguration: k8sprod-T2largeB-1ANR4S4CPBPZ7-WorkersLC-169AP7153B6DN
      kube-aws.coreos.com/role: worker
      kubernetes.io/hostname: ip-10-1-44-236.eu-west-1.compute.internal
    name: ip-10-1-44-236.eu-west-1.compute.internal
    namespace: ""
    resourceVersion: "126539"
    selfLink: /api/v1/nodesip-10-1-44-236.eu-west-1.compute.internal
    uid: aaa61ee8-25b4-11e7-9b85-06433a3e9fe9
  spec:
    externalID: i-0c166cdaa4e7002e7
    providerID: aws:///eu-west-1b/i-0c166cdaa4e7002e7
  status:
    addresses:
    - address: 10.1.44.236
      type: InternalIP
    - address: 10.1.44.236
      type: LegacyHostIP
    - address: ip-10-1-44-236.eu-west-1.compute.internal
      type: InternalDNS
    - address: ip-10-1-44-236.eu-west-1.compute.internal
      type: Hostname
    allocatable:
      cpu: "2"
      memory: 8075900Ki
      pods: "110"
    capacity:
      cpu: "2"
      memory: 8178300Ki
      pods: "110"
    conditions:
    - lastHeartbeatTime: 2017-04-20T18:10:13Z
      lastTransitionTime: 2017-04-20T10:32:34Z
      message: kubelet has sufficient disk space available
      reason: KubeletHasSufficientDisk
      status: "False"
      type: OutOfDisk
    - lastHeartbeatTime: 2017-04-20T18:10:13Z
      lastTransitionTime: 2017-04-20T10:32:34Z
      message: kubelet has sufficient memory available
      reason: KubeletHasSufficientMemory
      status: "False"
      type: MemoryPressure
    - lastHeartbeatTime: 2017-04-20T18:10:13Z
      lastTransitionTime: 2017-04-20T10:32:34Z
      message: kubelet has no disk pressure
      reason: KubeletHasNoDiskPressure
      status: "False"
      type: DiskPressure
    - lastHeartbeatTime: 2017-04-20T18:10:13Z
      lastTransitionTime: 2017-04-20T10:32:44Z
      message: kubelet is posting ready status
      reason: KubeletReady
      status: "True"
      type: Ready
    daemonEndpoints:
      kubeletEndpoint:
        Port: 10250
    images:
    - names:
      - xxx.dkr.ecr.eu-west-1.amazonaws.com/ags/105@sha256:20f391bc99458c7bf926f2ecee5bda3db34f0781e45f2969730bd3bacf74cad2
      - 893008332793.dkr.ecr.eu-west-1.amazonaws.com/ags/105:GeomapAdmin_1.0.1
      sizeBytes: 7265879019
    - names:
      - quay.io/coreos/hyperkube@sha256:1c8b4487be52a6df7668135d88b4c375aeeda4d934e34dbf5a8191c96161a8f5
      - quay.io/coreos/hyperkube:v1.6.1_coreos.0
      sizeBytes: 664861472
    - names:
      - gcr.io/google_containers/echoserver@sha256:5d99aa1120524c801bc8c1a7077e8f5ec122ba16b6dda1a5d3826057f67b9bcb
      - gcr.io/google_containers/echoserver:1.4
      sizeBytes: 140366210
    - names:
      - quay.io/coreos/awscli@sha256:712772e2329b24c203462a72f967a330621d2024b5a5a3545b0bb46dc12efd16
      - quay.io/coreos/awscli:master
      sizeBytes: 97498295
    - names:
      - gcr.io/google_containers/heapster@sha256:3dff9b2425a196aa51df0cebde0f8b427388425ba84568721acf416fa003cd5c
      - gcr.io/google_containers/heapster:v1.3.0
      sizeBytes: 68105973
    - names:
      - gcr.io/google_containers/addon-resizer@sha256:ba506f5f21356331d92141ee48fc4945fd467ec6010364ae970342de5477272c
      - gcr.io/google_containers/addon-resizer:1.6
      sizeBytes: 48784610
    - names:
      - gcr.io/google_containers/cluster-proportional-autoscaler-amd64@sha256:5a3bdd25a5b0f7f8f285e8ff8f4402cf86ddfdfa537e9f053c77c5f043821f70
      - gcr.io/google_containers/cluster-proportional-autoscaler-amd64:1.0.0
      sizeBytes: 48155586
    - names:
      - gcr.io/google_containers/defaultbackend@sha256:ee3aa1187023d0197e3277833f19d9ef7df26cee805fef32663e06c7412239f9
      - gcr.io/google_containers/defaultbackend:1.0
      sizeBytes: 7510068
    - names:
      - gcr.io/google_containers/pause-amd64@sha256:163ac025575b775d1c0f9bf0bdd0f086883171eb475b5068e7defa4ca9e76516
      - gcr.io/google_containers/pause-amd64:3.0
      sizeBytes: 746888
    nodeInfo:
      architecture: amd64
      bootID: ab00ea02-217d-47b4-8f78-96f98a937717
      containerRuntimeVersion: docker://1.12.6
      kernelVersion: 4.9.16-coreos-r1
      kubeProxyVersion: v1.6.1+coreos.0
      kubeletVersion: v1.6.1+coreos.0
      machineID: 8e025a21a4254e11b028584d9d8b12c4
      operatingSystem: linux
      osImage: Container Linux by CoreOS 1298.7.0 (Ladybug)
      systemUUID: EC21F8F9-0AE4-8B0C-425B-5C910B0A5CBB

and a worker node spec

- apiVersion: v1
  kind: Node
  metadata:
    annotations:
      kube-aws.coreos.com/securitygroups: k8sprod-prerequisites-WorkerSecurityGroup-6O3GXX71Z193,k8sprod-Controlplane-1RJHQ7DSHBPTR-SecurityGroupWorker-FAESBHBU1F3T
      node.alpha.kubernetes.io/ttl: "0"
      volumes.kubernetes.io/controller-managed-attach-detach: "true"
    creationTimestamp: 2017-04-20T08:21:16Z
    labels:
      beta.kubernetes.io/arch: amd64
      beta.kubernetes.io/instance-type: t2.large
      beta.kubernetes.io/os: linux
      failure-domain.beta.kubernetes.io/region: eu-west-1
      failure-domain.beta.kubernetes.io/zone: eu-west-1c
      kube-aws.coreos.com/autoscalinggroup: k8sprod-T2largeC-17Z2CTVC1TJGE-Workers-4H8RK5JLPXND
      kube-aws.coreos.com/launchconfiguration: k8sprod-T2largeC-17Z2CTVC1TJGE-WorkersLC-1F7XCJ5EQBU7X
      kube-aws.coreos.com/role: worker
      kubernetes.io/hostname: ip-10-1-45-239.eu-west-1.compute.internal
    name: ip-10-1-45-239.eu-west-1.compute.internal
    namespace: ""
    resourceVersion: "126528"
    selfLink: /api/v1/nodesip-10-1-45-239.eu-west-1.compute.internal
    uid: 531103c6-25a2-11e7-a0cd-02d5584ffffb
  spec:
    externalID: i-0aa0e5fad1ec73293
    providerID: aws:///eu-west-1c/i-0aa0e5fad1ec73293
  status:
    addresses:
    - address: 10.1.45.239
      type: InternalIP
    - address: 10.1.45.239
      type: LegacyHostIP
    - address: ip-10-1-45-239.eu-west-1.compute.internal
      type: InternalDNS
    - address: ip-10-1-45-239.eu-west-1.compute.internal
      type: Hostname
    allocatable:
      cpu: "2"
      memory: 8075900Ki
      pods: "110"
    capacity:
      cpu: "2"
      memory: 8178300Ki
      pods: "110"
    conditions:
    - lastHeartbeatTime: 2017-04-20T18:10:06Z
      lastTransitionTime: 2017-04-20T08:21:16Z
      message: kubelet has sufficient disk space available
      reason: KubeletHasSufficientDisk
      status: "False"
      type: OutOfDisk
    - lastHeartbeatTime: 2017-04-20T18:10:06Z
      lastTransitionTime: 2017-04-20T08:21:16Z
      message: kubelet has sufficient memory available
      reason: KubeletHasSufficientMemory
      status: "False"
      type: MemoryPressure
    - lastHeartbeatTime: 2017-04-20T18:10:06Z
      lastTransitionTime: 2017-04-20T08:21:16Z
      message: kubelet has no disk pressure
      reason: KubeletHasNoDiskPressure
      status: "False"
      type: DiskPressure
    - lastHeartbeatTime: 2017-04-20T18:10:06Z
      lastTransitionTime: 2017-04-20T08:21:26Z
      message: kubelet is posting ready status
      reason: KubeletReady
      status: "True"
      type: Ready
    daemonEndpoints:
      kubeletEndpoint:
        Port: 10250
    images:
    - names:
      - xxx.dkr.ecr.eu-west-1.amazonaws.com/ags/105@sha256:20f391bc99458c7bf926f2ecee5bda3db34f0781e45f2969730bd3bacf74cad2
      - 893008332793.dkr.ecr.eu-west-1.amazonaws.com/ags/105:GeomapAdmin_1.0.1
      sizeBytes: 7265879019
    - names:
      - quay.io/coreos/hyperkube@sha256:1c8b4487be52a6df7668135d88b4c375aeeda4d934e34dbf5a8191c96161a8f5
      - quay.io/coreos/hyperkube:v1.6.1_coreos.0
      sizeBytes: 664861472
    - names:
      - gcr.io/google_containers/nginx-ingress-controller@sha256:995427304f514ac1b70b2c74ee3c6d4d4ea687fb2dc63a1816be15e41cf0e063
      - gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.3
      sizeBytes: 121204435
    - names:
      - quay.io/coreos/awscli@sha256:712772e2329b24c203462a72f967a330621d2024b5a5a3545b0bb46dc12efd16
      - quay.io/coreos/awscli:master
      sizeBytes: 97498295
    - names:
      - gcr.io/google_containers/heapster@sha256:3dff9b2425a196aa51df0cebde0f8b427388425ba84568721acf416fa003cd5c
      - gcr.io/google_containers/heapster:v1.3.0
      sizeBytes: 68105973
    - names:
      - gcr.io/google_containers/addon-resizer@sha256:ba506f5f21356331d92141ee48fc4945fd467ec6010364ae970342de5477272c
      - gcr.io/google_containers/addon-resizer:1.6
      sizeBytes: 48784610
    - names:
      - gcr.io/google_containers/kubedns-amd64@sha256:3d3d67f519300af646e00adcf860b2f380d35ed4364e550d74002dadace20ead
      - gcr.io/google_containers/kubedns-amd64:1.9
      sizeBytes: 46998769
    - names:
      - gcr.io/google_containers/dnsmasq-metrics-amd64@sha256:4063e37fd9b2fd91b7cc5392ed32b30b9c8162c4c7ad2787624306fc133e80a9
      - gcr.io/google_containers/dnsmasq-metrics-amd64:1.0
      sizeBytes: 13998769
    - names:
      - gcr.io/google_containers/exechealthz-amd64@sha256:503e158c3f65ed7399f54010571c7c977ade7fe59010695f48d9650d83488c0a
      - gcr.io/google_containers/exechealthz-amd64:1.2
      sizeBytes: 8374840
    - names:
      - gcr.io/google_containers/defaultbackend@sha256:ee3aa1187023d0197e3277833f19d9ef7df26cee805fef32663e06c7412239f9
      - gcr.io/google_containers/defaultbackend:1.0
      sizeBytes: 7510068
    - names:
      - gcr.io/google_containers/kube-dnsmasq-amd64@sha256:a722df15c0cf87779aad8ba2468cf072dd208cb5d7cfcaedd90e66b3da9ea9d2
      - gcr.io/google_containers/kube-dnsmasq-amd64:1.4
      sizeBytes: 5126001
    - names:
      - gcr.io/google_containers/pause-amd64@sha256:163ac025575b775d1c0f9bf0bdd0f086883171eb475b5068e7defa4ca9e76516
      - gcr.io/google_containers/pause-amd64:3.0
      sizeBytes: 746888
    nodeInfo:
      architecture: amd64
      bootID: f816b350-c0f5-40b1-a836-e011c02b2f78
      containerRuntimeVersion: docker://1.12.6
      kernelVersion: 4.9.16-coreos-r1
      kubeProxyVersion: v1.6.1+coreos.0
      kubeletVersion: v1.6.1+coreos.0
      machineID: 8e025a21a4254e11b028584d9d8b12c4
      operatingSystem: linux
      osImage: Container Linux by CoreOS 1298.7.0 (Ladybug)
      systemUUID: EC2905EE-B013-15FA-984A-C828014882A1
paalkr commented 7 years ago

Hmm, sorry for spamming with questions and comments

To me it seems like my deployment is ignoring all my tolerations and affinity configuration. I tried to use nodeAffinity to force heapster to run on a specific worker, but a random worker is still picked by the scheduler

        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # - key: kube-aws.coreos.com/role
                # operator: NotIn
                # values:
                # - worker
              - key: kubernetes.io/hostname
                operator: In
                values:
                - ip-10-1-45-239.eu-west-1.compute.internal  

Using the good old nodeSelector does work though, and heapster ends up on the specified node.

      nodeSelector:
        kubernetes.io/hostname: ip-10-1-45-239.eu-west-1.compute.internalal 

But if I try to use nodeSelector to force the heapster pod onto a controller node, the scheduler complains about the taint not being accepted, even though I have added these tolerations to my deployment file

      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: 'node.alpha.kubernetes.io/role'
        operator: Equal
        value: master
        effect: NoSchedule
mumoshu commented 7 years ago

Quick comment after looking at your example YAML above - shouldn't nodeAffinity be in the pod spec rather than the deployment spec? (Does it even pass validation when placed in the deployment spec?)

paalkr commented 7 years ago

@mumoshu , my first example had this wrong, and it correctly did not pass validation. But I thought I got this right in the latest configuration I posted. I'll put it here again just for reference.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: heapster-v1.3.0
  namespace: kube-system
  labels:
    k8s-app: heapster
    kubernetes.io/cluster-service: "true"
    version: v1.3.0
spec:
  replicas: 2
  # selector:
    # matchLabels:
      # k8s-app: heapster
      # version: v1.3.0
  template:
    metadata:
      labels:
        k8s-app: heapster
        version: v1.3.0
      # annotations:
        # scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - key: "node.alpha.kubernetes.io/role"
        operator: "Equal"
        value: "master"
        effect: "NoSchedule"
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - heapster
            topologyKey: kubernetes.io/hostname 
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kube-aws.coreos.com/role
                operator: NotIn
                values:
                - worker
......
paalkr commented 7 years ago

@mumoshu , Any suggestions why the tolerations and affinity settings are not respected?

mumoshu commented 7 years ago

@paalkr AFAICS, kube-aws worker nodes aren't labeled with kube-aws.coreos.com/role by default. Could you ensure that they're explicitly labeled via the appropriate config in cluster.yaml, like:

worker:
  nodePools:
  - name: pool1
    nodeLabels:
      kube-aws.coreos.com/role: worker

?

paalkr commented 7 years ago

Hi

My cluster.yaml file is modified to properly label the workers, according to your suggestion, so the labels are in place. But the nodeAffinity still doesn't function as expected; the pods are attracted to the workers, not the controllers.

mumoshu commented 7 years ago

@paalkr Hi! Could you share the output of kubectl describe node <one of your worker nodes> and kubectl describe node <one of your controller nodes>?

paalkr commented 7 years ago

@mumoshu , thanks for taking the time to help out. Highly appreciated!

Please find the output below

Controller:

Name:                   ip-10-1-43-150.eu-west-1.compute.internal
Role:
Labels:                 beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/instance-type=t2.medium
                        beta.kubernetes.io/os=linux
                        failure-domain.beta.kubernetes.io/region=eu-west-1
                        failure-domain.beta.kubernetes.io/zone=eu-west-1a
                        kubernetes.io/hostname=ip-10-1-43-150.eu-west-1.compute.internal
Annotations:            node.alpha.kubernetes.io/ttl=0
                        volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:                 node.alpha.kubernetes.io/role=master:NoSchedule
CreationTimestamp:      Thu, 20 Apr 2017 00:36:57 +0200
Phase:
Conditions:
  Type                  Status  LastHeartbeatTime                       LastTransitionTime                      Reason                          Message
  ----                  ------  -----------------                       ------------------                      ------                          -------
  OutOfDisk             False   Tue, 25 Apr 2017 08:45:20 +0200         Thu, 20 Apr 2017 00:36:57 +0200         KubeletHasSufficientDisk        kubelet has sufficient disk space available
  MemoryPressure        False   Tue, 25 Apr 2017 08:45:20 +0200         Thu, 20 Apr 2017 00:36:57 +0200         KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure          False   Tue, 25 Apr 2017 08:45:20 +0200         Thu, 20 Apr 2017 00:36:57 +0200         KubeletHasNoDiskPressure        kubelet has no disk pressure
  Ready                 True    Tue, 25 Apr 2017 08:45:20 +0200         Thu, 20 Apr 2017 00:36:57 +0200         KubeletReady                    kubelet is posting ready status
Addresses:              10.1.43.150,10.1.43.150,ip-10-1-43-150.eu-west-1.compute.internal,ip-10-1-43-150.eu-west-1.compute.internal
Capacity:
 cpu:           2
 memory:        4049536Ki
 pods:          110
Allocatable:
 cpu:           2
 memory:        3947136Ki
 pods:          110
System Info:
 Machine ID:                    8e025a21a4254e11b028584d9d8b12c4
 System UUID:                   EC235DB6-B862-7550-8FF8-BB7847BEBAF8
 Boot ID:                       d739a8fb-6f0b-4e1d-baa2-e2a7db373e22
 Kernel Version:                4.9.16-coreos-r1
 OS Image:                      Container Linux by CoreOS 1298.7.0 (Ladybug)
 Operating System:              linux
 Architecture:                  amd64
 Container Runtime Version:     docker://1.12.6
 Kubelet Version:               v1.6.1+coreos.0
 Kube-Proxy Version:            v1.6.1+coreos.0
ExternalID:                     i-0c48af722c785f04b
Non-terminated Pods:            (4 in total)
  Namespace                     Name                                                                            CPU Requests    CPU Limits      Memory Requests Memory Limits
  ---------                     ----                                                                            ------------    ----------      --------------- -------------
  kube-system                   kube-apiserver-ip-10-1-43-150.eu-west-1.compute.internal                        0 (0%)          0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-controller-manager-ip-10-1-43-150.eu-west-1.compute.internal               200m (10%)      0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-proxy-ip-10-1-43-150.eu-west-1.compute.internal                            0 (0%)          0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-scheduler-ip-10-1-43-150.eu-west-1.compute.internal                        100m (5%)       0 (0%)          0 (0%)          0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits      Memory Requests Memory Limits
  ------------  ----------      --------------- -------------
  300m (15%)    0 (0%)          0 (0%)          0 (0%)
Events:         <none>

Worker:

Name:                   ip-10-1-43-42.eu-west-1.compute.internal
Role:
Labels:                 beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/instance-type=t2.large
                        beta.kubernetes.io/os=linux
                        failure-domain.beta.kubernetes.io/region=eu-west-1
                        failure-domain.beta.kubernetes.io/zone=eu-west-1a
                        kube-aws.coreos.com/autoscalinggroup=k8sprod-T2largeA-JQVZA9S334BT-Workers-VD3FAJXLV81K
                        kube-aws.coreos.com/launchconfiguration=k8sprod-T2largeA-JQVZA9S334BT-WorkersLC-41PSQQY31I4Y
                        kube-aws.coreos.com/role=worker
                        kubernetes.io/hostname=ip-10-1-43-42.eu-west-1.compute.internal
Annotations:            kube-aws.coreos.com/securitygroups=k8sprod-prerequisites-WorkerSecurityGroup-6O3GXX71Z193,k8sprod-Controlplane-1RJHQ7DSHBPTR-SecurityGroupWorker-FAESBHBU1F3T
                        node.alpha.kubernetes.io/ttl=0
                        volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:                 <none>
CreationTimestamp:      Thu, 20 Apr 2017 11:18:23 +0200
Phase:
Conditions:
  Type                  Status  LastHeartbeatTime                       LastTransitionTime                      Reason                          Message
  ----                  ------  -----------------                       ------------------                      ------                          -------
  OutOfDisk             False   Tue, 25 Apr 2017 08:47:35 +0200         Sat, 22 Apr 2017 06:31:01 +0200         KubeletHasSufficientDisk        kubelet has sufficient disk space available
  MemoryPressure        False   Tue, 25 Apr 2017 08:47:35 +0200         Sat, 22 Apr 2017 06:31:01 +0200         KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure          False   Tue, 25 Apr 2017 08:47:35 +0200         Sat, 22 Apr 2017 06:31:01 +0200         KubeletHasNoDiskPressure        kubelet has no disk pressure
  Ready                 True    Tue, 25 Apr 2017 08:47:35 +0200         Sat, 22 Apr 2017 06:31:01 +0200         KubeletReady                    kubelet is posting ready status
Addresses:              10.1.43.42,10.1.43.42,ip-10-1-43-42.eu-west-1.compute.internal,ip-10-1-43-42.eu-west-1.compute.internal
Capacity:
 cpu:           2
 memory:        8178300Ki
 pods:          110
Allocatable:
 cpu:           2
 memory:        8075900Ki
 pods:          110
System Info:
 Machine ID:                    8e025a21a4254e11b028584d9d8b12c4
 System UUID:                   EC200572-41D5-2244-3240-0FE68293A76F
 Boot ID:                       2d7db8bd-2f04-472b-b46d-a2c0d5f68c5b
 Kernel Version:                4.9.16-coreos-r1
 OS Image:                      Container Linux by CoreOS 1298.7.0 (Ladybug)
 Operating System:              linux
 Architecture:                  amd64
 Container Runtime Version:     docker://1.12.6
 Kubelet Version:               v1.6.1+coreos.0
 Kube-Proxy Version:            v1.6.1+coreos.0
ExternalID:                     i-024daef28f2c111dc
Non-terminated Pods:            (8 in total)
  Namespace                     Name                                                            CPU Requests    CPU Limits      Memory Requests Memory Limits
  ---------                     ----                                                            ------------    ----------      --------------- -------------
  default                       geomapadmin-468890941-xz00f                                     100m (5%)       500m (25%)      350Mi (4%)      0 (0%)
  default                       ingress-nginx-2346665006-0463j                                  50m (2%)        0 (0%)          0 (0%)          0 (0%)
  default                       nginx-default-backend-2003809344-2184w                          10m (0%)        100m (5%)       20Mi (0%)       50Mi (0%)
  kube-system                   heapster-v1.3.0-504010935-3jvsb                                 194m (9%)       194m (9%)       354Mi (4%)      354Mi (4%)
  kube-system                   kube-dns-3816048056-5tpmj                                       260m (13%)      0 (0%)          140Mi (1%)      220Mi (2%)
  kube-system                   kube-proxy-ip-10-1-43-42.eu-west-1.compute.internal             0 (0%)          0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-rescheduler-3155147949-0rm2p                               10m (0%)        0 (0%)          100Mi (1%)      0 (0%)
  kube-system                   kubernetes-dashboard-v1.5.1-lpnbb                               100m (5%)       100m (5%)       50Mi (0%)       50Mi (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits      Memory Requests Memory Limits
  ------------  ----------      --------------- -------------
  724m (36%)    894m (44%)      1014Mi (12%)    674Mi (8%)
Events:         <none>
mumoshu commented 7 years ago

@paalkr Sorry for being late in replying, and thanks for the info! Your configuration seems good. Then, have you by any chance modified the node labels after the pods had been scheduled? Could you kubectl delete those problematic pods anyway and then observe whether they still get re-scheduled to the worker nodes? If deleting the pods doesn't fix your case, could you try labeling the controller nodes and giving the pods node affinities to controller nodes, instead of node anti-affinities to worker nodes?
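
For example (standard kubectl; the node name is one of the controllers listed earlier in this thread):

# Force the deployment to reschedule its pods
kubectl -n kube-system delete pods -l k8s-app=heapster

# Label a controller node explicitly, then target that label with nodeAffinity
kubectl label node ip-10-1-43-150.eu-west-1.compute.internal kube-aws.coreos.com/role=controller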

mumoshu commented 7 years ago

FYI, experimental.nodeLabels can be used to label only controller nodes.
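
In cluster.yaml that would look roughly like the following (the exact placement of the experimental block may differ between kube-aws versions, so treat this as a sketch):

experimental:
  nodeLabels:
    kube-aws.coreos.com/role: controller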

paalkr commented 7 years ago

Thanks @mumoshu . Yes, I noticed the experimental.nodeLabels feature and applied it while updating kube-aws to the latest RC. I also deleted the out-of-the-box heapster deployment (and with it all heapster pods) and redeployed heapster. But still no luck.

Output from one of the controller nodes

Name:                   ip-10-1-45-70.eu-west-1.compute.internal
Role:
Labels:                 beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/instance-type=t2.medium
                        beta.kubernetes.io/os=linux
                        failure-domain.beta.kubernetes.io/region=eu-west-1
                        failure-domain.beta.kubernetes.io/zone=eu-west-1c
                        kube-aws.coreos.com/role=controller
                        kubernetes.io/hostname=ip-10-1-45-70.eu-west-1.compute.internal
Annotations:            node.alpha.kubernetes.io/ttl=0
                        volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:                 node.alpha.kubernetes.io/role=master:NoSchedule
CreationTimestamp:      Fri, 28 Apr 2017 10:02:13 +0200
Phase:
Conditions:
  Type                  Status  LastHeartbeatTime                       LastTransitionTime                      Reason                          Message
  ----                  ------  -----------------                       ------------------                      ------                          -------
  OutOfDisk             False   Sun, 30 Apr 2017 18:29:04 +0200         Fri, 28 Apr 2017 10:02:13 +0200         KubeletHasSufficientDisk        kubelet has sufficient disk space available
  MemoryPressure        False   Sun, 30 Apr 2017 18:29:04 +0200         Fri, 28 Apr 2017 10:02:13 +0200         KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure          False   Sun, 30 Apr 2017 18:29:04 +0200         Fri, 28 Apr 2017 10:02:13 +0200         KubeletHasNoDiskPressure        kubelet has no disk pressure
  Ready                 True    Sun, 30 Apr 2017 18:29:04 +0200         Fri, 28 Apr 2017 10:02:13 +0200         KubeletReady                    kubelet is posting ready status
Addresses:              10.1.45.70,10.1.45.70,ip-10-1-45-70.eu-west-1.compute.internal,ip-10-1-45-70.eu-west-1.compute.internal
Capacity:
 cpu:           2
 memory:        4049512Ki
 pods:          110
Allocatable:
 cpu:           2
 memory:        3947112Ki
 pods:          110
System Info:
 Machine ID:                    8e025a21a4254e11b028584d9d8b12c4
 System UUID:                   EC21EB49-60D8-2923-FFA2-D2D16B6A97A6
 Boot ID:                       a91ab878-68a4-4a44-ac50-34c7d3211920
 Kernel Version:                4.9.24-coreos
 OS Image:                      Container Linux by CoreOS 1353.7.0 (Ladybug)
 Operating System:              linux
 Architecture:                  amd64
 Container Runtime Version:     docker://1.12.6
 Kubelet Version:               v1.6.2+coreos.0
 Kube-Proxy Version:            v1.6.2+coreos.0
ExternalID:                     i-08934667737e24263
Non-terminated Pods:            (4 in total)
  Namespace                     Name                                                                            CPU Requests    CPU Limits      Memory Requests Memory Limits
  ---------                     ----                                                                            ------------    ----------      --------------- -------------
  kube-system                   kube-apiserver-ip-10-1-45-70.eu-west-1.compute.internal                         0 (0%)          0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-controller-manager-ip-10-1-45-70.eu-west-1.compute.internal                200m (10%)      0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-proxy-ip-10-1-45-70.eu-west-1.compute.internal                             0 (0%)          0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-scheduler-ip-10-1-45-70.eu-west-1.compute.internal                         100m (5%)       0 (0%)          0 (0%)          0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits      Memory Requests Memory Limits
  ------------  ----------      --------------- -------------
  300m (15%)    0 (0%)          0 (0%)          0 (0%)
Events:         <none>

Heapster deployment definition

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: heapster-v1.3.0
  namespace: kube-system
  labels:
    k8s-app: heapster
    kubernetes.io/cluster-service: "true"
    version: v1.3.0
spec:
  replicas: 3
  # selector:
    # matchLabels:
      # k8s-app: heapster
      # version: v1.3.0
  template:
    metadata:
      labels:
        k8s-app: heapster
        version: v1.3.0
      # annotations:
        # scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      # nodeSelector:
        # # kube-aws.coreos.com/role: controller
        # kubernetes.io/hostname: ip-10-1-45-239.eu-west-1.compute.internalal             
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: 'node.alpha.kubernetes.io/role'
        operator: Equal
        value: master
        effect: NoSchedule
      affinity:
        podAntiAffinity:
        # podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - heapster
            topologyKey: kubernetes.io/hostname 
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kube-aws.coreos.com/role
                operator: In
                values:
                - controller
              # - key: kubernetes.io/hostname
                # operator: In
                # values:
                # - ip-10-1-45-239.eu-west-1.compute.internal                
      containers:
        - image: gcr.io/google_containers/heapster:v1.3.0
          name: heapster
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8082
              scheme: HTTP
            initialDelaySeconds: 180
            timeoutSeconds: 5
          resources:
            limits:
              cpu: 80m
              memory: 200Mi
            requests:
              cpu: 80m
              memory: 200Mi
          command:
            - /heapster
            - --source=kubernetes.summary_api:''
        - image: gcr.io/google_containers/addon-resizer:1.6
          name: heapster-nanny
          resources:
            limits:
              cpu: 50m
              memory: 90Mi
            requests:
              cpu: 50m
              memory: 90Mi
          env:
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: MY_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          command:
            - /pod_nanny
            - --cpu=80m
            - --extra-cpu=4m
            - --memory=200Mi
            - --extra-memory=4Mi
            - --threshold=5
            - --deployment=heapster-v1.3.0
            - --container=heapster
            - --poll-period=300000
            - --estimator=exponential

List of controller nodes

ip-10-1-45-70.eu-west-1.compute.internal
ip-10-1-43-190.eu-west-1.compute.internal

Heapster only running on workers

heapster-v1.3.0-504010935-7zmr9                                     2/2       Running   0          2d        10.200.31.7   ip-10-1-43-118.eu-west-1.compute.internal
heapster-v1.3.0-504010935-qjrfx                                     2/2       Running   0          2d        10.200.14.6   ip-10-1-44-100.eu-west-1.compute.internal
heapster-v1.3.0-504010935-wh2lx                                     2/2       Running   0          2d        10.200.15.4   ip-10-1-45-91.eu-west-1.compute.internal
mumoshu commented 7 years ago

@paalkr Sorry for the long silence! Hmm, that's strange. Would you mind sharing the output of kubectl get no --show-labels?

The only possible causes in my mind for now are a bug in k8s that breaks nodeAffinity, or a bug in kube-aws that adds uniform labels to all nodes, not only controllers but also workers.

mumoshu commented 7 years ago

@paalkr Any updates on this?

Also - is there any chance you had used an old version of kubectl on your machine (rather than the one run by kube-aws within the cluster) to update the heapster deployment? An old kubectl would strip all the fields like affinity and tolerations added in k8s 1.6.

If that's the case, I'd suggest upgrading kubectl first and testing whether it works, and then modifying cloud-config-controller to instruct kube-aws to do it for you.
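
A quick way to check for such a version skew:

# Compare Client Version vs Server Version; a v1.5.x client can't round-trip
# the tolerations/affinity fields introduced in v1.6
kubectl version --short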

paalkr commented 7 years ago

I discovered that it works if I disable the nanny heapster container. The nanny is responsible for scaling the pod; any reason in particular to use the nanny rather than HPA for scaling heapster?

mumoshu commented 7 years ago

@paalkr Thanks for the reply!

According to its last-changed date, the nanny seems to use an older version of the Kubernetes client to communicate with k8s.

Also, how it updates a k8s deployment is problematic - it uses Update instead of Patch. It certainly strips all the recently introduced fields like tolerations and affinity when it reads and then overwrites the heapster deployment.
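
For contrast, a merge patch only sends the fields being changed, so fields the client doesn't know about survive; a minimal illustration (not a fix for the nanny itself):

# Only .spec.replicas is sent; tolerations/affinity already in the live object are left untouched
kubectl -n kube-system patch deployment heapster-v1.3.0 --type merge -p '{"spec":{"replicas":1}}'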

So, I'd suggest not using addon_resizer, or sending a PR to fix the problem.

Also, I guess what you want is VPA rather than HPA, as scaling "out" heapster doesn't make sense? What you want is scaling "up" heapster, right? Anyway, VPA is still under development in the kubernetes/autoscaler repo, though.

paalkr commented 7 years ago

@mumoshu , thanks for the confirmation. I had come to the conclusion that the nanny somehow interfered with my desired goal, and you just confirmed and pinpointed the problem: the nanny is actually using an old version of the k8s client that isn't compatible with tolerations and affinity ;)

I'm not sure if I understand what you mean by scaling heapster up instead of out. There isn't much documentation on the VPA initiative yet regarding its objective and mechanics, or am I looking in the wrong place?

paalkr commented 7 years ago

I'm sorry to "bump" this issue. But are there any news in regards to how to solve this problem?

redbaron commented 7 years ago

I followed the discussion and can't understand why running kube-system pods on workers is something you try to avoid. It doesn't matter where the pods are running; nodes will go down on updates.

You can shorten that period by enabling nodeDrainer, which was updated recently and now evicts pods correctly, so the downtime would be only the time needed to start a pod on a new node.
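
For reference, the drainer is switched on in cluster.yaml roughly like this (key path assumed from the kube-aws docs of that era; check the documentation for your version):

experimental:
  nodeDrainer:
    enabled: true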

mumoshu commented 7 years ago

@paalkr

I'm not sure if I understand what you mean by scaling heapster up instead of out.

Scaling up here means that you provide more resources (cpu, memory) to the single heapster pod so that heapster can keep collecting metrics even when your cluster gets larger. Nanny and VPA automate this process. AFAIK heapster doesn't scale by adding replicas, so scaling "out" isn't an option.

mumoshu commented 7 years ago

@redbaron Thanks for chiming in 👍 I guess @paalkr's original explanation of the problem adds the context:

When I'm testing auto scaling of the worker node pools, I see that some system-critical pods are running on the worker nodes. Once in a while, when a worker node is terminated by AWS, the critical pods are terminated and redeployed to another running worker node. Not a big problem, but heapster statistics, for example, will not be available for the short period of time it takes Kubernetes to restart the pod on a running node.

Suppose you want to enable cluster autoscaling on your cluster (with CA or AWS-native autoscaling): you can keep running pods on worker nodes as long as a deployment has 2 or more replicas and isn't "critical" (note that cluster-autoscaler doesn't try to terminate nodes running critical pods).

For critical singleton pods like the rescheduler, heapster, kube-dns-autoscaler, cluster-autoscaler (and tiller, if you'd like to mark it "critical"), I understand that some people would like to schedule them on controller nodes so that (1) they aren't affected by cluster autoscaling and don't incur downtime when AWS-native autoscaling is enabled on worker nodes, and (2) the k8s cluster-autoscaler is able to delete nodes, as there would be no worker nodes running critical pods.

@paalkr Do I understand your problem correctly?

mumoshu commented 7 years ago

Just to be sure:

redbaron commented 7 years ago

No, the question had nothing to do with the cluster autoscaler, or I fail to see it:

Once in a while when a worker node is terminated by AWS,

Once in a while AWS terminates VMs and really doesn't care whether your VM is in a Nodepool, Controller, or Etcd ASG, so moving a critical pod from one ASG to another buys you nothing in that regard.

I agree with your argument that there is a use case for running certain pods on controllers; thanks for the clarification.

mumoshu commented 7 years ago

@redbaron Thanks,

Once in a while AWS terminates VMs and really doesn't care whether your VM is in a Nodepool, Controller, or Etcd ASG, so moving a critical pod from one ASG to another buys you nothing in that regard.

Yes, I agree with you. There's no way to avoid a short downtime in such a case! Thanks for the clarification too 👍

paalkr commented 7 years ago

@mumoshu , @redbaron

Absolutely correct. EC2 instances will fail; it's just a matter of time. And if that instance happens to be the node running heapster, you will be without statistics while the pod is moved to a running node and a replacement instance is added back to the cluster. My goal here was actually to run at least two instances of heapster in parallel on different nodes, but I understand now that heapster won't be particularly happy about having any siblings playing alongside it ;)

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-incubator/kube-aws/issues/566#issuecomment-504086747):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.