kubernetes / autoscaler

Autoscaling components for Kubernetes

bulk scale-up in azure creates only one node per iteration sometimes #1984

Closed: palmerabollo closed this issue 5 years ago

palmerabollo commented 5 years ago

I think that cluster-autoscaler (CA) 1.3.x in Azure has problems dealing with affinity rules.

I use the following Deployment to run a "pause" pod with two rules (a required node affinity and a required pod anti-affinity):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
  labels:
    app: pause
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: poolName
                    operator: In
                    values:
                      - genmlow
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - pause
              topologyKey: kubernetes.io/hostname
      containers:
      - image: "karlherler/pause:1.0"
        name: pause

The agentpool "genmlow" uses Standard_DS2_v2 machines (8GB) in a virtual machine scale set.
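
For reference, this is roughly how I check that the nodes in that pool carry the poolName label the node affinity targets, and which VM size the scale set uses (the resource group name below is a placeholder):

kubectl get nodes -L poolName
az vmss show -g <resource-group> -n k8s-genmlow-24772259-vmss --query sku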

When I scale the number of replicas to 10 (kubectl scale deployment pause --replicas=10), the cluster autoscaler (version 1.3.9, k8s 1.11.8) creates only one node per iteration, as if it were ignoring the affinity rules. See the cluster-autoscaler logs below, where the node count goes from 0->1->2->...->N.

I0503 14:03:19.299146       1 azure_manager.go:261] Refreshed ASG list, next refresh after 2019-05-03 14:04:19.2991386 +0000 UTC m=+948.211672501
I0503 14:03:19.993383       1 scale_up.go:249] Pod default/pause-66cf84dcdb-2khzb is unschedulable
I0503 14:03:19.993412       1 scale_up.go:249] Pod default/pause-66cf84dcdb-l7587 is unschedulable
I0503 14:03:19.993418       1 scale_up.go:249] Pod default/pause-66cf84dcdb-t5mb8 is unschedulable
I0503 14:03:19.993422       1 scale_up.go:249] Pod default/pause-66cf84dcdb-xp2kn is unschedulable
I0503 14:03:19.993426       1 scale_up.go:249] Pod default/pause-66cf84dcdb-rpskf is unschedulable
I0503 14:03:19.993429       1 scale_up.go:249] Pod default/pause-66cf84dcdb-kkxc5 is unschedulable
I0503 14:03:19.993433       1 scale_up.go:249] Pod default/pause-66cf84dcdb-lbprj is unschedulable
I0503 14:03:19.993437       1 scale_up.go:249] Pod default/pause-66cf84dcdb-lmwmf is unschedulable
I0503 14:03:19.993441       1 scale_up.go:249] Pod default/pause-66cf84dcdb-c8njm is unschedulable
I0503 14:03:19.993446       1 scale_up.go:249] Pod default/pause-66cf84dcdb-gg6xh is unschedulable
...
I0503 14:03:20.071931       1 utils.go:187] Pod pause-66cf84dcdb-kkxc5 can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
I0503 14:03:20.072229       1 utils.go:187] Pod pause-66cf84dcdb-lbprj can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
I0503 14:03:20.072529       1 utils.go:187] Pod pause-66cf84dcdb-lmwmf can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
I0503 14:03:20.073242       1 utils.go:187] Pod pause-66cf84dcdb-c8njm can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
...
I0503 14:03:20.076758       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:03:20.076770       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:03:20.076783       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 0->1 (max: 1000)}]
I0503 14:03:20.076796       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 1
...
I0503 14:06:13.334377       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:06:13.334411       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:06:13.334470       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 1->2 (max: 1000)}]
I0503 14:06:13.334503       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 2
...
I0503 14:09:02.059191       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:09:02.059243       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:09:02.059310       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 2->3 (max: 1000)}]
I0503 14:09:02.059350       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 3
...
I0503 14:11:50.214206       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:11:50.214228       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:11:50.214245       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 3->4 (max: 1000)}]
I0503 14:11:50.214262       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 4
...
...

However, it only behaves this way when the pod has no resource requests. If I add the following requests:

  resources:
    requests:
      memory: 5Gi
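
For clarity, that block sits under the container spec of the same Deployment shown above:

      containers:
      - image: "karlherler/pause:1.0"
        name: pause
        resources:
          requests:
            memory: 5Gi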

Everything works as expected: the cluster autoscaler creates the 10 virtual machines in a single batch (0->10). I guess this is because the autoscaler now knows it cannot fit two pods on a single node (5Gi + 5Gi > 8GB), even though it still ignores the affinity rules.

I0503 14:31:36.574678       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:31:36.574722       1 scale_up.go:382] Estimated 10 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:31:36.574752       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 0->10 (max: 1000)}]
I0503 14:31:36.574786       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 10

It looks like a bug to me. The same setup on AWS (cluster autoscaler 1.2.x instead of 1.3.x is the only difference) works fine: the CA creates the 10 virtual machines regardless of whether the container memory requests are specified.

MaciekPytel commented 5 years ago

It's a known issue with pod affinity / anti-affinity: https://github.com/kubernetes/autoscaler/issues/257#issuecomment-364449232. The details are in the issue I linked, but in general pod affinity and (especially) anti-affinity don't work well with CA. It can cause CA to add nodes one by one, as you observed, and it completely breaks CA performance on large clusters. It's not easy to fix, because it's caused by pod affinity being implemented in a way that is conceptually incompatible with how the autoscaler works. Fixing it would require a significant refactor of either the scheduler or the autoscaler, neither of which is likely to happen soon.
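
To make the shape of that incompatibility a bit more concrete, here is a toy model of the two estimates (my own sketch, not the actual cluster-autoscaler estimator; the mechanism it encodes for the anti-affinity case is an assumption, flagged in the comments):

// Toy model only -- NOT the real cluster-autoscaler code. It encodes the
// assumption (an illustration, not a claim about the exact code path) that
// memory requests can be bin-packed against a node template in a single
// pass, while a required inter-pod anti-affinity term cannot be fully
// evaluated against nodes that only exist in the simulation, so each
// scale-up loop only justifies one extra node.
package main

import "fmt"

const nodeMemGi = 8 // Standard_DS2_v2 memory as described in the issue

// estimateByMemory bin-packs pods onto fresh nodes purely by memory request,
// which is enough to plan 0->10 in one pass when 5Gi + 5Gi > 8.
func estimateByMemory(podMemGi, pods int) int {
	perNode := nodeMemGi / podMemGi
	if perNode < 1 {
		perNode = 1
	}
	return (pods + perNode - 1) / perNode
}

// estimateWithAntiAffinityOnly mimics an estimator whose anti-affinity check
// can only see already-existing nodes: every pass concludes that exactly one
// more node is needed, matching the 0->1->2->... pattern in the logs above.
func estimateWithAntiAffinityOnly(pods int) int {
	if pods == 0 {
		return 0
	}
	return 1
}

func main() {
	fmt.Println("with 5Gi requests: ", estimateByMemory(5, 10), "nodes per plan")   // 10
	fmt.Println("anti-affinity only:", estimateWithAntiAffinityOnly(10), "node per plan") // 1
}

In this toy model the request-based estimate jumps straight to 10 nodes, while the anti-affinity-only case collapses to one node per iteration, which is the difference visible in the two sets of logs.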

palmerabollo commented 5 years ago

Thanks @MaciekPytel. What I don't understand is why it works well on AWS. Shouldn't that logic be shared among all cloud implementations?

feiskyer commented 5 years ago

/assign