wbuchwalter / Kubernetes-acs-engine-autoscaler

[Deprecated] Node-level autoscaler for Kubernetes clusters created with acs-engine.

Spare agent not honoured when node in state idle-unschedulable? #73

Closed · omerlh closed this 6 years ago

omerlh commented 6 years ago

I'm running the autoscaler with the following parameters:

 "command": [
              "python",
              "main.py",
              "--resource-group",
              "solutohome1-kubernetes7",
              "--acs-deployment",
              "solutohome1-kubernetes7-1794445748",
              "--idle-threshold",
              "120",
              "-vvv",
              "--spare-agents",
              "2"
            ],

The cluster should maintain at least 2 worker nodes. So I ran some containers to create load on the cluster so that the autoscaler would scale from 1 node to 2 nodes (according to #66). After I turned off the containers, the autoscaler scaled down the new node:

2017-12-21 14:51:58,008 - autoscaler.cluster - INFO - ++++ Maintenance Begins ++++++
2017-12-21 14:51:58,008 - autoscaler.engine_scaler - INFO - ++++ Maintaining Nodes ++++++
2017-12-21 14:51:58,009 - autoscaler.engine_scaler - INFO - node: <>-0                                                       state: under-utilized-undrainable
2017-12-21 14:51:58,010 - autoscaler.engine_scaler - INFO - node: <>-1                                                       state: under-utilized-undrainable
2017-12-21 14:51:58,010 - autoscaler.cluster - INFO - ++++ Maintenance Ends ++++++
2017-12-21 14:52:58,050 - autoscaler.cluster - INFO - ++++ Running Scaling Loop ++++++
2017-12-21 14:52:58,221 - autoscaler.cluster - INFO - Pods to schedule: 0
2017-12-21 14:52:58,221 - autoscaler.cluster - INFO - ++++ Scaling Up Begins ++++++
2017-12-21 14:52:58,221 - autoscaler.cluster - INFO - Nodes: 2
2017-12-21 14:52:58,221 - autoscaler.cluster - INFO - To schedule: 0
2017-12-21 14:52:58,221 - autoscaler.cluster - INFO - Pending pods: 0
2017-12-21 14:52:58,221 - autoscaler.cluster - INFO - ++++ Scaling Up Ends ++++++
2017-12-21 14:52:58,221 - autoscaler.cluster - INFO - ++++ Maintenance Begins ++++++
2017-12-21 14:52:58,221 - autoscaler.engine_scaler - INFO - ++++ Maintaining Nodes ++++++
2017-12-21 14:52:58,221 - autoscaler.engine_scaler - INFO - node: <>-0                                                       state: idle-unschedulable
2017-12-21 14:52:58,222 - autoscaler.engine_scaler - INFO - node: <>-1                                                       state: under-utilized-undrainable
2017-12-21 14:52:58,223 - autoscaler.engine_scaler - INFO - deleting <>-0
2017-12-21 14:53:06,132 - autoscaler.engine_scaler - INFO - Deleting VM for -0
2017-12-21 14:56:12,777 - autoscaler.engine_scaler - INFO - Deleting NIC for <>-0

By looking at the code (this line in engine_scaler.py), it looks like the spare agents setting is honored only when the node is in the UNDER_UTILIZED_DRAINABLE state. In my case, node 0 (the one that was provisioned and later deleted) was in the idle-unschedulable state, and I suspect this is why it was deleted.
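
For reference, here is a minimal sketch of the behaviour as I understand it (the names below are hypothetical illustrations, not the actual engine_scaler.py source): the spare-agent floor only guards the cordon/drain decision, while a node that is already idle-unschedulable is deleted without that check.

UNDER_UTILIZED_DRAINABLE = "under-utilized-drainable"
IDLE_UNSCHEDULABLE = "idle-unschedulable"

def plan_action(state, schedulable_nodes, spare_agents):
    """Decide what to do with one node, given its state (sketch only)."""
    if state == UNDER_UTILIZED_DRAINABLE:
        # Spare agents are honored here: never drain below the floor.
        if schedulable_nodes <= spare_agents:
            return "keep (spare-agent)"
        return "cordon + drain"
    if state == IDLE_UNSCHEDULABLE:
        # Already cordoned and drained: deleted regardless of the floor.
        return "delete VM"
    return "keep"

# The second maintenance pass from the log above, with --spare-agents 2:
print(plan_action(IDLE_UNSCHEDULABLE, schedulable_nodes=1, spare_agents=2))
# -> delete VM, matching the "deleting <>-0" line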

wbuchwalter commented 6 years ago

Sorry for the delay. It's hard to say, I would need more logs to be able to tell, but basically a node in the unschedulable state means it has already been cordoned and drained. The autoscaler looks at the number of spare agents, and it should never cordon and drain a node if the number of available nodes is <= 2 (your spare-agents value). However, once a node has been cordoned and drained, the autoscaler will delete it no matter how many nodes are available, which is what is happening here. So the question is how this node ended up cordoned in the first place. Is it possible that you did a manual kubectl cordon on the node?
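
A quick way to check whether any node is currently cordoned (i.e. has spec.unschedulable set, for instance from an earlier manual kubectl cordon) is to list the nodes with the official kubernetes Python client; a minimal sketch, assuming a local kubeconfig:

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    if node.spec.unschedulable:
        print(f"{node.metadata.name} is cordoned (SchedulingDisabled)")

If one of them was cordoned by accident, kubectl uncordon <node-name> clears the flag.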

omerlh commented 6 years ago

No, pretty sure I did not. The weird part is that it somehow fixed itself: now it looks like the autoscaler behaves normally...

wbuchwalter commented 6 years ago

I'll close this issue then, as I can't really take any action on it, but if the issue comes back feel free to let me know and I'll reopen.

shaikatz commented 6 years ago

@wbuchwalter Can you please reopen the issue? I'm hitting the same behavior. Here is my full log, showing how I ended up with one node despite a spare-agents setting of 2:

2018-03-08 14:39:28,026 - autoscaler.cluster - INFO - ++++ Maintenance Begins ++++++
2018-03-08 14:39:28,026 - autoscaler.engine_scaler - INFO - ++++ Maintaining Nodes ++++++
2018-03-08 14:39:28,027 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-0                                                       state: under-utilized-undrainable
2018-03-08 14:39:28,028 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-1                                                       state: under-utilized-undrainable
2018-03-08 14:39:28,028 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-2                                                       state: under-utilized-undrainable
2018-03-08 14:39:28,028 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-3                                                       state: under-utilized-undrainable
2018-03-08 14:39:28,029 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-4                                                       state: under-utilized-undrainable
2018-03-08 14:39:28,029 - autoscaler.cluster - INFO - ++++ Maintenance Ends ++++++
2018-03-08 14:40:28,076 - autoscaler.cluster - INFO - ++++ Running Scaling Loop ++++++
2018-03-08 14:40:28,163 - autoscaler.cluster - INFO - Pods to schedule: 0
2018-03-08 14:40:28,163 - autoscaler.cluster - INFO - ++++ Scaling Up Begins ++++++
2018-03-08 14:40:28,163 - autoscaler.cluster - INFO - Nodes: 5
2018-03-08 14:40:28,163 - autoscaler.cluster - INFO - To schedule: 0
2018-03-08 14:40:28,163 - autoscaler.cluster - INFO - Pending pods: 0
2018-03-08 14:40:28,163 - autoscaler.cluster - INFO - ++++ Scaling Up Ends ++++++
2018-03-08 14:40:28,163 - autoscaler.cluster - INFO - ++++ Maintenance Begins ++++++
2018-03-08 14:40:28,163 - autoscaler.engine_scaler - INFO - ++++ Maintaining Nodes ++++++
2018-03-08 14:40:28,164 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-0                                                       state: under-utilized-undrainable
2018-03-08 14:40:28,165 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-1                                                       state: under-utilized-drainable
2018-03-08 14:40:28,209 - autoscaler.kube - INFO - cordoned k8s-node-24158305-1
2018-03-08 14:40:28,209 - autoscaler.kube - INFO - Deleting Pod default/prometheus-prometheus-node-exporter-lb45v
2018-03-08 14:40:28,242 - autoscaler.kube - INFO - Deleting Pod kube-system/kube-proxy-hc975
2018-03-08 14:40:28,295 - autoscaler.kube - INFO - Deleting Pod kube-system/snap-dx4bd
2018-03-08 14:40:28,319 - autoscaler.kube - INFO - drained k8s-node-24158305-1
2018-03-08 14:40:28,320 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-2                                                       state: under-utilized-drainable
2018-03-08 14:40:28,357 - autoscaler.kube - INFO - cordoned k8s-node-24158305-2
2018-03-08 14:40:28,357 - autoscaler.kube - INFO - Deleting Pod default/prometheus-prometheus-node-exporter-mk5xh
2018-03-08 14:40:28,405 - autoscaler.kube - INFO - Deleting Pod kube-system/kube-proxy-qpq85
2018-03-08 14:40:28,452 - autoscaler.kube - INFO - Deleting Pod kube-system/snap-7fvp9
2018-03-08 14:40:28,525 - autoscaler.kube - INFO - drained k8s-node-24158305-2
2018-03-08 14:40:28,525 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-3                                                       state: under-utilized-drainable
2018-03-08 14:40:28,562 - autoscaler.kube - INFO - cordoned k8s-node-24158305-3
2018-03-08 14:40:28,562 - autoscaler.kube - INFO - Deleting Pod default/prometheus-prometheus-node-exporter-7px6m
2018-03-08 14:40:28,614 - autoscaler.kube - INFO - Deleting Pod kube-system/kube-proxy-fc25q
2018-03-08 14:40:28,662 - autoscaler.kube - INFO - Deleting Pod kube-system/snap-6vz6p
2018-03-08 14:40:28,693 - autoscaler.kube - INFO - drained k8s-node-24158305-3
2018-03-08 14:40:28,694 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-4                                                       state: spare-agent
2018-03-08 14:40:28,694 - autoscaler.cluster - INFO - ++++ Maintenance Ends ++++++
2018-03-08 14:41:28,755 - autoscaler.cluster - INFO - ++++ Running Scaling Loop ++++++
2018-03-08 14:41:28,844 - autoscaler.cluster - INFO - Pods to schedule: 0
2018-03-08 14:41:28,844 - autoscaler.cluster - INFO - ++++ Scaling Up Begins ++++++
2018-03-08 14:41:28,845 - autoscaler.cluster - INFO - Nodes: 5
2018-03-08 14:41:28,845 - autoscaler.cluster - INFO - To schedule: 0
2018-03-08 14:41:28,845 - autoscaler.cluster - INFO - Pending pods: 0
2018-03-08 14:41:28,845 - autoscaler.cluster - INFO - ++++ Scaling Up Ends ++++++
2018-03-08 14:41:28,845 - autoscaler.cluster - INFO - ++++ Maintenance Begins ++++++
2018-03-08 14:41:28,845 - autoscaler.engine_scaler - INFO - ++++ Maintaining Nodes ++++++
2018-03-08 14:41:28,846 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-0                                                       state: under-utilized-undrainable
2018-03-08 14:41:28,846 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-1                                                       state: idle-unschedulable
2018-03-08 14:41:28,847 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-2                                                       state: idle-unschedulable
2018-03-08 14:41:28,847 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-3                                                       state: idle-unschedulable
2018-03-08 14:41:28,847 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-4                                                       state: under-utilized-drainable
2018-03-08 14:41:28,917 - autoscaler.kube - INFO - cordoned k8s-node-24158305-4
2018-03-08 14:41:28,918 - autoscaler.kube - INFO - Deleting Pod default/prometheus-prometheus-node-exporter-pkcwd
2018-03-08 14:41:28,935 - autoscaler.kube - INFO - Deleting Pod kube-system/kube-proxy-lwzh9
2018-03-08 14:41:28,971 - autoscaler.kube - INFO - Deleting Pod kube-system/snap-brhzv
2018-03-08 14:41:29,005 - autoscaler.kube - INFO - drained k8s-node-24158305-4
2018-03-08 14:41:29,005 - autoscaler.azure_api - INFO - deleting node k8s-node-24158305-1
2018-03-08 14:41:29,006 - autoscaler.azure_api - INFO - deleting node k8s-node-24158305-2
2018-03-08 14:41:29,007 - autoscaler.azure_api - INFO - deleting node k8s-node-24158305-3
2018-03-08 14:41:29,383 - autoscaler.azure_api - INFO - Deleting VM for k8s-node-24158305-2
2018-03-08 14:41:29,398 - autoscaler.azure_api - INFO - Deleting VM for k8s-node-24158305-1
2018-03-08 14:41:29,425 - autoscaler.azure_api - INFO - Deleting VM for k8s-node-24158305-3
2018-03-08 14:44:02,979 - autoscaler.azure_api - INFO - Deleting NIC for k8s-node-24158305-2
2018-03-08 14:44:03,390 - autoscaler.azure_api - INFO - Deleting NIC for k8s-node-24158305-1
2018-03-08 14:44:04,610 - autoscaler.azure_api - INFO - Deleting NIC for k8s-node-24158305-3
2018-03-08 14:44:13,868 - autoscaler.azure_api - INFO - Deleting OS disk for k8s-node-24158305-2
2018-03-08 14:44:13,937 - autoscaler.azure_api - INFO - Deleting OS disk for k8s-node-24158305-1
2018-03-08 14:44:15,541 - autoscaler.azure_api - INFO - Deleting OS disk for k8s-node-24158305-3
2018-03-08 14:44:15,860 - autoscaler.cluster - INFO - ++++ Maintenance Ends ++++++
2018-03-08 14:45:15,920 - autoscaler.cluster - INFO - ++++ Running Scaling Loop ++++++
2018-03-08 14:45:16,020 - autoscaler.cluster - INFO - Pods to schedule: 0
2018-03-08 14:45:16,020 - autoscaler.cluster - INFO - ++++ Scaling Up Begins ++++++
2018-03-08 14:45:16,020 - autoscaler.cluster - INFO - Nodes: 2
2018-03-08 14:45:16,020 - autoscaler.cluster - INFO - To schedule: 0
2018-03-08 14:45:16,020 - autoscaler.cluster - INFO - Pending pods: 0
2018-03-08 14:45:16,020 - autoscaler.cluster - INFO - ++++ Scaling Up Ends ++++++
2018-03-08 14:45:16,020 - autoscaler.cluster - INFO - ++++ Maintenance Begins ++++++
2018-03-08 14:45:16,020 - autoscaler.engine_scaler - INFO - ++++ Maintaining Nodes ++++++
2018-03-08 14:45:16,021 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-0                                                       state: under-utilized-undrainable
2018-03-08 14:45:16,022 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-4                                                       state: idle-unschedulable
2018-03-08 14:45:16,022 - autoscaler.azure_api - INFO - deleting node k8s-node-24158305-4
2018-03-08 14:45:16,094 - autoscaler.azure_api - INFO - Deleting VM for k8s-node-24158305-4
2018-03-08 14:46:47,060 - autoscaler.azure_api - INFO - Deleting NIC for k8s-node-24158305-4
2018-03-08 14:46:57,752 - autoscaler.azure_api - INFO - Deleting OS disk for k8s-node-24158305-4
2018-03-08 14:46:57,952 - autoscaler.cluster - INFO - ++++ Maintenance Ends ++++++
2018-03-08 14:47:58,013 - autoscaler.cluster - INFO - ++++ Running Scaling Loop ++++++
2018-03-08 14:47:58,074 - autoscaler.cluster - INFO - Pods to schedule: 0
2018-03-08 14:47:58,074 - autoscaler.cluster - INFO - ++++ Scaling Up Begins ++++++
2018-03-08 14:47:58,074 - autoscaler.cluster - INFO - Nodes: 1
2018-03-08 14:47:58,074 - autoscaler.cluster - INFO - To schedule: 0
2018-03-08 14:47:58,074 - autoscaler.cluster - INFO - Pending pods: 0
2018-03-08 14:47:58,074 - autoscaler.cluster - INFO - ++++ Scaling Up Ends ++++++
2018-03-08 14:47:58,075 - autoscaler.cluster - INFO - ++++ Maintenance Begins ++++++
2018-03-08 14:47:58,075 - autoscaler.engine_scaler - INFO - ++++ Maintaining Nodes ++++++
2018-03-08 14:47:58,076 - autoscaler.engine_scaler - INFO - node: k8s-node-24158305-0                                                       state: under-utilized-undrainable
2018-03-08 14:47:58,076 - autoscaler.cluster - INFO - ++++ Maintenance Ends ++++++
2018-03-08 14:48:58,096 - autoscaler.cluster - INFO - ++++ Running Scaling Loop ++++++
2018-03-08 14:48:58,171 - autoscaler.cluster - INFO - Pods to schedule: 0
2018-03-08 14:48:58,171 - autoscaler.cluster - INFO - ++++ Scaling Up Begins ++++++
2018-03-08 14:48:58,171 - autoscaler.cluster - INFO - Nodes: 1
2018-03-08 14:48:58,171 - autoscaler.cluster - INFO - To schedule: 0
2018-03-08 14:48:58,171 - autoscaler.cluster - INFO - Pending pods: 0
2018-03-08 14:48:58,171 - autoscaler.cluster - INFO - ++++ Scaling Up Ends ++++++