wbuchwalter / Kubernetes-acs-engine-autoscaler

[Deprecated] Node-level autoscaler for Kubernetes clusters created with acs-engine.

More debug verbosity #70

Open yaron-idan opened 6 years ago

yaron-idan commented 6 years ago

We have run into a few cases where the autoscaler got stuck for a long time in this type of loop:

2017-12-17 11:24:45,581 - autoscaler.cluster - INFO - ++++ Maintenance Begins ++++++
2017-12-17 11:24:45,581 - autoscaler.engine_scaler - INFO - ++++ Maintaining Nodes ++++++
2017-12-17 11:24:45,582 - autoscaler.engine_scaler - INFO - node: <redacted>                                                       state: busy
2017-12-17 11:24:45,583 - autoscaler.engine_scaler - INFO - node: <redacted>                                                       state: busy
2017-12-17 11:24:45,584 - autoscaler.cluster - INFO - ++++ Maintenance Ends ++++++
2017-12-17 11:25:45,644 - autoscaler.cluster - INFO - ++++ Running Scaling Loop ++++++
2017-12-17 11:25:45,982 - autoscaler.cluster - INFO - Pods to schedule: 1
2017-12-17 11:25:45,982 - autoscaler.cluster - INFO - ++++ Scaling Up Begins ++++++
2017-12-17 11:25:45,982 - autoscaler.cluster - INFO - Nodes: 2
2017-12-17 11:25:45,982 - autoscaler.cluster - INFO - To schedule: 1
2017-12-17 11:25:45,982 - autoscaler.cluster - INFO - Pending pods: 1
2017-12-17 11:25:45,982 - autoscaler.cluster - DEBUG - <redacted>
2017-12-17 11:25:45,982 - autoscaler.scaler - INFO - ====Scaling for 1 pods ====
2017-12-17 11:25:45,983 - autoscaler.scaler - DEBUG - units_needed: 1
2017-12-17 11:25:45,983 - autoscaler.scaler - DEBUG - units_requested: 1
2017-12-17 11:25:45,983 - autoscaler.scaler - DEBUG - devops actual capacity: 2 , units requested: 1
2017-12-17 11:25:45,983 - autoscaler.scaler - INFO - New capacity requested for pool devops: 3 agents (current capacity: 2 agents)
2017-12-17 11:25:45,983 - autoscaler.scaler - DEBUG - remaining pending: 0
2017-12-17 11:25:45,983 - autoscaler.deployments - INFO - Requested a new deployment with unchanged pool sizes, skipping.
{"pod_name": "<redacted>", "pod_id": "<redacted>", "_log_streaming_target_mapping": "kubernetes-acs-engine-autoscaler", "units_requested": {"devops": 3}, "message": "scale", "time": "2017-12-17T11:25:45.983602"}
2017-12-17 11:25:46,183 - autoscaler.notification - DEBUG - SLACK: ok
2017-12-17 11:25:46,183 - autoscaler.cluster - INFO - ++++ Scaling Up Ends ++++++

We suspect this happens because of a bug in how the autoscaler compares Kubernetes metrics with the responses coming from the azure-sdk, but only the Kubernetes data is being logged. I would like to add more debug logging, mostly the ACS deployment being requested, the pool sizes it wants to change to, etc.
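Roughly the kind of extra DEBUG lines I have in mind (a sketch only; the function and parameter names below are illustrative, not taken from the repo):

```python
import logging

logger = logging.getLogger('autoscaler.deployments')

def log_scale_request(requested_pool_sizes, current_pool_sizes, last_requested_pool_sizes):
    """Log both sides of the comparison that decides whether to redeploy.

    The arguments are dicts of pool name -> agent count, e.g. {'devops': 3};
    the function and parameter names are illustrative, not from the repo.
    """
    logger.debug("requested pool sizes: %s", requested_pool_sizes)
    logger.debug("current pool sizes (azure-sdk): %s", current_pool_sizes)
    logger.debug("pool sizes of the last requested deployment: %s", last_requested_pool_sizes)
```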

WDYT?

wbuchwalter commented 6 years ago

New capacity requested for pool devops: 3 agents (current capacity: 2 agents)

The current pool size is 2 and the new requested size is 3.

Requested a new deployment with unchanged pool sizes, skipping.

This means that while there are currently only 2 nodes in the pool, the autoscaler has already created a deployment with 3 nodes in the devops pool. Most likely that deployment is still in progress, so the autoscaler does not redeploy a second time.
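For clarity, the skip is just a guard against requesting the same deployment twice; conceptually it is something like this (a simplified sketch, not the actual code in deployments.py):

```python
def should_deploy(requested_pool_sizes, last_requested_pool_sizes):
    """Skip when the requested sizes match the previous request, i.e. a
    deployment with these pool sizes is presumably still in flight."""
    if requested_pool_sizes == last_requested_pool_sizes:
        # -> "Requested a new deployment with unchanged pool sizes, skipping."
        return False
    return True
```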

Deploying a new node should take somewhere between 4 and 12 minutes, depending on the VM type.

Does this answer your question?

yaron-idan commented 6 years ago

I assumed this was the reason thanks to this comment: https://github.com/wbuchwalter/Kubernetes-acs-engine-autoscaler/blob/master/autoscaler/deployments.py#L14. I just thought it would be helpful to get more debug info about the data used in this condition (requested pool sizes and current pool sizes).

This is probably something for a separate issue, but we had a case where the cluster was stuck in this state for several hours and was only released when the load got high enough to trigger the scaling of an additional node.

wbuchwalter commented 6 years ago

Yep, I understand what you mean.

The issue is that, from the autoscaler's point of view, there isn't really a way to know whether the VM hasn't joined because it's still being created, or because there was an issue with kubectl or anything else, so there isn't any relevant info that could be logged here. I could still make periodic requests to Azure to check the state of the deployment and report if it fails, but the deployment might report as successful while the VM has trouble joining the Kubernetes cluster, so this would be an incomplete solution...
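For what it's worth, polling the deployment state would look roughly like this with the azure-mgmt-resource SDK (a sketch only; the credentials, subscription, resource group and deployment name are placeholders, and the SDK surface may differ between versions):

```python
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.resource import ResourceManagementClient

# Placeholder credentials; the autoscaler already has this configuration.
credentials = ServicePrincipalCredentials(client_id='...', secret='...', tenant='...')
client = ResourceManagementClient(credentials, '<subscription-id>')

def deployment_failed(resource_group, deployment_name):
    """Return True if the ARM deployment finished in a failed state."""
    deployment = client.deployments.get(resource_group, deployment_name)
    state = deployment.properties.provisioning_state  # 'Running', 'Succeeded', 'Failed', ...
    return state == 'Failed'
```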

yaron-idan commented 6 years ago

I thought of another way to solve this. Whenever the autoscaler reaches this state, we could use the Azure CLI to query the launch time of the instance created by the deployment and, after a certain point, declare the instance faulty, destroy it and launch a new one. This might require implementing some new functionality, but I think it's a good way of retrying failed scaling attempts.
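To make the idea concrete, the timeout check itself could be as simple as this (a sketch; the threshold and how the faulty VM would actually be destroyed and recreated are open questions):

```python
import datetime

# Assumed threshold, generously above the 4-12 minutes a normal deployment takes.
JOIN_TIMEOUT = datetime.timedelta(minutes=20)

def scale_up_is_faulty(deployment_requested_at, node_has_joined):
    """Declare the scale-up faulty if the new node still hasn't joined in time."""
    if node_has_joined:
        return False
    return datetime.datetime.utcnow() - deployment_requested_at > JOIN_TIMEOUT
```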

veeshall commented 6 years ago

I am in a similar scenario. The autoscaler triggered a deployment adding 2 nodes, but that deployment failed and the VMs did not join the cluster. How do I get out of this logjam? Should I delete the autoscaler and redeploy it (and clean up the failed deployment components on Azure)?

veeshall commented 6 years ago

Here is what I did:

1) Got the nodes that Kubernetes was reporting as "Ready".
2) In Azure, identified the VMs that did not join the cluster (or were in a failed state).
3) Removed those VMs (and their NICs and OS disks from the storage accounts).
4) Deleted the autoscaler.
5) Re-installed the autoscaler.

Now it is working again.
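Identifying the nodes for step 1 can be scripted, for example with the kubernetes Python client (a sketch, assuming kubeconfig access to the cluster):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def ready_node_names():
    """Return the names of the nodes Kubernetes reports as Ready."""
    ready = []
    for node in v1.list_node().items:
        for cond in node.status.conditions:
            if cond.type == 'Ready' and cond.status == 'True':
                ready.append(node.metadata.name)
    return ready

# Any agent VM in Azure whose name is not returned here is a candidate
# for the manual cleanup in steps 2 and 3.
print(ready_node_names())
```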

If the autoscaler reset itself to look at the initial deployment state, instead of the updated state (which might not be the current state anyway), would that help?

Of course, "failed deployment states and their cleanup" still needs to be addressed.