kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

"Failed to check cloud provider" flooding logs #6096

Closed grosser closed 10 months ago

grosser commented 1 year ago

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

1.27.3 with patch https://github.com/kubernetes/autoscaler/pull/5887

What k8s version are you using (kubectl version)?:

v1.27.2

What environment is this in?:

aws

What did you expect to happen?:

no log spam

What happened instead?:

log spam

{"level":"WARN","message":"Failed to check cloud provider has instance for ip-xxx.us-west-2.compute.internal: node is not present in aws: could not find instance {aws:///us-west-2b/i-xyz i-xyz}"}

seems like 1 line for every instance in use

How to reproduce it (as minimally and precisely as possible):

unclear

Anything else we need to know?:

seems very similar to https://github.com/kubernetes/autoscaler/issues/5842 but the patch for it did not solve our issue

mro-amboss commented 1 year ago

I have CA 1.27.2 and am seeing the same error. Kubernetes version: 1.24. Env: AWS EKS

brydoncheyney commented 12 months ago

Observing the same issue. The CA logs each non-ASG node on every control loop. We're provisioning "platform" nodes on Managed Node Group instances and using karpenter to provision "worker" nodes, which should be excluded from the check.

CA: 1.27.2 / 1.28.0. Kubernetes version: 1.28. Env: AWS EKS

shapirus commented 11 months ago

Same issue here with CA 1.28.0 / k8s 1.28.2.

The ASGs are managed by kops.

I1017 10:21:10.920315       1 cluster.go:175] node i-06de46c41ffb0aaaa is not suitable for removal: can reschedule only 0 out of 1 pods
[at this point all is working fine]
...
I1017 10:25:01.973574       1 auto_scaling_groups.go:393] Regenerating instance to ASG map for ASG names: [ list-of-all-ASGs ]
I1017 10:25:02.099392       1 auto_scaling_groups.go:400] Regenerating instance to ASG map for ASG tags: map[]
I1017 10:25:02.100362       1 auto_scaling_groups.go:142] Updating ASG <ASG-that-contains-the-node-in-question>
...
W1017 10:27:32.735121       1 clusterstate.go:1033] Failed to check cloud provider has instance for i-06de46c41ffb0aaaa: node is not present in aws: could not find instance {aws:///eu-central-1c/i-06de46c41ffb0aaaa i-06de46c41ffb0aaaa}
...
I1017 10:27:32.735701       1 pre_filtering_processor.go:57] Node i-06de46c41ffb0aaaa should not be processed by cluster autoscaler (no node group config)

What does no node group config mean?

The instance, contrary to what CA says, exists and can be viewed in the AWS console. CA seems to lose it after the "Updating ASG" message.

brydoncheyney commented 11 months ago

What does no node group config mean?

The instance, contrary to what CA says, exists and can be viewed in the AWS console. CA seems to lose it after the "Updating ASG" message.

From what I understand, on each control loop the operator scans the cluster and reconciles the node group ASG cache with the current cluster node instances. As part of this reconciliation, it compares the current cluster nodes against the cache so that the cache state can be updated when cluster nodes have been deleted. That means it is checking all cluster nodes against the (smaller) set of nodes managed by the node group ASGs, which is why the Failed to check error is reported: the operator could not find instance (in the ASG cache) because those instances are not managed by any ASG. It's not that it doesn't find the node in the cloud provider, but rather that the node isn't managed by a node group ASG.

Similarly, when processing nodes that may be eligible for scale-down, nodes that are not managed by the autoscaler ASG are reported with the (no node group config) warning.
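A minimal sketch of that reconciliation path, using a plain map in place of the real ASG cache (the names and types here are illustrative, not the actual cluster-autoscaler code):

package main

import "fmt"

// asgCache stands in for the autoscaler's per-ASG instance cache: it only
// contains instances that belong to a discovered, autoscaled ASG.
var asgCache = map[string]bool{
	"aws:///us-west-2b/i-managed1": true,
}

// hasInstance mimics the ASG-backed lookup: a node is "found" only if its
// provider ID is in the ASG cache, regardless of whether it exists in EC2.
func hasInstance(providerID string) bool {
	return asgCache[providerID]
}

func main() {
	// All registered cluster nodes, including ones no ASG manages
	// (karpenter workers, fargate nodes, control-plane instances, ...).
	clusterNodes := []string{
		"aws:///us-west-2b/i-managed1",
		"aws:///us-west-2b/i-karpenter1",
	}
	for _, id := range clusterNodes {
		if !hasInstance(id) {
			// The warning path: the instance exists in AWS, it just is not
			// in the (smaller) set of ASG-managed instances.
			fmt.Printf("W Failed to check cloud provider has instance for %s: node is not present in aws\n", id)
			// Scale-down pre-filtering takes the same view: with no owning
			// node group, the node is skipped with "(no node group config)".
			fmt.Printf("I Node %s should not be processed by cluster autoscaler (no node group config)\n", id)
		}
	}
}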

There was a recent related issue raised with fargate-provisioned nodes also flooding the logs; this was resolved by simply stripping out those nodes based on their name prefix.
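For reference, a sketch of that kind of prefix-based filter; EKS fargate node names start with fargate-, and the helper name here is illustrative rather than the actual code from #5887:

package main

import (
	"fmt"
	"strings"
)

// isFargateNode skips nodes by naming convention before any ASG lookup,
// since fargate nodes are never members of an ASG.
func isFargateNode(nodeName string) bool {
	return strings.HasPrefix(nodeName, "fargate-")
}

func main() {
	nodes := []string{
		"fargate-ip-10-0-1-5.us-west-2.compute.internal",
		"ip-10-0-2-7.us-west-2.compute.internal",
	}
	for _, n := range nodes {
		if isFargateNode(n) {
			fmt.Printf("skipping %s: not ASG-managed by design\n", n)
			continue
		}
		fmt.Printf("checking %s against the ASG cache\n", n)
	}
}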

The issue here appears to be HasInstance within the context of the ASG cache vs the node instances cache? Essentially, when there are self-managed nodes in the cluster (say, provisioned by karpenter) that are not managed by any ASG, these will be reported on each reconciliation as "Failed to check... could not find instance" and "should not be processed (no node group config)". Even just contextualising these messages a little better, or making them suppressible with a change to log levels, would be an improvement.
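To illustrate the log-level point: klog warnings are always emitted, while V(n)-gated messages can be silenced with the -v flag. A small sketch (not the current CA code) of what a suppressible variant could look like:

package main

import (
	"flag"
	"fmt"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Parse() // operators control verbosity with -v

	node := "i-06de46c41ffb0aaaa"
	err := fmt.Errorf("could not find instance")

	// Today: an unconditional warning, printed on every control loop.
	klog.Warningf("Failed to check cloud provider has instance for %s: %v", node, err)

	// Sketch: gate the message behind a verbosity level so -v can silence it.
	klog.V(4).Infof("Failed to check cloud provider has instance for %s: %v", node, err)

	klog.Flush()
}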

brydoncheyney commented 11 months ago

in fact... this was summarised far more succinctly at the time in the k8s autoscaling sig agenda

fix: CA on fargate causing log flood

  • https://github.com/kubernetes/autoscaler/pull/5887
    • Scenario: EKS cluster running on fargate but uses CA to scale some non-fargate nodegroups in the cluster
    • Problem: CA reads all K8s nodes including fargate ones and checks if the instance is present in AWS (by calling HasInstance) by checking if the instance is part of some ASG. This leads to error (because fargate nodes are not part of ASGs) and causes log flooding.

shapirus commented 11 months ago

Well, in my case, for the node that CA complains about in the log, I have a suspicion that it may be because of non-unique instance group names (other clusters in the same region can have IGs with the same names) and hence non-unique k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup tag values.

As an experiment, I have now renamed the instance groups to make their names unique. Will watch the logs to see if the issue persists.

Update: renaming the IGs did not help. However, all of the nodes that CA currently reports as node is not present in aws and should not be processed by cluster autoscaler (no node group config) are control-plane nodes. The one I initially mentioned was a regular node, but I can't say for certain that it wasn't about to be shut down at that time. I will post another update if I notice CA saying this again about any node that isn't a control-plane node and is definitely not being deleted.

grosser commented 10 months ago

Can confirm that the bug is that non-autoscaling nodegroups trigger this: I checked our logs and it's only API servers and etcd members, for which we have autoscaling disabled.

grosser commented 10 months ago

Solution 1: Make the autoscaler keep all ASGs in its aws.awsManager.asgCache; not sure what the side effects would be. Solution 2: Have users annotate every node with "leave me alone", which is not great but makes the warning useful again ... maybe CA can automate that annotation 🤷 see https://github.com/kubernetes/autoscaler/pull/6265
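A rough sketch of what Solution 2 could look like inside a node filter; the annotation key below is hypothetical, not the one from the PR:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ignoreAnnotation is a hypothetical opt-out key used only for illustration.
const ignoreAnnotation = "cluster-autoscaler.kubernetes.io/ignore"

// shouldSkip drops annotated nodes before any cloud-provider lookup, so
// api-server/etcd nodes would never trigger the HasInstance warning.
func shouldSkip(node *corev1.Node) bool {
	return node.Annotations[ignoreAnnotation] == "true"
}

func main() {
	etcd := &corev1.Node{ObjectMeta: metav1.ObjectMeta{
		Name:        "etcd-1",
		Annotations: map[string]string{ignoreAnnotation: "true"},
	}}
	fmt.Println(shouldSkip(etcd)) // true: filtered out, no log spam
}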

yongzhang commented 9 months ago

I don't think hacking on Kubernetes resources is a good solution; should CA focus on ASG tags for a final solution?

james-callahan commented 5 months ago

Sadly an annotation isn't a great way to do this (label might be better?), as you can't pass node annotations to kubelet when running it (see https://github.com/kubernetes/kubernetes/issues/108046)
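For context, kubelet can apply labels at registration time via its --node-labels flag, but it has no equivalent flag for annotations, so an annotation-based opt-out would have to be applied to the node after it has already joined the cluster.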