kubernetes / autoscaler

Autoscaling components for Kubernetes

On-premise scaling to AWS #5595

Closed: nemcikjan closed this issue 1 month ago

nemcikjan commented 1 year ago

Which component are you using?:

cluster autoscaler

Describe the solution you'd like.:

I need to scale out from an on-premise k8s cluster into AWS and I'm not able to get it working. I created an ASG and provided sufficient AWS credentials. When I try to deploy a pod with a node selector matching the ASG labels, no instance is spun up. Moreover, the cluster autoscaler pod is automatically restarted roughly every 15 minutes. Any ideas/suggestions on how to get this working? Many thanks
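For illustration, here is roughly what I am deploying; the label key/value and the pod name below are hypothetical placeholders for the redacted real values, and the ASG carries a matching `k8s.io/cluster-autoscaler/node-template/label/...` tag so the autoscaler can build a template node for scale-from-zero:

```yaml
# Sketch only: hypothetical names stand in for the redacted ASG and labels.
# The ASG is tagged with:
#   k8s.io/cluster-autoscaler/node-template/label/node-group = aws-burst
apiVersion: v1
kind: Pod
metadata:
  name: burst-workload            # hypothetical pod name
spec:
  nodeSelector:
    node-group: aws-burst         # must match the node-template label tag on the ASG
  containers:
    - name: app
      image: nginx                # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: 512Mi
```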

Additional context.: This is the log output that keeps appearing until the pod restarts.

I0314 18:51:56.626978       1 auto_scaling_groups.go:386] Regenerating instance to ASG map for ASGs: [<asg_name>]
I0314 18:51:56.742505       1 auto_scaling_groups.go:154] Registering ASG <asg_name>
I0314 18:51:56.742559       1 aws_wrapper.go:281] 0 launch configurations to query
I0314 18:51:56.742572       1 aws_wrapper.go:282] 1 launch templates to query
I0314 18:51:56.742591       1 aws_wrapper.go:298] Successfully queried 0 launch configurations
I0314 18:51:56.778379       1 aws_wrapper.go:309] Successfully queried 1 launch templates
I0314 18:51:56.778438       1 auto_scaling_groups.go:435] Extracted autoscaling options from "<asg_name>" ASG tags: map[]
I0314 18:51:56.778464       1 aws_manager.go:266] Refreshed ASG list, next refresh after 2023-03-14 18:52:56.778457442 +0000 UTC m=+61.554755784
I0314 18:51:56.778928       1 main.go:305] Registered cleanup signal handler
I0314 18:51:56.779120       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0314 18:51:56.779164       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 18.439µs
I0314 18:52:06.780107       1 static_autoscaler.go:235] Starting main loop
E0314 18:52:06.782121       1 static_autoscaler.go:290] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got rke-2://<node_name>
gregth commented 1 year ago

What happens:

  1. Your cluster autoscaler is configured to work with AWS.
  2. It processes the existing nodes in the cluster and tries to extract specific information from them (in order to build template nodes for its simulations), based on the node.Spec.ProviderID field of the Node object.
  3. Since these nodes are on-prem (using Rancher K8s, if I get it right), the node.Spec.ProviderID does not match the AWS-valid format aws:///<zone>/<name>, thus the autoscaler fails.

I think that currently the cluster autoscaler cannot support multi-platform clusters. I will let someone else provide an authoritative confirmation on this and on whether there are any plans to offer support in future versions.
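To make the mismatch concrete, these are illustrative fragments of the relevant Node field (the values are examples, not taken from this cluster; compare them with `kubectl get node <name> -o yaml`):

```yaml
# Illustrative only: example values, not real cluster data.
# An EC2 node registered by the AWS cloud provider:
spec:
  providerID: aws:///eu-central-1a/i-0123456789abcdef0
---
# An on-prem rke2 node, as reported in the error above:
spec:
  providerID: rke-2://<node_name>
```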

nemcikjan commented 1 year ago

@gregth Thank you for your response. I tried to look into the code and came to the same conclusion: multi-platform clusters, in this case on-prem rke2 and EC2 nodes, are not supported.

ctrox commented 1 year ago

Which version of cluster-autoscaler are you running? This should work now with rke2 nodes on AWS since https://github.com/kubernetes/autoscaler/pull/5361, which should have made it into cluster-autoscaler-1.26.0.
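You can confirm which version is actually running from the image tag in the cluster-autoscaler deployment; as a rough sketch (placeholder flag values, ASG name redacted as above):

```yaml
# Sketch of the cluster-autoscaler container spec; values are placeholders.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.0   # the version under discussion
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=0:5:<asg_name>    # min:max:ASG name
      - --v=4
```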

nemcikjan commented 1 year ago

@ctrox I was running v1.24. I just tried v1.26 but it's still the same. But I think you misunderstood our intentions. We are running rke2 nodes in an on-premise cluster, not using Rancher, just plain rke2, and we want to scale out to run rke2 nodes in AWS, so we are not running rke2 nodes in AWS yet. The question is whether it would help to have an arbiter node running in AWS on EC2.

ctrox commented 1 year ago

@JanNemcik Ah right, I have misunderstood your setup then. Not sure if what you are trying to do is supported yet.

nemcikjan commented 1 year ago

@ctrox do you think this use case makes sense, and is it possible that it will be implemented in the future?

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

voliveira-tmx commented 1 year ago

Hey! Did you manage to make this work?

nemcikjan commented 1 year ago

@voliveira-tmx nope

voliveira-tmx commented 1 year ago

This is a very interesting use case that I wanted to implement. I haven't tried it myself though; I was trying to gather some knowledge first, but I haven't been able to find any practical examples of how to set this up. Any updates on this @ctrox?

ctrox commented 1 year ago

This is not something I'm working on, as it affects the AWS provider of cluster-autoscaler and might even need changes in the core cluster-autoscaler to make this work. I just maintain the rancher provider, which is not involved here; I just thought it was in the beginning.

nemcikjan commented 1 year ago

This shouldn't even be AWS-specific. As @ctrox mentioned, it'd definitely require changes in the core codebase so it could support other vendors as well. I think it's not worth doing it for each provider separately, because it'd require a lot of additional work. Maybe I should close this issue and create a new one with a more generic description of the problem.

Shubham82 commented 11 months ago

/remove-lifecycle stale

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Shubham82 commented 6 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Shubham82 commented 3 months ago

> This shouldn't even be AWS-specific. As @ctrox mentioned, it'd definitely require changes in the core codebase so it could support other vendors as well. I think it's not worth doing it for each provider separately, because it'd require a lot of additional work. Maybe I should close this issue and create a new one with a more generic description of the problem.

@nemcikjan, as mentioned in the quoted comment above, do you plan to open a new issue with a more generic description?

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/autoscaler/issues/5595#issuecomment-2393497520):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.