kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Feature Request: Specify placement group with Hetzner Autoscaler #5919

Closed dominic-p closed 3 weeks ago

dominic-p commented 1 year ago

Which component are you using?: cluster-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.: Hetzner Placement Groups allow you to spread your VMs across different physical hardware, which decreases the probability that several instances fail together.

Describe the solution you'd like.: It would be nice to be able to specify a given placement group for nodes managed by the autoscaler, just like we can currently specify things like the network or SSH key. Maybe a new env variable like HCLOUD_PLACEMENT_GROUP could be introduced.

Describe any alternative solutions you've considered.: Of course, the status quo is fine. This would just make the cluster a bit more robust.
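
For illustration only, here is a rough sketch of how such a variable might be handled, in the same spirit as the existing HCLOUD_NETWORK / HCLOUD_SSH_KEY lookups. My Go is shaky, so treat this as pseudocode; the variable and the helper are hypothetical, not actual autoscaler code:

package hetzner

import (
    "context"
    "fmt"
    "os"

    "github.com/hetznercloud/hcloud-go/v2/hcloud" // exact import path depends on the vendored hcloud-go version
)

// placementGroupFromEnv is a hypothetical helper: it resolves the placement
// group named by the proposed HCLOUD_PLACEMENT_GROUP variable via the
// Hetzner Cloud API, or returns nil if the variable is unset.
func placementGroupFromEnv(client *hcloud.Client) (*hcloud.PlacementGroup, error) {
    name := os.Getenv("HCLOUD_PLACEMENT_GROUP")
    if name == "" {
        return nil, nil
    }
    pg, _, err := client.PlacementGroup.Get(context.Background(), name)
    if err != nil {
        return nil, fmt.Errorf("failed to look up placement group %q: %w", name, err)
    }
    if pg == nil {
        return nil, fmt.Errorf("placement group %q not found", name)
    }
    return pg, nil
}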

apricote commented 10 months ago

Nice suggestion. Placement Groups have a limit of 10 instances, so using them for all your nodes might become a problem.

This can be introduced more nicely in the new JSON format for node group configuration: HCLOUD_CLUSTER_CONFIG

/area provider/hetzner

dominic-p commented 10 months ago

Thanks for the reply! I wasn't aware of the new env variable. Yes, that does seem like a good place to configure it. It's not entirely clear to me how it will interact with the existing variables it's replacing. If both HCLOUD_CLOUD_INIT and HCLOUD_CLUSTER_CONFIG are set, which one will take precedence?

So, if we were to add support for placement groups (acknowledging the 10 node limit), the JSON might look like this?

{
    "nodeConfigs": {
        "placementGroup": "name or ID here"
    }
}

Could a warning be shown in the logs and the setting disabled if a placement group is configured for a node pool that could have more than 10 nodes?

apricote commented 9 months ago

HCLOUD_CLUSTER_CONFIG will override HCLOUD_CLOUD_INIT: https://github.com/kubernetes/autoscaler/blob/81eed9622b009fc3f3f71429f124688c71a4dfc1/cluster-autoscaler/cloudprovider/hetzner/hetzner_node_group.go#L416-L420

The JSON should look like this:

{
  "nodeConfigs": {
    "pool-1": {
      "placementGroup": "name or ID here"
    }
  }
}
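
One usage note, since it is easy to miss: HCLOUD_CLUSTER_CONFIG takes this JSON base64 encoded (see the provider README). A small throwaway program along these lines, with the structs trimmed down to just the proposed field, would produce the value to export:

package main

import (
    "encoding/base64"
    "encoding/json"
    "fmt"
    "log"
)

// Trimmed-down stand-ins for the cluster config, covering only the
// proposed placementGroup field from the example above.
type nodeConfig struct {
    PlacementGroup string `json:"placementGroup,omitempty"`
}

type clusterConfig struct {
    NodeConfigs map[string]nodeConfig `json:"nodeConfigs"`
}

func main() {
    cfg := clusterConfig{
        NodeConfigs: map[string]nodeConfig{
            "pool-1": {PlacementGroup: "name or ID here"},
        },
    }

    raw, err := json.Marshal(cfg)
    if err != nil {
        log.Fatal(err)
    }

    // Print the base64 string to use as the HCLOUD_CLUSTER_CONFIG value.
    fmt.Println(base64.StdEncoding.EncodeToString(raw))
}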

> Could a warning be shown in the logs and the setting disabled if a placement group is configured for a node pool that could have more than 10 nodes?

I would prefer that we fail more loudly by just refusing to start. This is how the config validation is done right now, see klog.Fatalf: https://github.com/kubernetes/autoscaler/blob/81eed9622b009fc3f3f71429f124688c71a4dfc1/cluster-autoscaler/cloudprovider/hetzner/hetzner_cloud_provider.go#L203-L215

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

dominic-p commented 6 months ago

Not stale for me.

apricote commented 6 months ago

@dominic-p are you interested in implementing this? I can help out if you have any questions :)

/remove-lifecycle stale

dominic-p commented 6 months ago

Thanks for following up. I can take a stab at it. Go's not my language, unfortunately, but the change may be trivial enough for me to muddle through. Can you point me in the right direction (e.g. which file(s) I should start looking at)?

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

dominic-p commented 3 months ago

Still not stale for me. Still struggling to find the time to work on it.

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

dominic-p commented 2 months ago

Just checking in on this again.

If someone can point me in the right direction, I can take a stab at a PR for this. As I mentioned before, Go isn't a language I know, but it should be pretty straightforward.

apricote commented 1 month ago

Some steps:

  1. Add a new field PlacementGroup string to the struct NodeConfig in hetzner_manager.go
  2. Add a new field PlacementGroup *hcloud.PlacementGroup to the struct hetznerNodeGroup in hetzner_node_group.go
  3. In hetzner_cloud_provider.go, in BuildHetzner around line 217: if a PlacementGroup is specified in NodeConfigs[spec.name], check with the API that the placement group exists, and add it to the hetznerNodeGroup created in line 219
  4. After the loop over all nodeGroups (line 230): check that, for each unique placement group, the sum of maxSize over all nodeGroups that specify it does not exceed 10 (a rough sketch follows below)
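
To make those steps concrete, here is a rough sketch. The helper names (resolvePlacementGroup, validatePlacementGroupLimits) and the trimmed-down struct layouts are placeholders rather than the provider's real definitions:

package hetzner

import (
    "context"
    "fmt"

    "github.com/hetznercloud/hcloud-go/v2/hcloud" // exact import path depends on the vendored hcloud-go version
)

// Step 1: NodeConfig (hetzner_manager.go) gains a PlacementGroup field
// holding the name or ID from the JSON config. Other fields omitted here.
type NodeConfig struct {
    PlacementGroup string
}

// Step 2: the node group (hetzner_node_group.go) carries the resolved
// placement group. Other fields omitted here.
type hetznerNodeGroup struct {
    maxSize        int
    placementGroup *hcloud.PlacementGroup
}

// Step 3: in BuildHetzner, resolve the configured placement group against
// the API before attaching it to the node group being created.
func resolvePlacementGroup(client *hcloud.Client, cfg NodeConfig) (*hcloud.PlacementGroup, error) {
    if cfg.PlacementGroup == "" {
        return nil, nil // no placement group configured for this pool
    }
    pg, _, err := client.PlacementGroup.Get(context.Background(), cfg.PlacementGroup)
    if err != nil {
        return nil, fmt.Errorf("failed to look up placement group %q: %w", cfg.PlacementGroup, err)
    }
    if pg == nil {
        return nil, fmt.Errorf("placement group %q does not exist", cfg.PlacementGroup)
    }
    return pg, nil
}

// Step 4: after all node groups are built, make sure no placement group
// can be asked to hold more than 10 servers in total.
func validatePlacementGroupLimits(groups []*hetznerNodeGroup) error {
    totals := map[string]int{}
    for _, g := range groups {
        if g.placementGroup == nil {
            continue
        }
        totals[g.placementGroup.Name] += g.maxSize
        if totals[g.placementGroup.Name] > 10 {
            return fmt.Errorf("placement group %q is used by node groups with a combined maxSize of %d, which exceeds the limit of 10", g.placementGroup.Name, totals[g.placementGroup.Name])
        }
    }
    return nil
}

In BuildHetzner, errors from either helper could simply be passed to klog.Fatalf so the autoscaler refuses to start, consistent with the existing config validation.
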
dominic-p commented 1 month ago

Thank you so much for taking the time to lay that out for me. I took a shot at a PR. Let me know what you think.

As I said, Go is not my language, but I think I was able to muddle through the steps you gave me.

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 3 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/autoscaler/issues/5919#issuecomment-2264728760):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.