Open apricote opened 2 weeks ago
/assign
/triage accepted
Update from internal conversations:
We plan to implement a workaround for known current users of cluster-autoscaler. This will still cause issues for new users unless a new version of cluster-autoscaler is released.
The workaround will only be available for ~2 weeks after new releases are cut.
We will inform impacted customers about this so they can update before it starts breaking.
The Hetzner provider in current versions of cluster-autoscaler has a bug and relies on the CX11 server type, which we will remove from our ordering options on 6 September 2024.
If you try to use the cluster-autoscaler provider after that date, you will see the following error messages:
mixed_nodeinfos_processor.go:160] Unable to build proper template node for draining-node-pool: failed to create resource list for node group draining-node-pool error: failed to get machine type cx11 info error: server type not found
static_autoscaler.go:387] Failed to get node infos for groups: failed to create resource list for node group draining-node-pool error: failed to get machine type cx11 info error: server type not found
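To check whether a running deployment is hitting this bug, the log lines above can be matched mechanically. The following is a minimal sketch in Go, assuming the error text stays exactly as quoted here; the helper name missingServerType is hypothetical and not part of cluster-autoscaler.

```go
package main

import (
	"fmt"
	"regexp"
)

// missingTypeRe matches the error fragment quoted in this issue and captures
// the name of the server type that could not be resolved.
var missingTypeRe = regexp.MustCompile(`failed to get machine type (\S+) info error: server type not found`)

// missingServerType (hypothetical helper) returns the server type named in a
// log line, or "" and false if the line does not contain the error.
func missingServerType(line string) (string, bool) {
	m := missingTypeRe.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	line := `static_autoscaler.go:387] Failed to get node infos for groups: failed to create resource list for node group draining-node-pool error: failed to get machine type cx11 info error: server type not found`
	if name, ok := missingServerType(line); ok {
		fmt.Printf("affected: server type %q could not be resolved\n", name)
	}
}
```

If the captured name is cx11 and the node group is draining-node-pool, the deployment is affected by the bug described in this issue.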
The following versions of cluster-autoscaler are affected:
We depend on the Kubernetes community and the maintainers of cluster-autoscaler to release new versions. We expect new official versions to be released at the end of September.
To bridge the gap until the Kubernetes community releases the new versions, we published alternative container images of cluster-autoscaler that include a patch for the bug. You can use these in your deployment, but we will remove them one month after new official cluster-autoscaler versions become available. We will not provide any other patch releases on this container image repository. Please switch back to the official images as soon as possible.
docker.io/hetznercloud/cluster-autoscaler:v1.28.6-hcloud1 (Build Commit)
docker.io/hetznercloud/cluster-autoscaler:v1.29.4-hcloud1 (Build Commit)
docker.io/hetznercloud/cluster-autoscaler:v1.30.2-hcloud1 (Build Commit)
docker.io/hetznercloud/cluster-autoscaler:v1.31.0-hcloud1 (Build Commit)
To prevent disruptions for existing users of the provider, we will keep the CX11 server type available for these accounts. We will remove that prolonged access to the CX11 server type two weeks after the Kubernetes community releases new versions of cluster-autoscaler.
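Picking the right temporary image for a deployment could look roughly like the sketch below. The image references are the four listed above; the helper patchedImageFor and the choice to match on the minor version only are assumptions made for illustration, so verify that the patch version of the temporary image suits your cluster before switching.

```go
package main

import (
	"fmt"
	"strings"
)

// patchedImages holds the temporary images named in this issue, keyed by
// cluster-autoscaler minor version.
var patchedImages = map[string]string{
	"1.28": "docker.io/hetznercloud/cluster-autoscaler:v1.28.6-hcloud1",
	"1.29": "docker.io/hetznercloud/cluster-autoscaler:v1.29.4-hcloud1",
	"1.30": "docker.io/hetznercloud/cluster-autoscaler:v1.30.2-hcloud1",
	"1.31": "docker.io/hetznercloud/cluster-autoscaler:v1.31.0-hcloud1",
}

// patchedImageFor (hypothetical helper) maps a version tag such as "v1.30.1"
// to the temporary image for the same minor version. It returns false when
// no temporary image exists for that minor version.
func patchedImageFor(tag string) (string, bool) {
	parts := strings.SplitN(strings.TrimPrefix(tag, "v"), ".", 3)
	if len(parts) < 2 {
		return "", false
	}
	img, ok := patchedImages[parts[0]+"."+parts[1]]
	return img, ok
}

func main() {
	for _, tag := range []string{"v1.30.1", "v1.27.3"} {
		if img, ok := patchedImageFor(tag); ok {
			fmt.Printf("%s: use %s until an official fixed release is available\n", tag, img)
		} else {
			fmt.Printf("%s: no temporary image listed in this issue\n", tag)
		}
	}
}
```

Because the temporary images will be removed again, any such substitution should be reverted to the official image once a fixed upstream release exists.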
Which component are you using?:
cluster-autoscaler Hetzner provider
/area provider/hetzner /area cluster-autoscaler
What version of the component are you using?:
Component version: All current versions
What k8s version are you using (kubectl version)?:
Does not matter
What environment is this in?:
Hetzner Cloud
What did you expect to happen?:
The Hetzner Cloud provider should continue to work after 2024-09-06.
What happened instead?:
The Hetzner Cloud provider will stop working on 2024-09-06.
How to reproduce it (as minimally and precisely as possible):
xyz123)
Observe error messages:
mixed_nodeinfos_processor.go:160] Unable to build proper template node for draining-node-pool: failed to create resource list for node group draining-node-pool error: failed to get machine type xyz123 info error: server type not found
static_autoscaler.go:387] Failed to get node infos for groups: failed to create resource list for node group draining-node-pool error: failed to get machine type xyz123 info error: server type not found
Anything else we need to know?:
The server type cx11 was deprecated on 2024-06-06. It will be removed from the API on 2024-09-06: https://docs.hetzner.cloud/changelog#2024-06-06-old-server-types-with-shared-intel-vcpus-are-deprecated
The server type is hardcoded for a draining-node-pool, which is not actually used anywhere in the provider. It is only added to the list of known node pools.
Two options:
Replace cx11 with the replacement type cx22
This is minimally invasive, but has the same problem: we are hardcoding a value that might change or be deprecated.
Remove draining-node-pool completely from the code
This feels like the clean choice, as this node pool is not used internally. However, this is a user-visible change (the node pool will disappear from the status config map), so I am not sure if we can backport this to previous releases.
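To make the trade-off concrete, here is a simplified Go model of the failure and of both options. It is not the provider's actual code: the nodeGroup type, the buildNodeInfos function, the pool-a group, and the availability map are all hypothetical stand-ins, and the assumption that one unresolvable server type fails the whole pass is inferred from the "Failed to get node infos for groups" message.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeGroup is a simplified, hypothetical stand-in for a provider node group.
type nodeGroup struct {
	id         string
	serverType string
}

var errServerTypeNotFound = errors.New("server type not found")

// buildNodeInfos models a pass that resolves the server type of every known
// node group and aborts on the first one the API does not know.
func buildNodeInfos(groups []nodeGroup, available map[string]bool) error {
	for _, g := range groups {
		if !available[g.serverType] {
			return fmt.Errorf("failed to create resource list for node group %s error: failed to get machine type %s info error: %w",
				g.id, g.serverType, errServerTypeNotFound)
		}
	}
	return nil
}

func main() {
	// Illustrative API state after cx11 has been removed.
	available := map[string]bool{"cx22": true}

	// Current behavior: the unused pool carries the hardcoded type cx11.
	current := []nodeGroup{{"pool-a", "cx22"}, {"draining-node-pool", "cx11"}}
	// Option 1: keep the pool, swap the hardcoded type to cx22.
	option1 := []nodeGroup{{"pool-a", "cx22"}, {"draining-node-pool", "cx22"}}
	// Option 2: drop the unused pool entirely.
	option2 := []nodeGroup{{"pool-a", "cx22"}}

	fmt.Println("current:", buildNodeInfos(current, available))
	fmt.Println("option 1:", buildNodeInfos(option1, available))
	fmt.Println("option 2:", buildNodeInfos(option2, available))
}
```

In this model both options succeed today, but option 1 fails again as soon as cx22 is removed from the availability map, while option 2 no longer depends on any hardcoded server type.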