rancher / rke2

https://docs.rke2.io/
Apache License 2.0

ServiceLB is broken when rke2-cloud-provider is on 1.29 but k8s version is <=1.28 #5882

Open xyzzyz opened 2 months ago

xyzzyz commented 2 months ago

Environmental Info: RKE2 Version: v1.28.9+rke2r1

Node(s) CPU architecture, OS, and Version: Linux dev 4.18.0-552.el8.x86_64 #1 SMP Sun Apr 7 19:39:51 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux, CentOS 8

Cluster Configuration: 1 server, 0 agents

Describe the bug: ServiceLB fails to come up

Steps To Reproduce:

Expected behavior: ServiceLB Load Balancer comes up

Actual behavior: LoadBalancer Service is stuck pending, with the following in events:

  Warning  SyncLoadBalancerFailed  1s (x6 over 2m36s)     service-controller  Error syncing load balancer: failed to ensure load balancer: failed to create kube-system/svclb-contour-envoy-b1fbfa01 apps/v1, Kind=DaemonSet for  default/contour-envoy: DaemonSet.apps "svclb-contour-envoy-b1fbfa01" is invalid: [spec.template.spec.containers[0].env[4].valueFrom.fieldRef: Forbidden: may not be set when feature gate 'PodHostIPs' is not enabled, spec.template.spec.containers[1].env[4].valueFrom.fieldRef: Forbidden: may not be set when feature gate 'PodHostIPs' is not enabled]

Additional context / logs: This is caused by a version mismatch between rke2-cloud-provider and the Kubernetes apiserver. rke2-cloud-provider decides whether to use the HostIPs fieldRef based on the feature gate defaults of the Kubernetes version it was compiled against. If that version is 1.29, rke2-cloud-provider assumes PodHostIPs is available, but if the cluster is actually running Kubernetes 1.28, the gate is not enabled by default, so the apiserver rejects the DaemonSet.
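
The mismatch can be modelled in a few lines. This is a minimal Python sketch, not the actual rke2-cloud-provider code: the function names and the defaults table are hypothetical, and it assumes PodHostIPs is off by default in 1.28 and on by default in 1.29, and that the gated field path is `status.hostIPs`.

```python
# Illustrative model only; these names are hypothetical, not rke2-cloud-provider APIs.

# Assumed default state of the PodHostIPs feature gate per Kubernetes minor version.
POD_HOST_IPS_DEFAULT = {"1.28": False, "1.29": True}


def field_path_chosen_by_ccm(ccm_built_against: str) -> str:
    """Field path the CCM would put in the svclb env, judged by its own build version."""
    if POD_HOST_IPS_DEFAULT[ccm_built_against]:
        return "status.hostIPs"
    return "status.hostIP"


def apiserver_accepts(field_path: str, apiserver_version: str) -> bool:
    """The apiserver validates against its own gate state, not the CCM's."""
    if field_path == "status.hostIPs":
        return POD_HOST_IPS_DEFAULT[apiserver_version]
    return True


if __name__ == "__main__":
    # CCM built against 1.29 talking to a 1.28 apiserver: the DaemonSet is rejected.
    path = field_path_chosen_by_ccm("1.29")
    print(path, apiserver_accepts(path, "1.28"))
    # Matching versions: the DaemonSet is accepted.
    print(apiserver_accepts(field_path_chosen_by_ccm("1.28"), "1.28"))
```

The point of the model is that the two sides consult different defaults, so the request is only valid when both happen to agree.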

Three weeks ago, cloud-provider was bumped to 1.29 on the 1.28 and 1.27 release branches; see e.g. this. Indeed:

$ kubectl version
Client Version: v1.28.9
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.9+rke2r1
$ kubectl get pods -n kube-system -A -o yaml | grep image: | grep 1.29
      image: index.docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240412
      image: docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240412
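
The same comparison can be scripted instead of eyeballed. A rough Python sketch, assuming the server version and image reference are supplied as plain strings in the formats shown above; `minor_version` and `is_skewed` are hypothetical helper names, not part of any RKE2 tooling.

```python
import re


def minor_version(text: str) -> tuple[int, int]:
    """Extract (major, minor) from the first vX.Y.Z in a version string or image reference."""
    match = re.search(r"v(\d+)\.(\d+)\.\d+", text)
    if match is None:
        raise ValueError(f"no vX.Y.Z version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def is_skewed(server_version: str, image: str) -> bool:
    """True when the image was built for a different Kubernetes minor than the server."""
    return minor_version(server_version) != minor_version(image)


if __name__ == "__main__":
    server = "v1.28.9+rke2r1"
    image = "docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240412"
    print(minor_version(server), minor_version(image), is_skewed(server, image))
```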

To fix this, rke2-cloud-provider should obtain the cluster's actual feature gate state instead of relying on the default for the Kubernetes version it was built against.
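
One possible shape for that lookup, as a Python sketch under stated assumptions: the thread does not say how rke2-cloud-provider would learn the cluster's gate state, so this assumes it is available as a `Name=bool` comma-separated feature-gates string. `parse_feature_gates` and `gate_enabled` are hypothetical names, and only the literal values `true` and `false` are accepted here as a simplification.

```python
def parse_feature_gates(value: str) -> dict[str, bool]:
    """Parse a 'Name=true,Other=false' feature-gates string into a dict."""
    gates: dict[str, bool] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        name, sep, state = item.partition("=")
        state = state.strip().lower()
        if not sep or state not in ("true", "false"):
            raise ValueError(f"malformed feature gate entry: {item!r}")
        gates[name.strip()] = state == "true"
    return gates


def gate_enabled(name: str, explicit: dict[str, bool], compiled_default: bool) -> bool:
    """Prefer the explicitly configured state; fall back to the compiled-in default."""
    return explicit.get(name, compiled_default)


if __name__ == "__main__":
    explicit = parse_feature_gates("PodHostIPs=false")
    # The explicit setting wins over a 1.29-style default of True.
    print(gate_enabled("PodHostIPs", explicit, compiled_default=True))
```

With no explicit entry, `gate_enabled` returns the compiled-in default, which is the behaviour that causes the failure described in this issue.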

brandond commented 2 months ago

Yeah, the problem is that the feature gate is not set explicitly by RKE2, so the CCM uses the default for the Kubernetes version it is built against, and the K3s CCM is built against 1.29 (as you noted).

The best current workaround is to set this in your config.yaml:

kube-cloud-controller-manager-arg:
  - 'feature-gates=PodHostIPs=false'

rancher-max commented 1 month ago

I'm not sure this has changed at all in these releases. Is this meant to be in Working status still, not To Test? @brandond

I checked on the release-1.28 branch at commit 2c90f3baa0dbd555d5972db542421f3d2cded7b5, and still see the following:

$ kubectl get pods -n kube-system -A -o yaml | grep image: | grep 1.29
      image: index.docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515
      image: docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515
      image: index.docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515
      image: docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515
      image: index.docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515
      image: docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515

Also, I will note that apart from this I'm not able to reproduce the issue exactly as described. For my steps, I need to include enable-servicelb: true in the config.yaml and NOT disable the cloud-controller; otherwise the cluster either doesn't come up correctly or no svclb pod is created when creating a Service of type LoadBalancer.

brandond commented 1 month ago

No, sorry, I think I moved this over by accident. This can go back to next up until July.