harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] Harvester Cloud Provider (102.0.2+up0.2.3) w/ Rancher v2.7.11-rc3 (rke v1.27.10, rke v1.26.13) DHCP LoadBalancer Stuck at "Pending" in Rancher #5247

Closed irishgordo closed 2 weeks ago

irishgordo commented 4 months ago

Describe the bug
Harvester Cloud Provider with Rancher v2.7.11-rc3, when used in an RKE1 cluster of N nodes on Harvester v1.2.1, leaves the LoadBalancer service stuck in a "Pending" state in Rancher. Additionally, a taint is initially present on the nodes (even in a cluster with one worker node and one etcd/control-plane/worker node, it appears on both); once the taint is removed, the rancher-webhook deployment succeeds.

Tested Rancher with:


helm repo add rancher-optimus-latest https://charts.optimus.rancher.io/server-charts/bin/chart/latest

Prerequisites for reproducing

Example, Helm-based:

helm install rancher rancher-optimus-latest/rancher \
  --version 2.7.11-rc3 \
  --namespace cattle-system \
  --set hostname=rancher2-test.localnet \
  --set global.cattle.psp.enabled=false \
  --set bootstrapPassword=password1234 \
  --set rancherImageTag=v2.7.11-rc3 \
  --set rancherImage=stgregistry.suse.com/rancher/rancher \
  --set 'extraEnv[0].name=CATTLE_AGENT_IMAGE' \
  --set 'extraEnv[0].value=stgregistry.suse.com/rancher/rancher-agent:v2.7.11-rc3' \
  --set replicas=1 \
  --debug

Example, Docker-based:

docker run --privileged -d --name=rancher --restart=unless-stopped -p 8080:80 -p 6443:443 -e CATTLE_AGENT_IMAGE=stgregistry.suse.com/rancher/rancher-agent:v2.7.11-rc3 stgregistry.suse.com/rancher/rancher:v2.7.11-rc3

To Reproduce
Steps to reproduce the behavior:

  1. Import the Harvester cluster into Rancher.
  2. Add a VM Network and a cloud image for Harvester (use ubuntu-focal-current KVM optimized or ubuntu-jammy-current KVM optimized).
  3. Create an RKE node template; under Engine Options -> Storage, make sure to select overlay2.
  4. Build RKE with either v1.27.10 or v1.26.13 (the issue is reproducible with both).
  5. Remove the node taint in the RKE cluster:
    taints:
    - effect: NoSchedule
      key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"
  6. Allow the rancher-webhook deployment to finish.
  7. Install the Harvester Cloud Provider chart.
  8. Create a Deployment running nginx:latest with pod labels such as service: nginx.
  9. Go to Service Discovery -> Create, build a LoadBalancer Service with DHCP, and make sure to set the selectors (a CLI sketch of steps 5, 8, and 9 follows this list).
  10. You'll notice the DHCP LoadBalancer Service fails to move past the "Pending" state. The load balancer is created in Harvester but hangs in Rancher, never resolving.
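For reference, a minimal CLI sketch of steps 5, 8, and 9. The resource name my-nginx is illustrative, and the DHCP annotation key is the one documented for the Harvester cloud provider (treat it as an assumption here; the original report used the Rancher UI):

# Step 5: remove the uninitialized taint from all nodes.
kubectl taint nodes --all node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-

# Step 8: create a Deployment running nginx:latest.
kubectl create deployment my-nginx --image=nginx:latest

# Step 9: expose it as a LoadBalancer Service; kubectl expose reuses the
# Deployment's selector, and DHCP IPAM is requested via the annotation.
kubectl expose deployment my-nginx --port=80 --target-port=80 --type=LoadBalancer
kubectl annotate service my-nginx cloudprovider.harvesterhci.io/ipam=dhcp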

Expected behavior
The LoadBalancer service provisioned through the Harvester Cloud Provider succeeds in Rancher and does not hang in the "Pending" state.

Support bundle
supportbundle_c5f6a4f6-7dd0-4c79-88eb-6fef88a6cb09_2024-02-28T23-34-09Z.zip

Environment

Additional context
[Screenshot from 2024-02-28 15-35-39]

Errors Noticed:

EnsuringLoadBalancer    Service anothernginx    Ensuring load balancer    anothernginx.17b82a1a7da4b7d8    Wed, Feb 28 2024 3:19:19 pm
EnsuredLoadBalancer     Service anothernginx    Ensured load balancer     anothernginx.17b82a1c289ab8c7    Wed, Feb 28 2024 3:19:19 pm
SyncLoadBalancerFailed  Service anothernginx    Error syncing load balancer: failed to ensure load balancer: update load balancer IP of service default/anothernginx failed, error: Operation cannot be fulfilled on services "anothernginx": the object has been modified; please apply your changes to the latest version and try again    anothernginx.17b82a1ac0e6db80    Wed, Feb 28 2024 3:19:12 pm

@noahgildersleeve @khushboo-rancher @TachunLin

khushboo-rancher commented 4 months ago

Error from events

SyncLoadBalancerFailed  Service anothernginx    Error syncing load balancer: failed to ensure load balancer: update load balancer IP of service default/anothernginx failed, error: Operation cannot be fulfilled on services "anothernginx": the object has been modified; please apply your changes to the latest version and try again   anothernginx.17b82a1ac0e6db80   Wed, Feb 28 2024  3:19:12 pm
irishgordo commented 4 months ago

For contrast, this is not reproducible on:

[Screenshot from 2024-02-28 16-28-18]

cc: @khushboo-rancher @bk201 @starbops

irishgordo commented 4 months ago

Additionally for contrast, this is not an issue in:

[Screenshot from 2024-02-28 16-52-28]

starbops commented 4 months ago

Since harvester-cloud-provider 0.2.0, we have introduced kube-vip as a dependency. Without kube-vip running on the guest cluster, the underlying DHCP negotiation does not happen, and the LoadBalancer-type Service will be stuck in the Pending state.

$ kubectl get ds kube-vip -n kube-system
NAME       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                AGE
kube-vip   0         0         0       0            0           node-role.kubernetes.io/control-plane=true   137m

The kube-vip DaemonSet is there, but no Pods are scheduled due to the misconfigured nodeSelector. The Node object has the label node-role.kubernetes.io/controlplane: "true" (note the missing hyphen):

    labels:
      beta.kubernetes.io/arch: amd64
      beta.kubernetes.io/os: linux
      cattle.io/creator: norman
      kubernetes.io/arch: amd64
      kubernetes.io/hostname: v126-opensuse1
      kubernetes.io/os: linux
      node-role.kubernetes.io/controlplane: "true"
      node-role.kubernetes.io/etcd: "true"
      node-role.kubernetes.io/worker: "true"
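
A quick way to see the mismatch (a diagnostic sketch; the DaemonSet name and namespace are taken from the kubectl get ds output above):

# Print kube-vip's nodeSelector (it uses the hyphenated control-plane key) ...
kubectl -n kube-system get ds kube-vip -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
# ... and compare it with the RKE1 node labels (controlplane, no hyphen).
kubectl get nodes --show-labels | grep controlplane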

To work around the issue, we need to manually patch the kube-vip DaemonSet, removing the - character from the nodeSelector key (node-role.kubernetes.io/control-plane becomes node-role.kubernetes.io/controlplane) so that it matches the node's label. After that, the kube-vip Pods will be scheduled and running on the cluster.
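
For example, a minimal sketch of such a patch, assuming the DaemonSet is named kube-vip in the kube-system namespace as shown above (with a JSON merge patch, setting the old key to null removes it from the selector):

kubectl -n kube-system patch daemonset kube-vip --type=merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-role.kubernetes.io/control-plane":null,"node-role.kubernetes.io/controlplane":"true"}}}}}'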

Note: This only happens to RKE1 guest clusters, because their nodes carry the label node-role.kubernetes.io/controlplane: "true" (without the hyphen), which cannot satisfy kube-vip's default nodeSelector node-role.kubernetes.io/control-plane: "true".

starbops commented 4 months ago

For users who want to install the harvester-cloud-provider chart on an RKE1 cluster, you can provide the following values to unset the default nodeSelector and add a new one with the correct key:

# before
kube-vip:
  nodeSelector:
    node-role.kubernetes.io/control-plane: "true"

# after
kube-vip:
  nodeSelector:
    node-role.kubernetes.io/control-plane: null
    node-role.kubernetes.io/controlplane: "true"

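Applied with Helm, this could look like the following sketch (the repository name harvester and the kube-system namespace are assumptions; in Rancher's UI the same values can be pasted into the chart's values editor):

# Write the RKE1-specific override and install/upgrade the chart with it.
cat > rke1-values.yaml <<'EOF'
kube-vip:
  nodeSelector:
    node-role.kubernetes.io/control-plane: null
    node-role.kubernetes.io/controlplane: "true"
EOF

helm upgrade --install harvester-cloud-provider harvester/harvester-cloud-provider \
  --namespace kube-system -f rke1-values.yaml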

w13915984028 commented 4 months ago

FYI, an update on the chart history:

harvester-cloud-provider has two main releases: 0.1.14 and 0.2.3.

The old 0.1.14 works with Harvester v1.2.1 but lacks the latest features and bug fixes.

v0.2.3 is a minor fix on top of v0.2.2; it is the target chart for Harvester v1.2.1, and its usage is documented at: https://docs.harvesterhci.io/v1.2/rancher/cloud-provider

Harvester-cloud-provider releases:

https://github.com/harvester/charts/releases/tag/harvester-cloud-provider-0.1.14
github-actions released this Jan 11, 2023

https://github.com/harvester/charts/releases/tag/harvester-cloud-provider-0.2.2
github-actions released this Jun 16, 2023

https://github.com/harvester/charts/releases/tag/harvester-cloud-provider-0.2.3
github-actions released this Jan 15, 2024

Harvester v1.2.1 release:

https://github.com/harvester/harvester/releases/tag/v1.2.1
rancherio-gh-m released this Oct 26, 2023

Rancher-v2.7 chart index:


https://github.com/rancher/charts/blob/dev-v2.7/index.yaml
harvester-cloud-provider:
  - annotations:
      catalog.cattle.io/upstream-version: 0.1.14
    apiVersion: v2
    appVersion: v0.1.5
    created: "2023-05-17T18:41:41.990313+08:00"
    ...
  - annotations:
      catalog.cattle.io/upstream-version: 0.2.3
    apiVersion: v2
    appVersion: v0.2.0
    created: "2024-01-19T14:56:50.015207836+01:00"
    ...
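
To check which chart versions Helm sees locally, something like the following works (assuming the relevant chart repository has already been added; the repository itself is not specified here):

# List all harvester-cloud-provider chart versions known to the configured repos.
helm search repo harvester-cloud-provider --versions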
harvesterhci-io-github-bot commented 4 months ago

Pre Ready-For-Testing Checklist

khushboo-rancher commented 2 weeks ago

This is validated as working with Cloud Provider 0.2.4 and RKE1 v1.27.