rancher / fleet

Deploy workloads from Git to large fleets of Kubernetes clusters
https://fleet.rancher.io/
Apache License 2.0

Rancher new cluster node registration failing #2053

Open P-n-I opened 8 months ago

P-n-I commented 8 months ago

Is there an existing issue for this?

Current Behavior

When trying to register a new node with a new downstream RKE2 cluster in Rancher 2.7.9 (also 2.7.5), we see that the node's plan Secret is never populated, so the rancher-system-agent endlessly polls for a plan.

If we re-deploy the fleet-agent Deployment prior to creating the new downstream cluster definition in Rancher we can occasionally register nodes.

We have to re-deploy fleet-agent each time we need to create a new cluster, though this does not consistently work around the issue.

If the registration fails or we need to re-create the cluster, we wipe the nodes, delete the cluster from Rancher, and repeat the steps above.
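
A quick way to confirm the symptom on the Rancher management cluster (a sketch assuming kubectl access; the machine-plan Secret names are machine-specific) is to list the plan Secrets by type and check the DATA column:

kubectl -n fleet-default get secrets --field-selector type=rke.cattle.io/machine-plan

A DATA count of 0 means no plan has been written for that machine yet.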

From the fleet-controller logs when creating the downstream cluster named "test":

2024-01-09T14:14:27.714641430Z time="2024-01-09T14:14:27Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-test-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"

The workaround of restarting the fleet-agent is not consistent; sometimes repeated manual loops of create cluster, register, delete cluster eventually work.

Registration of nodes to k3s clusters appears to work, though I've not tested that as much.

Expected Behavior

We can register nodes to newly created downstream clusters.

Steps To Reproduce

Environment

- Architecture: x86_64
- Fleet Version: 1.7.1 and 1.8.1
- Cluster:
  - Provider: rke2
  - Options:
  - Kubernetes Version: v1.26.11+rke2r1

Logs

Logs from fleet-agent after a restart followed by a failed node registration:

I0109 14:34:16.884697       1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-local-system/fleet-agent-lock...
2024-01-09T14:34:20.761215643Z I0109 14:34:20.760567       1 leaderelection.go:258] successfully acquired lease cattle-fleet-local-system/fleet-agent-lock
2024-01-09T14:34:21.514842587Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
2024-01-09T14:34:21.515239711Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=Secret controller"
2024-01-09T14:34:21.515651076Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=Node controller"
2024-01-09T14:34:21.515921289Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
2024-01-09T14:34:22.245467409Z E0109 14:34:22.245355       1 memcache.go:206] couldn't get resource list for management.cattle.io/v3: 
time="2024-01-09T14:34:22Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=BundleDeployment controller"
time="2024-01-09T14:34:22Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:22Z" level=info msg="getting history for release fleet-agent-local"
time="2024-01-09T14:34:22Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:23Z" level=info msg="Deleting orphan bundle ID rke2, release kube-system/rke2-canal"
time="2024-01-09T14:34:24Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:25Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"

Logs from fleet-agent after a restart, followed by creating a new cluster and a successful registration:

I0109 14:37:40.958163       1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-local-system/fleet-agent-lock...
2024-01-09T14:37:44.767848536Z I0109 14:37:44.767654       1 leaderelection.go:258] successfully acquired lease cattle-fleet-local-system/fleet-agent-lock
2024-01-09T14:37:45.799901278Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
2024-01-09T14:37:45.799938559Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=Secret controller"
2024-01-09T14:37:45.799944609Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=Node controller"
2024-01-09T14:37:45.799949489Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
E0109 14:37:45.966607       1 memcache.go:206] couldn't get resource list for management.cattle.io/v3: 
2024-01-09T14:37:45.991817525Z time="2024-01-09T14:37:45Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=BundleDeployment controller"
2024-01-09T14:37:45.992046547Z time="2024-01-09T14:37:45Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:46.002690980Z time="2024-01-09T14:37:46Z" level=info msg="getting history for release fleet-agent-local"
2024-01-09T14:37:46.255440243Z time="2024-01-09T14:37:46Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:47.041131051Z time="2024-01-09T14:37:47Z" level=info msg="Deleting orphan bundle ID rke2, release kube-system/rke2-canal"
2024-01-09T14:37:48.276516222Z time="2024-01-09T14:37:48Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:48.527326573Z time="2024-01-09T14:37:48Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"

Anything else?

Ref https://github.com/rancher/rancher/issues/43901 specifically https://github.com/rancher/rancher/issues/43901#issuecomment-1881021356

P-n-I commented 8 months ago

From the logs when creating the cluster in Rancher:

fleet-agent

W0110 08:31:07.744207       1 reflector.go:442] pkg/mod/github.com/rancher/client-go@v0.24.0-fleet1/tools/cache/reflector.go:167: watch of *v1alpha1.BundleDeployment ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 7; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

fleet-controller

time="2024-01-10T08:31:02Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-dev-sandbox-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"

P-n-I commented 8 months ago

Contents of the cluster's mcc bundle Chart.yaml:

annotations:
  catalog.cattle.io/certified: rancher
  catalog.cattle.io/hidden: "true"
  catalog.cattle.io/kube-version: '>= 1.23.0-0 < 1.27.0-0'
  catalog.cattle.io/namespace: cattle-system
  catalog.cattle.io/os: linux
  catalog.cattle.io/permits-os: linux,windows
  catalog.cattle.io/rancher-version: '>= 2.7.0-0 < 2.8.0-0'
  catalog.cattle.io/release-name: system-upgrade-controller
apiVersion: v1
appVersion: v0.11.0
description: General purpose controller to make system level updates to nodes.
home: https://github.com/rancher/system-charts/blob/dev-v2.7/charts/rancher-k3s-upgrader
kubeVersion: '>= 1.23.0-0'
name: system-upgrade-controller
sources:
- https://github.com/rancher/system-charts/blob/dev-v2.7/charts/rancher-k3s-upgrader
version: 102.1.0+up0.5.0

The downstream cluster we're seeing the issue with is v1.26.11+rke2r1.

P-n-I commented 7 months ago

Debug log output from the fleet-controller when creating a new downstream cluster:


time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:09Z" level=debug msg="shorten bundle name test-managed-system-agent to test-managed-system-agent"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent' took 32.236433ms"
time="2024-01-11T11:08:09Z" level=debug msg="OnPurgeOrphaned for bundle 'test-managed-system-agent' change, checking if gitrepo still exists"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent' took 183.411µs"
time="2024-01-11T11:08:09Z" level=debug msg="OnPurgeOrphaned for bundle 'test-managed-system-agent' change, checking if gitrepo still exists"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:10Z" level=debug msg="shorten bundle name mcc-test-managed-system-upgrade-controller to mcc-test-managed-system-upgrade-controller"
time="2024-01-11T11:08:10Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-test-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller' took 5.27411ms"
time="2024-01-11T11:08:10Z" level=debug msg="OnPurgeOrphaned for bundle 'mcc-test-managed-system-upgrade-controller' change, checking if gitrepo still exists"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller' took 289.752µs"
time="2024-01-11T11:08:10Z" level=debug msg="OnPurgeOrphaned for bundle 'mcc-test-managed-system-upgrade-controller' change, checking if gitrepo still exists"

P-n-I commented 7 months ago

I don't know golang at all, but I've been digging around trying to find out if there's something wrong in our clusters.

tag: release/v0.8.1+security1

The call chain appears to be: controller.OnBundleChange → controller.setResourceKey → helmdeployer.Template (which sets Helm defaults, including useGlobalCfg: true and globalCfg.Capabilities = chartutil.DefaultCapabilities) → Helm.Deploy → Helm.install → Helm.getCfg (which returns globalCfg if useGlobalCfg is set).

So at this point it's using the globalCfg, which has the default (1.20.0) as the kubeVersion in Capabilities; therefore Helm.install doesn't execute cfg.RESTClientGetter.ToRESTMapper().

I can't find useGlobalCfg being set to true anywhere other than in Template, so I think it's unset when called via agent.manager and is therefore the bool default: false.
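
As a standalone illustration of where the v1.20.0 comes from, here's a minimal sketch that calls Helm v3's chartutil package directly (this is not Fleet's actual code path, just the same default capabilities and compatibility check it ends up running):

package main

import (
    "fmt"

    "helm.sh/helm/v3/pkg/chartutil"
)

func main() {
    // Helm's fallback capabilities, used when no live cluster is queried;
    // its KubeVersion is hard-coded to v1.20.0.
    caps := chartutil.DefaultCapabilities
    fmt.Println(caps.KubeVersion.String()) // v1.20.0

    // The kubeVersion constraint from the chart's Chart.yaml, checked the
    // same way Helm's install/template path checks it.
    constraint := ">= 1.23.0-0"
    if !chartutil.IsCompatibleRange(constraint, caps.KubeVersion.String()) {
        fmt.Printf("chart requires kubeVersion: %s which is incompatible with Kubernetes %s\n",
            constraint, caps.KubeVersion.String())
    }
}

So whenever the global config is used without overriding Capabilities with the real downstream version (v1.26.11+rke2r1 here), this check rejects the chart.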

P-n-I commented 7 months ago

I've done some local hacking of the code to add some logging and changed the fleet-controller Deployment to use our in-house hacked version.

This is the output when creating a new cluster called bobbins:

time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange bobbins-managed-system-agent"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange bobbins-managed-system-agent matchedTargets 0"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller matchedTargets 0"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller calling setResourceKey"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 in setResourceKey mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template with useGlobalCg : true"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template patched with useGlobalCg : true"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template calling Deploy"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.Deploy"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.install for bundle mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.install cfg kubeversion v1.20.0"
time="2024-01-12T15:22:34Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-bobbins-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"

P-n-I commented 7 months ago

I hacked the fleet code to work around the "chart requires kubeVersion: >= 1.23.0-0" issue, built a new fleet-controller container, and updated the fleet-controller Deployment on our dev cluster to run it, but it has made no difference to the problem of the machine-plan Secret not being populated with data.

That unrelated kubeVersion issue relates to the bundle mcc-<cluster>-managed-system-upgrade-controller.

The issue remains that the node's custom-<id>-machine-plan Secret doesn't get populated, so rancher-system-agent endlessly polls Rancher.

P-n-I commented 7 months ago

rancher-system-agent output with CATTLE_AGENT_LOGLEVEL=debug

Jan 17 15:34:23 packer systemd[1]: Started Rancher System Agent.
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=debug msg="Instantiated new image utility with imagesDir: /var/lib/rancher/agent/images, imageCredentialProviderConfig: /var/lib/rancher/credentialprovider/config.yaml, imageCredentialProviderBinDir: /var/lib/rancher/credentialprovider/bin, agentRegistriesFile: /etc/rancher/agent/registries.yaml"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Starting remote watch of plans"
Jan 17 15:34:27 packer rancher-system-agent[18569]: E0117 15:34:27.619141   18569 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Jan 17 15:34:27 packer rancher-system-agent[18569]: time="2024-01-17T15:34:27Z" level=info msg="Starting /v1, Kind=Secret controller"
Jan 17 15:34:27 packer rancher-system-agent[18569]: time="2024-01-17T15:34:27Z" level=debug msg="[K8s] Processing secret custom-aede8c2b641f-machine-plan in namespace fleet-default at generation 0 with resource version 48393246"

and

k -n fleet-default get secret custom-aede8c2b641f-machine-plan
NAME                               TYPE                         DATA   AGE
custom-aede8c2b641f-machine-plan   rke.cattle.io/machine-plan   0      101s

rgomez-eng commented 7 months ago

I'm having the exact same issue. Is there any workaround to get past this issue? Or maybe any specific version to use?

P-n-I commented 7 months ago

I've not found a workaround; sometimes a registration works, but mostly it gets stuck on the empty machine-plan for us.

P-n-I commented 7 months ago

@rgomez-eng a long shot, but: are you registering the node(s) with all three roles, or do the problematic nodes have only a subset of the etcd, controlplane and worker roles? Check the logs from the Rancher pods for occurrences of

[INFO] [planner] rkecluster fleet-default/<CLUSTER NAME>: waiting for at least one control plane, etcd, and worker node to be registered

That message implies a node with one of those roles isn't registered yet. Until each of the three roles is fulfilled by at least one registered node, the cluster is not considered 'sane' and no node plan is delivered, so the rancher-system-agent endlessly polls for the plan Secret.
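
A quick way to check for that message (assuming a Helm-installed Rancher, whose pods carry the app=rancher label in cattle-system):

kubectl -n cattle-system logs -l app=rancher --tail=-1 | grep "waiting for at least one control plane"

If it shows up for your cluster name, one of the three roles still has no registered node.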

P-n-I commented 7 months ago

We're not able to reliably re-create the issue and don't have the time to investigate further. Sometimes it just takes a while for the plan Secret to populate, even though we have 6 nodes (3 etcd/controlplane, 3 worker) wanting to join.

P-n-I commented 6 months ago

We're still seeing this issue:

fleet-default                                    custom-2a6a4339a879-machine-plan                           rke.cattle.io/machine-plan            0  14m
fleet-default                                    custom-638e801c6183-machine-plan                           rke.cattle.io/machine-plan            0  14m
fleet-default                                    custom-60117bd68fdd-machine-plan                           rke.cattle.io/machine-plan            0  15m
fleet-default                                    custom-4623c9642380-machine-plan                           rke.cattle.io/machine-plan            0  16m
fleet-default                                    custom-6aa4c775aee0-machine-plan                           rke.cattle.io/machine-plan            0  16m
fleet-default                                    custom-a8dd42fd6fcc-machine-plan                           rke.cattle.io/machine-plan            0  16m


P-n-I commented 6 months ago

We use Ansible to register the nodes (pull the registration command from the Rancher API and run it on each node).

The first set of nodes to get registered are the ones with control and etcd roles. After they're registered we register worker nodes.

Rancher won't, by design, populate the machine-plans for the nodes until at least one node of each role type is registered.

I tried manually registering 3 nodes that had all three roles but still see the machine-plan having 0 bytes.

kkaempf commented 6 months ago

Does it still happen with Rancher 2.8.1?

P-n-I commented 6 months ago

I've just upgraded to 2.8.2 and wanted to re-create the node in a single-node k3s cluster. I had deleted the node from the downstream cluster while on 2.7.9, then got notified of your comment, so I upgraded to 2.8.2. Joining the node to the cluster is still stuck on an empty machine-plan Secret.

fleet-default                                    custom-624a3f13e536-machine-plan                           rke.cattle.io/machine-plan            0      8m32s

provisioning log:

[INFO ] waiting for infrastructure ready
[INFO ] waiting for at least one control plane, etcd, and worker node to be registered
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for bootstrap etcd to be available
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for agent to check in and apply initial plan
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-scheduler
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] non-ready bootstrap machine(s) custom-624a3f13e536 and join url to be available on bootstrap node
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] marking control plane as initialized and ready
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for plan to be applied
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: Node condition Ready is False., waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for plan to be applied
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for kubelet to update
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.

P-n-I commented 6 months ago

Rancher 2.8.2: created a new k3s cluster and registered one node to it. Destroyed the cluster and re-created it with the same name, ran k3s-uninstall on the node and joined it to the new cluster, and in the UI we see what looks like the old node from the first cluster attempt (see attached screenshot). Note the age of the cluster and the working node versus the age of the node in error.

kfehrenbach commented 5 months ago

We have a similar issue caused by an empty machine-plan for the new nodes in a new cluster. A workaround that helped was this (the join commands are sketched after the list):

  1. Run the command for joining the 1st master (don't wait; go straight to the second step)
  2. Run the command for joining the 1st worker. You will see the 1st master change its status from WaitingNodeRef
  3. Run the command on the 2nd and 3rd masters. After that the cattle-cluster-agent pods will come up and worker 1 changes its status from WaitingNodeRef
  4. Join the other workers

Suddenly the plan for master 1 was populated and the cluster bootstrapping started. We absolutely have no idea why the hell this works... Note: the fleet-agent is version 0.8.1.
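
A rough sketch of those join commands (the URL and token are placeholders; the exact flags come from the Rancher UI's registration command for a custom RKE2 cluster, and "master" here means the etcd + controlplane roles):

# master nodes (etcd + controlplane)
curl -fL https://<rancher-host>/system-agent-install.sh | sudo sh -s - \
  --server https://<rancher-host> --token <registration-token> --etcd --controlplane

# worker nodes
curl -fL https://<rancher-host>/system-agent-install.sh | sudo sh -s - \
  --server https://<rancher-host> --token <registration-token> --worker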

P-n-I commented 1 month ago

We might have found a cause in our env: we upped the timeout on the load balancer we have in front of the nodes running Rancher, as we thought it was probably killing the rancher-system-agent's websocket watch.
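
For anyone with a similar setup, this is the kind of change we mean (a sketch assuming an NGINX-style load balancer in front of Rancher; directive names and defaults differ on other load balancers, and the upstream name rancher_servers is just an example):

# Keep long-lived connections (e.g. the rancher-system-agent websocket watch) open
location / {
    proxy_pass https://rancher_servers;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 900s;   # raised from the 60s default
    proxy_send_timeout 900s;
}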