rancher / rancher

Complete container management platform
http://rancher.com
Apache License 2.0

[BUG] managed cluster lost on migrating Rancher from rke to rke2 #40322

Closed: harridu closed this issue 7 months ago

harridu commented 1 year ago

Rancher Server Setup

Describe the bug

After migrating my Rancher cluster from RKE to RKE2 (using the restore mechanism as described), all managed clusters based on RKE are back, but my one and only managed cluster ("kube003"), based on RKE2 v1.24.8+rke2r1, is stuck in Updating. The error message is

Configuring bootstrap node(s) custom-7e6e3f6464e0: waiting for plan to be applied

In the cluster manager, kube003 shows the message

the server has asked for the client to provide credentials

All clusters are on-premises. The common hostname has been changed in DNS (TTL is 300 seconds) to point to the new hosts, as requested in the migration guide, and it was set on the helm install command line on the new Rancher cluster. The native host names of the old cluster nodes have not been preserved.
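
For anyone debugging the same symptoms: the two messages point at the agents involved in node provisioning and cluster registration. Assuming the standard unit and label names (the commands below are a sketch, not from the original report), they can provide more detail:

# On the downstream node named in the "waiting for plan to be applied" message,
# rancher-system-agent is the service that applies Rancher's provisioning plans:
journalctl -u rancher-system-agent --since "1 hour ago"

# On the downstream cluster, check the agent that authenticates back to Rancher:
kubectl -n cattle-system logs -l app=cattle-cluster-agent --tail=50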

brandond commented 1 year ago

After migrating my rancher cluster from RKE to RKE2 (using the restore mechanism as described)

Just to clarify: did you install a new RKE2 cluster and migrate the Rancher application over via backup/restore, or did you upgrade your cluster in place from RKE to RKE2? In-place migration of a cluster from RKE to RKE2 is not currently supported (it should probably work, but it is still highly experimental), so I wanted to confirm which scenario we're dealing with.

snasovich commented 1 year ago

@harridu , adding to the above, and assuming it is Rancher itself that was migrated from RKE1 to RKE2:

using the restore mechanism as described

Could you please clarify which exact restore procedure was used? Is it the one described at https://ranchermanager.docs.rancher.com/v2.6/how-to-guides/new-user-guides/backup-restore-and-disaster-recovery/migrate-rancher-to-new-cluster

harridu commented 1 year ago

I have set up new hosts rancher01{a..c} using Debian 11 and rke2 v1.24.8+rke2r1. The old hosts (rr0{1..3}) were based on Debian 11 and rke v1.24.8. All hosts are virtual machines (qemu, libvirt, ...) with 4 cores and 8 GByte RAM.

The migration was done using the backup Helm charts and a Restore object (with local S3 storage on MinIO), as described in the documentation.

CHART_VERSION=2.1.2
helm install rancher-backup-crd rancher-charts/rancher-backup-crd -n cattle-resources-system --create-namespace --version $CHART_VERSION
helm install rancher-backup rancher-charts/rancher-backup -n cattle-resources-system --version $CHART_VERSION

First, the restore operator, the S3 secret, and the certificate were created. kubectl describe did not indicate any problems with the Restore object.

apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-migration
spec:
  backupFilename: hourly-95f76381-d9e5-4644-af1b-ca8fdf93a7ae-2023-01-23T10-56-59Z.tar.gz
  prune: false
  storageLocation:
    s3:
      bucketName: rancher01
      credentialSecretName: rancher-backup-s3
      credentialSecretNamespace: cattle-resources-system
      endpoint: minio.example.com:9010
      folder: backup
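
For reference, the credentialSecretName above points at an opaque Secret with accessKey/secretKey entries, as expected by the rancher-backup chart. A minimal sketch with placeholder values (not taken from the original setup):

apiVersion: v1
kind: Secret
metadata:
  name: rancher-backup-s3
  namespace: cattle-resources-system
type: Opaque
stringData:
  accessKey: <minio-access-key>
  secretKey: <minio-secret-key>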

Next, Rancher was installed using

helm install rancher rancher-stable/rancher --namespace cattle-system --set hostname=rancher01.example.com --set ingress.tls.source=secret --version 2.6.9
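
Note that with ingress.tls.source=secret, Rancher expects a TLS secret named tls-rancher-ingress in the cattle-system namespace (presumably the "certificate" mentioned above). A sketch of how such a secret is typically created, with placeholder file names:

kubectl -n cattle-system create secret tls tls-rancher-ingress \
  --cert=tls.crt \
  --key=tls.key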

bpedersen2 commented 1 year ago

If you changed the rancher hostname, then maybe also check https://www.suse.com/support/kb/doc/?id=000020173
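
A quick way to check which URL the restored Rancher believes it is reachable at is the server-url setting; assuming access to the management cluster, something like:

kubectl get settings.management.cattle.io server-url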

harridu commented 1 year ago

I did not change the Rancher hostname ("rancher01.example.com"), just the names of the hosts in the cluster. RKE2 was set up with

server: https://rancher01a.example.com:9345
token: look it up on the first node
tls-san:
  - rancher01.example.com
  - rancher01a.example.com
  - rancher01b.example.com
  - rancher01c.example.com

rancher01 is a round-robin in DNS pointing to the IP addresses of rancher01{a..c}
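
As a sanity check (not part of the original comment), the round-robin entry can be verified with a plain DNS lookup; with the names above it should return the addresses of rancher01a, rancher01b and rancher01c:

dig +short rancher01.example.com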

Did I mention that the 4 clusters running on RKE made the migration without a single warning? Only the cluster based on RKE2 shows "Updating" or "Reconciling" and the message "Waiting for plan to be applied" in the Rancher web UI.

harridu commented 1 year ago

PS, this might be helpful: The fleet-agent on kube003 seems to have a problem:

% k logs -n cattle-fleet-system fleet-agent-bfc5655cc-6wvgx
time="2023-01-27T15:22:50Z" level=error msg="Current credential failed, failing back to reregistering: Unauthorized"
time="2023-01-27T15:22:50Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-system/fleet-agent-bootstrap or cattle-fleet-system/fleet-agent: looking up secret cattle-fleet-system/fleet-agent-bootstrap: secrets \"fleet-agent-bootstrap\" not found"
time="2023-01-27T15:23:50Z" level=error msg="Current credential failed, failing back to reregistering: Unauthorized"
time="2023-01-27T15:23:50Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-system/fleet-agent-bootstrap or cattle-fleet-system/fleet-agent: looking up secret cattle-fleet-system/fleet-agent-bootstrap: secrets \"fleet-agent-bootstrap\" not found"
time="2023-01-27T15:24:50Z" level=error msg="Current credential failed, failing back to reregistering: Unauthorized"
time="2023-01-27T15:24:50Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-system/fleet-agent-bootstrap or cattle-fleet-system/fleet-agent: looking up secret cattle-fleet-system/fleet-agent-bootstrap: secrets \"fleet-agent-bootstrap\" not found"

ron1 commented 1 year ago

@harridu I see the same error with a downstream RKE2 custom cluster when I migrate my Rancher management cluster from one RKE2 cluster to another. This issue seems similar to https://github.com/rancher/rancher/issues/40080, but the proposed fix for that issue is not working for me. Are you able to apply that fix here and confirm that it does not solve this problem either?

harridu commented 1 year ago

Hi @ron1 , thank you for the pointer.

If I understand this correctly, my new Rancher is in a highly questionable state, regardless of whether there is a fix for #40080. Internal data has been corrupted by the broken backup/restore mechanism. The migration failed, i.e. I have to set up a new Rancher to manage my clusters. Since there is no migration procedure for the managed clusters, I have to rebuild them as well.

Seems I wasted a lot of time trying the migration.

harridu commented 7 months ago

closed due to lack of interest