davidnuzik closed this issue 1 year ago.
This is really disconnected from our release cycle. At a later date, we need to decide how it should be tracked against milestones.
Apart from the meta issue of the release-cycle connection, the actual RKE->RKE2 migration path work is being done in the meantime, right? That is what mainly interests everybody; it would be great to have an update on how things are going.
Follow along: https://github.com/galal-hussein/migration-agent
Any update on the migration tool? I currently have a Rancher-provisioned rke1 cluster; would it be possible to upgrade the cluster to rke2?
Any news?
Any news?
Found this in the documentation. Hope it’s helpful. https://docs.rke2.io/migration/
The migration document is somewhat helpful. However, the image for the migration-agent tool referenced in the manifest file does not appear to exist. I opened the following issue: https://github.com/rancher/migration-agent/issues/15
Is there somewhere else to get the migration-agent binary?
Success for me (with some headaches... as expected). I took a slightly different approach and downloaded the binary directly to the nodes to run the agent.
Rough steps for me:
./migration-agent-amd64 <s3 options> --snapshot rke1snapshot.zip \
--node-name node01 \
--kubeconfig /etc/kubernetes/ssl/kubecfg-kube-controller-manager.yaml
./migration-agent-amd64 <s3 options> --snapshot rke1snapshot.zip \
--disable-etcd-restore \
--node-name node02 \
--kubeconfig /etc/kubernetes/ssl/kubecfg-kube-node.yaml
- Review the manifests directory and the /etc/rancher/rke2/config.yaml.d/10-migration.yaml file and edit as appropriate
- Create /etc/rancher/rke2/config.yaml on nodes per their roles
- Create /etc/rancher/rke2/registries.yaml on nodes (for registry mirrors)
- Install rke2 on all the nodes
- Start the rke2-server service on servers and the rke2-agent service on agents

I ran into the following issues:

- migration-agent-addons-remove.yaml had a couple issues:
  - apiVersion: rbac.authorization.k8s.io/v1beta1 is not valid (on newer k8s versions), so the yaml failed entirely (see the sketch after this comment)
  - the names were taken from rke1, so it referenced invalid configmaps (which prevented the job from running)

I have the following remaining questions:

- Is /etc/kubernetes/... still relevant?
- Is /var/lib/etcd/ still relevant?
- Is /var/lib/rancher/rke/ still relevant?
- What about old configmaps? Old jobs? Old secrets?
- Which roles are relevant now, and how do I clean them up (control-plane,controlplane,etcd,master,worker)?

For a relatively complex rke install (almost 4 years old, custom cni, custom ingress, custom kubelet args/env, etc), the process went quite smooth IMO. Especially for someone with no prior experience with rke2.
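For anyone hitting the same rbac failure mentioned above: rbac.authorization.k8s.io/v1beta1 was removed in Kubernetes 1.22, and bumping the manifest to rbac.authorization.k8s.io/v1 is the usual fix. A sketch (the kind and name here are illustrative, not the exact contents of migration-agent-addons-remove.yaml):
apiVersion: rbac.authorization.k8s.io/v1   # was rbac.authorization.k8s.io/v1beta1, removed in k8s 1.22
kind: ClusterRoleBinding                   # apply the same bump to any v1beta1 rbac object in the file
metadata:
  name: migration-agent-addons-remove      # hypothetical name for illustration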
That's great, thanks for that @travisghansen. But to my immediate question and your 2nd bullet point: if you could share with me where I can download 'migration-agent-amd64', I would really appreciate it...
@ArthurMcTool You can download the latest binary from https://github.com/rancher/migration-agent/releases/latest and the image husseingalal/migration-agent:dev from the manifest exists for me, see https://hub.docker.com/r/husseingalal/migration-agent/tags.
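For example (the asset name here is assumed from the binary name used above):
wget -O /usr/local/bin/migration-agent-amd64 https://github.com/rancher/migration-agent/releases/latest/download/migration-agent-amd64
chmod 755 /usr/local/bin/migration-agent-amd64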
Awesome, thanks!
I have an existing RKE1 cluster with three master servers and three worker (agent) nodes. I've followed @travisghansen's steps almost to a tee. The end result, once I upgrade all three master servers, is that depending on which master server I point my kubeconfig at, that server only sees itself as having been upgraded and shows the others as NotReady and at the RKE1 version.
Examples:
NAME STATUS ROLES AGE VERSION
server1 Ready control-plane,controlplane,etcd,master 72m v1.22.7+rke2r1
server2 NotReady controlplane,etcd 72m v1.21.5
server3 NotReady controlplane,etcd 72m v1.21.5
NAME STATUS ROLES AGE VERSION
server1 NotReady controlplane,etcd 72m v1.21.5
server2 Ready control-plane,controlplane,etcd,master 72m v1.22.7+rke2r1
server3 NotReady controlplane,etcd 72m v1.21.5
NAME STATUS ROLES AGE VERSION
server1 NotReady controlplane,etcd 72m v1.21.5
server2 NotReady controlplane,etcd 72m v1.21.5
server3 Ready control-plane,controlplane,etcd,master 72m v1.22.7+rke2r1
I have a different config.yaml for my primary server, and another one for the secondary servers.
Primary config.yaml:
token: "my-token"
cluster-domain: "my-rancher-domain.com"
tls-san: "kubeapi.my-rancher-domain.com"
Secondary config.yaml:
server: "https://server1:9345"
token: "my-token"
cluster-domain: "my-rancher-domain.com"
tls-san: "kubeapi.my-rancher-domain.com"
I've set up a VIP and load balancer for ports 6443/9345 (kubeapi.my-rancher-domain.com); I used that in my config.yaml file when the 'server1:9345' setting didn't seem to work.
It seems like when I run rke2-server on a master, it gets upgraded but knows little if anything about the rest of the cluster. What am I missing here?
I haven't attempted the migration with a multi-controller cluster yet, but I would probably only restore the db on the very first controller, and then join the additional controllers to that and let them sync the etcd db as if joining a brand-new node.
Maybe you did that, if so ignore the comment :)
Hey guys, so I was able to get a cluster (3 masters, 3 workers) sort of migrated from 1.21.7 to v1.22.7+rke2r1 by following @travisghansen's suggestion above to only restore the etcd database on the first master. But I've come across some other issues.
A couple of peculiarities with the workers:
I had to drain and remove the existing nodes one at a time and upgrade them, otherwise I wound up with two entries for each worker node: one with the domain name and one without. Not sure why the domain name is omitted; I guess I'll have to rename them (a sketch of the per-node cadence follows the node list below)...
Also noticed that the workers were not labelled as workers, so I had to manually label them (kubectl label node worker01 node-role.kubernetes.io/worker=true)
NAME STATUS ROLES AGE VERSION
master01.mydomain.com Ready control-plane,controlplane,etcd,master 127m v1.22.7+rke2r1
master02.mydomain.com Ready control-plane,controlplane,etcd,master 127m v1.22.7+rke2r1
master03.mydomain.com Ready control-plane,controlplane,etcd,master 127m v1.22.7+rke2r1
worker01 Ready 26m v1.22.7+rke2r1
worker02 Ready 25m v1.22.7+rke2r1
worker03 Ready 21m v1.22.7+rke2r1
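The per-node cadence referenced above, as a sketch (node names are examples; the rke2-agent install and start happens between the delete and the label):
kubectl drain worker01 --ignore-daemonsets --delete-emptydir-data   # evict workloads from the old rke1 node
kubectl delete node worker01                                        # drop the stale rke1 registration
# ...install and start rke2-agent on the node, let it rejoin...
kubectl label node worker01 node-role.kubernetes.io/worker=true     # restore the worker role label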
Then I also have a bunch of pods that are stuck pending:
k describe pods -n kube-system helm-install-rke2-canal--1-zm4hh:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12m default-scheduler 0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had taint {node-role.kubernetes.io/controlplane: true}, that the pod didn't tolerate.
And another one: k describe pods -n ingress-nginx nginx-ingress-controller-x2rtp:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 8m52s default-scheduler 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.
I'll work on uploading my notes for my manual migration here to this chat. They are an extension of what Travis provided, but quite different from those on the migration page.
If anyone has any suggestions - they would be appreciated...
Those errors seem pretty odd given the circumstances. It almost appears as if it's the old rke1 addons (which should have been removed). For example, the controlplane role was an rke1-ism... in rke2 it's control-plane.
On a fresh install of rke2, the ingress-nginx deploy also appears to be going to the kube-system ns, not ingress-nginx.
My rke1 installs had most of the addons disabled anyway, which eliminated some of these issues for sure. For what it's worth, on a fresh rke2 install the roles for the nodes are:
- server: control-plane,etcd,master
- agent: <none> (as you might expect, no labels are added in this case)
Thanks @travisghansen, we don't use the addons in our RKE1 clusters. For instance, k get addons -A
returns nothing. During the migration (rke2-server startup) I do see it executing the manifest to remove addons but it provides a warning or error that it can't find the addons file in /etc/rke_addon/... Is the migration tool still in development? Has anyone been able to successfully migrate a cluster - something with multiple masters and workers?
thanks..
(Curious) given the output in my previous post, what kind of state or condition is my cluster actually in? How can I manage these pending pods?
I'm referring to this (not some sort of resource called addons):
Hello again,
So I've added the following to my /etc/rancher/rke2/config.yaml in order to disable addons:
disable: rke2-canal
disable: rke2-canal-config
disable: rke2-coredns
disable: rke2-coredns-config
disable: rke2-ingress-nginx
disable: rke2-ingress-nginx-config
disable: rke2-metrics-server
disable: rke2-metrics-server-config
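One thing worth flagging about the snippet above: repeated keys in a YAML mapping override one another, so only the last disable line would normally take effect. rke2 accepts repeatable flags as a YAML list, which is probably what was intended; a sketch with the chart names from above (I'm not sure the -config entries are separate charts):
disable:
  - rke2-canal
  - rke2-coredns
  - rke2-ingress-nginx
  - rke2-metrics-server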
I did all the other things mentioned earlier in order to migrate, then I ran rke2-server on the first master node and still got stuck with all this:
k get pods -A | grep -vi running | grep -vi completed
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system helm-install-rke2-canal--1-vljg2 0/1 CrashLoopBackOff 13 (88s ago) 21m
kube-system helm-install-rke2-coredns--1-fmwnh 0/1 CrashLoopBackOff 13 (91s ago) 21m
kube-system helm-install-rke2-ingress-nginx--1-zz9qb 0/1 Terminating 0 21m
kube-system helm-install-rke2-metrics-server--1-p95jn 0/1 CrashLoopBackOff 6 (79s ago) 7m28s
kube-system rke2-ingress-nginx-controller-d6hrx 0/1 Pending 0 15m
I can't see how this migration tool is ready. Our organisation has a lot invested in Rancher and we need a migration path. Can any of the developers comment on the status of this migration tool project, what has been tested, what works/doesn't, etc?
thanks...
@galal-hussein anything additional we need to add to this before closing it?
The tool has too many rough edges IMO to be considered ready for business use. Perhaps it's as good as it's going to get, though.
I found a bug in the migration tool for kubelet args:
The following configuration from my rke1 cluster.yaml
services:
  kubelet:
    extra_args:
      resolv-conf: "/run/resolvconf/resolv.conf" # for systemd-resolved
      max-pods: 150
      enforce-node-allocatable: "pods"
      system-reserved: "cpu=150m,memory=150Mi,ephemeral-storage=1Gi"
      kube-reserved: "cpu=150m,memory=150Mi,ephemeral-storage=1Gi"
      eviction-hard: "memory.available<500Mi,nodefs.available<10%"
Turns into
"kubelet-arg":"enforce-node-allocatable=pods,eviction-hard=memory.available\u003c500Mi,nodefs.available\u003c10%,kube-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi,max-pods=150,resolv-conf=/run/resolvconf/resolv.conf,system-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi"
But I had to modify it to be:
"kubelet-arg":["enforce-node-allocatable=pods","eviction-hard=memory.available<500Mi,nodefs.available<10%","kube-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi","max-pods=150","resolv-conf=/run/resolvconf/resolv.conf","system-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi"]
It did not make the options an array, and it escaped eviction-hard=memory.available<500Mi,nodefs.available<10% to eviction-hard=memory.available\u003c500Mi,nodefs.available\u003c10%
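rke2 also accepts kubelet-arg as a YAML list, so the corrected value can be written in /etc/rancher/rke2/config.yaml like this (a sketch with the same values):
kubelet-arg:
  - enforce-node-allocatable=pods
  - eviction-hard=memory.available<500Mi,nodefs.available<10%
  - kube-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi
  - max-pods=150
  - resolv-conf=/run/resolvconf/resolv.conf
  - system-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi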
After fixing that, I'm getting the following error:
Jun 23 10:12:12 k8s-master1 rke2[8709]: time="2022-06-23T10:12:12-04:00" level=info msg="Reconciling bootstrap data between datastore and disk"
Jun 23 10:12:12 k8s-master1 rke2[8709]: time="2022-06-23T10:12:12-04:00" level=fatal msg="Failed to reconcile with temporary etcd: bootstrap data already found and encrypted with different token"
I have an 8-worker cluster with 3 controllers. I start by stopping docker on the 3 controllers. Then on the first controller I run the migration tool, fix the config error above, and start rke2. I run into this issue every time.
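For what it's worth, that fatal usually indicates the embedded etcd already holds bootstrap data written with a different token, e.g. from an earlier failed start. One approach that has worked elsewhere (an assumption on my part, not verified against this exact setup) is to wipe the rke2 server datastore on that controller and re-run the restore so it bootstraps with the current token:
systemctl stop rke2-server
rke2-killall.sh                          # installed alongside rke2; stops leftover processes
rm -rf /var/lib/rancher/rke2/server/db   # removes this node's embedded etcd datastore only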
I had meant to post this earlier, hopefully it will help someone...
This ticket will be used to upgrade our three master, 7 worker (Agent) cluster from:
It is unfortunate that the migration tool only copies a few config files. The bulk of the heavy lifting has to be done manually. I've run through this 100 or more times in order to find all the little things that we needed to do to successfully migrate our clusters.
There are many things (CRDs, service accounts, certain deployments) that need to be removed from your RKE1 cluster first, otherwise the migration fails.
Remove all this from the RKE1 Cluster:
k delete daemonset -n kube-system canal
k delete crd bgpconfigurations.crd.projectcalico.org
k delete crd bgppeers.crd.projectcalico.org
k delete crd blockaffinities.crd.projectcalico.org
k delete crd clusterinformations.crd.projectcalico.org
k delete crd felixconfigurations.crd.projectcalico.org
k delete crd globalnetworkpolicies.crd.projectcalico.org
k delete crd globalnetworksets.crd.projectcalico.org
k delete crd hostendpoints.crd.projectcalico.org
k delete crd ipamblocks.crd.projectcalico.org
k delete crd ipamconfigs.crd.projectcalico.org
k delete crd ipamhandles.crd.projectcalico.org
k delete crd ippools.crd.projectcalico.org
k delete crd kubecontrollersconfigurations.crd.projectcalico.org
k delete crd networkpolicies.crd.projectcalico.org
k delete crd networksets.crd.projectcalico.org
k delete deployments -n kube-system calico-kube-controllers
k delete deployments -n kube-system coredns
k delete deployments -n kube-system coredns-autoscaler
k delete deployments -n kube-system metrics-server
### CLUSTER SPECIFIC:
k delete pods -n kube-system rke-coredns-addon-deploy-job-4chs4
k delete pods -n kube-system rke-ingress-controller-deploy-job-xqw8f
k delete pods -n kube-system rke-metrics-addon-deploy-job-rf78v
k delete pods -n kube-system rke-network-plugin-deploy-job-9hjtz
k delete serviceaccounts -n kube-system canal
k delete serviceaccounts -n kube-system coredns
k delete clusterrole flannel
k delete clusterrolebinding canal-flannel
k delete clusterrolebinding canal-calico
k delete apiservice v1beta1.metrics.k8s.io
k delete ns ingress-nginx
From your workstation and using the correct version of RKE1 and kube config for this cluster, create the etcd backup:
../rke_1.3.1 etcd snapshot-save --config my-cluster.yml --name my-cluster
Download migration-agent to all cluster nodes directly:
wget -O /usr/local/bin/migration-agent-amd64 http://internal-server/artifacts/RKE/migration-agent-amd64
chmod 755 /usr/local/bin/migration-agent-amd64
Run migration agent on primary master node only:
/usr/local/bin/migration-agent-amd64 --snapshot /opt/rke/etcd-snapshots/my-cluster.zip \
--node-name master1.mydomain.com \
--kubeconfig /etc/kubernetes/ssl/kubecfg-kube-controller-manager.yaml
Run migration agent on all secondary master nodes individually (change the node name as you go):
/usr/local/bin/migration-agent-amd64 --snapshot /opt/rke/etcd-snapshots/my-cluster.zip \
--disable-etcd-restore \
--node-name master[2,3].mydomain.com \
--kubeconfig /etc/kubernetes/ssl/kubecfg-kube-controller-manager.yaml
If there are any workers, on the primary master node temporarily modify the permissions on the backup file:
chmod 644 /opt/rke/etcd-snapshots/my-cluster.zip
If there are any workers, create the snapshot directory on them and then scp the backup file from the primary master to each of the workers:
mkdir -p /opt/rke/etcd-snapshots
scp ${user}@master1:/opt/rke/etcd-snapshots/my-cluster.zip /opt/rke/etcd-snapshots/
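With several workers, a small loop run from the primary master does the same thing (assumes ${user} is set and ssh keys are in place):
for w in worker1 worker2 worker3; do
  ssh "${user}@${w}" mkdir -p /opt/rke/etcd-snapshots
  scp /opt/rke/etcd-snapshots/my-cluster.zip "${user}@${w}:/opt/rke/etcd-snapshots/"
done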
On ALL nodes:
chmod 600 /opt/rke/etcd-snapshots/*
ls -lah /opt/rke/etcd-snapshots/
Run the migration agent on each of the workers individually (change the node name as you go):
/usr/local/bin/migration-agent-amd64 --snapshot /opt/rke/etcd-snapshots/my-cluster.zip \
--disable-etcd-restore \
--node-name worker[1,2,3,4,5,6,7].mydomain.com \
--kubeconfig /etc/kubernetes/ssl/kubecfg-kube-node.yaml
Add additional firewall rules to all nodes if the node is using iptables and not firewalld:
iptables -D INPUT -j LOG_DROP
iptables -A INPUT -p tcp -m tcp --dport 179 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 5473 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 9098 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 9345 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 30000:32767 -j ACCEPT
iptables -A INPUT -p udp -m udp --dport 51820 -j ACCEPT
iptables -A INPUT -p udp -m udp --dport 51821 -j ACCEPT
iptables -A INPUT -j LOG_DROP
wget -O /etc/iptables/rules.v4 http://internal-server/artifacts/rules.v4
Create the rancher config directory on all nodes:
mkdir -p /etc/rancher/rke2/
Get the RKE2 Binary for all nodes:
TMP_TARBALL="/tmp/rke2-1.22.7+rke2r1.tar.gz"
wget -O "${TMP_TARBALL}" http://internal-server/artifacts/RKE/rke2-1.22.7+rke2r1.tar.gz
tar xzf "${TMP_TARBALL}" -C /usr/local
mv -f /usr/local/lib/systemd/system/rke2-*.service /etc/systemd/system/
systemctl daemon-reload
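A quick sanity check that the unpack worked (the tarball places the binary under /usr/local/bin):
/usr/local/bin/rke2 --version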
Edit the rancher config.yaml on the primary master node. (NOTE ${TOKEN} - create your own secret token):
echo -e "token: \"${TOKEN}\"\ncluster-domain: \"my-cluster-domain.com\"\ntls-san: \"kubeapi.my-cluster-domain.com\"" > /etc/rancher/rke2/config.yaml
Edit the rancher config.yaml on all secondary master nodes:
echo -e "server: \"https://kubeapi.my-cluster-domain.com:9345\"\ntoken: \"${TOKEN}\"\ncluster-domain: \"my-cluster-domain.com\"\ntls-san: \"kubeapi.my-cluster-domain.com\"" > /etc/rancher/rke2/config.yaml
Edit the rancher config.yaml on all worker/agent nodes:
echo -e "server: \"https://kubeapi.my-cluster-domain.com:9345\"\ntoken: \"${TOKEN}\"\ncluster-domain: \"my-cluster-domain.com\"\ntls-san: \"kubeapi.my-cluster-domain.com\"\nnode-name: \"`hostname -A | cut -d " " -f1`\"" > /etc/rancher/rke2/config.yaml
Check the config files:
cat /etc/rancher/rke2/config.yaml
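As a reference point, the rendered file on a secondary master should look like this (values taken from the echo commands above; the token is whatever you set ${TOKEN} to):
server: "https://kubeapi.my-cluster-domain.com:9345"
token: "my-secret-token"
cluster-domain: "my-cluster-domain.com"
tls-san: "kubeapi.my-cluster-domain.com"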
On the master nodes, edit /etc/rancher/rke2/config.yaml.d/10-migration.yaml and change the kube-dns IP as follows. Yours might be slightly different; the point is to change the host portion of the IP to something different from the original:
sed -i 's/10.233.30.10/10.233.30.11/g' /etc/rancher/rke2/config.yaml.d/10-migration.yaml
Start the upgrade on the primary master:
systemctl disable docker
systemctl disable docker.socket
systemctl stop docker
systemctl stop docker.socket
reboot
rm -Rf /var/lib/docker
systemctl enable rke2-server.service
systemctl start rke2-server.service
You might need to change the server line in your kube config to point to the IP address of the primary master. Once the upgrade is complete you can change it to the VIP IP.
Start the upgrade on each secondary master individually, waiting for it to finish before moving on to the next node:
systemctl disable docker
systemctl disable docker.socket
systemctl stop docker
systemctl stop docker.socket
reboot
rm -Rf /var/lib/docker
systemctl enable rke2-server.service
systemctl start rke2-server.service
Remove the old controlplane label from the master nodes:
kubectl label node master1.mydomain.com node-role.kubernetes.io/controlplane-
kubectl label node master2.mydomain.com node-role.kubernetes.io/controlplane-
kubectl label node master3.mydomain.com node-role.kubernetes.io/controlplane-
let it smoke for a while...
To get rke-canal and rke-coredns install job pods to run, remove these taints from a master node:
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/etcd-
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/controlplane-
Once the kube is 99% stable, re-apply the taints:
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/etcd=true:NoExecute
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/controlplane=true:NoSchedule
Once the cluster is in a good state, start the upgrade on the worker nodes individually. If your agent nodes are also your masters, then skip this step:
systemctl disable docker
systemctl disable docker.socket
systemctl stop docker
systemctl stop docker.socket
reboot
rm -Rf /var/lib/docker
systemctl enable rke2-agent.service
systemctl start rke2-agent.service
Get a list of completed jobs to delete (if any):
k get pods -A | grep 0/ | awk '{$1=$1};1' | cut -d " " -f 1,2
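To actually delete them, something like this works (a sketch; review the list first, since 0/ also matches pods that are merely not ready):
k get pods -A | grep 0/ | awk '{print $1, $2}' | while read -r ns pod; do
  k delete pod -n "$ns" "$pod"
done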
Create a backup tgz file of the /etc/rancher/rke2 directory from the primary master:
CLUSTERBASE=my-cluster
KUBEAPIURL="kubeapi.my-cluster-domain.com"
cp /etc/rancher/rke2/rke2.yaml /etc/rancher/rke2/kube_config
sed -i "s/127.0.0.1/${KUBEAPIURL}/g" /etc/rancher/rke2/kube_config
tar -cvzf /etc/rancher/rke2/$CLUSTERBASE-cluster.tgz -C /etc/rancher/rke2 config.yaml kube_config config.yaml.d
SCP that file to your local workstation and store it safely.
@ArthurMcTool Thanks for your detail! I'll give this a try.
Almost seems as though I should just remove the rke1 items like you talked about, take a velero backup, and restore to a rke2 cluster.
I guess I got the "(NOTE ${TOKEN} - create your own secret token)" part wrong, as I have the master node Ready and both secondary masters NotReady:
NAME STATUS ROLES AGE VERSION
docker01.xxx.it Ready worker 329d v1.22.9
docker02.xxx.it Ready worker 329d v1.22.9
docker03.xxx.it Ready worker 329d v1.22.9
docker04.xxx.it NotReady etcd 129d v1.22.9
docker05.xxx.it Ready control-plane,etcd,master 129d v1.23.8+rke2r1
docker06.xxx.it NotReady etcd 129d v1.22.9
Ok, I figured it out: I'd missed an argument in the migration command (--disable-etcd-restore). Now this is the situation:
NAME STATUS ROLES AGE VERSION
docker01.xxx.it Ready worker 20m v1.23.8+rke2r1
docker02.xxx.it Ready worker 330d v1.23.8+rke2r1
docker03.xxx.it Ready worker 330d v1.23.8+rke2r1
docker04.xxx.it Ready control-plane,etcd,master 129d v1.23.8+rke2r1
docker05.xxx.it Ready control-plane,etcd,master 129d v1.23.8+rke2r1
docker06.xxx.it Ready control-plane,etcd,master 129d v1.23.8+rke2r1
But now the problem is that almost nothing is starting properly:
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-fleet-local-system fleet-agent-6fc847b9dd-kclhl 0/1 ContainerCreating 0 26m
cattle-fleet-system fleet-controller-7bbcb965f9-mwpwx 0/1 ContainerCreating 0 26m
cattle-fleet-system gitjob-55448cdfd7-8wnp2 0/1 ContainerCreating 0 26m
cattle-resources-system rancher-backup-6968b9cb8f-bk56w 0/1 ContainerCreating 0 18h
cattle-system helm-operation-l6mks 0/2 Completed 1 22h
cattle-system rancher-5677f59677-4x4b6 0/1 ContainerCreating 0 26m
cattle-system rancher-5677f59677-ddzrh 0/1 Error 2 45h
cattle-system rancher-5677f59677-qvmt5 0/1 ContainerCreating 2 40h
cattle-system rancher-webhook-675bccfc59-cp78h 0/1 ContainerCreating 0 26m
cert-manager cert-manager-6d6bb4f487-xkdd9 0/1 ContainerCreating 0 18h
cert-manager cert-manager-cainjector-7d55bf8f78-4tjxt 0/1 ContainerCreating 0 18h
cert-manager cert-manager-webhook-577f77586f-2pgdm 0/1 ContainerCreating 0 18h
collabora collabora-muflo-5c6bc75746-d77mg 0/1 ContainerCreating 0 62m
default dc-5cb75d894b-lvzgn 0/1 ContainerCreating 1 37h
guacamole db-685c8bd744-mw9tc 0/1 ContainerCreating 1 37h
guacamole guacamole-594d4cb8d8-9n6nd 0/1 ContainerCreating 1 37h
guacamole guacd-869cdd66c4-dcdwx 0/1 ContainerCreating 1 37h
ingress-nginx nginx-ingress-controller-vj4l6 0/1 ContainerCreating 0 17m
ingress-nginx nginx-ingress-controller-xbn6p 0/1 ContainerCreating 7 (40h ago) 134d
ingress-nginx nginx-ingress-controller-z9tcb 0/1 ContainerCreating 9 (40h ago) 134d
jenkins jenkins-79cd6dc9c6-tfvhg 0/1 ContainerCreating 1 37h
kube-system cloud-controller-manager-docker04.mufloland.it 1/1 Running 0 20m
kube-system cloud-controller-manager-docker05.mufloland.it 1/1 Running 2 (18h ago) 19h
kube-system cloud-controller-manager-docker06.mufloland.it 1/1 Running 0 6m38s
kube-system coredns-8578b6dbdd-bc8rl 0/1 ContainerCreating 6 (40h ago) 134d
kube-system coredns-8578b6dbdd-wtl5p 0/1 ContainerCreating 1 37h
kube-system coredns-autoscaler-79dcc864f5-2wggr 0/1 ContainerCreating 6 (40h ago) 134d
kube-system etcd-docker04.mufloland.it 1/1 Running 0 19m
kube-system etcd-docker05.mufloland.it 1/1 Running 1 (18h ago) 19h
kube-system etcd-docker06.mufloland.it 1/1 Running 0 5m55s
kube-system helm-install-rke2-canal-qfwxq 0/1 Pending 0 19h
kube-system helm-install-rke2-coredns-rfbzl 0/1 Pending 0 19h
kube-system helm-install-rke2-ingress-nginx-vvw56 0/1 ContainerCreating 0 18h
kube-system helm-install-rke2-metrics-server-5lkg9 0/1 ContainerCreating 0 19m
kube-system kube-apiserver-docker04.mufloland.it 1/1 Running 0 19m
kube-system kube-apiserver-docker05.mufloland.it 1/1 Running 1 (18h ago) 19h
kube-system kube-apiserver-docker06.mufloland.it 1/1 Running 0 6m40s
kube-system kube-controller-manager-docker04.mufloland.it 1/1 Running 0 20m
kube-system kube-controller-manager-docker05.mufloland.it 1/1 Running 2 (18h ago) 19h
kube-system kube-controller-manager-docker06.mufloland.it 1/1 Running 0 6m38s
kube-system kube-proxy-docker01.mufloland.it 1/1 Running 0 18m
kube-system kube-proxy-docker02.mufloland.it 1/1 Running 0 60m
kube-system kube-proxy-docker03.mufloland.it 1/1 Running 0 49m
kube-system kube-proxy-docker04.mufloland.it 1/1 Running 0 19m
kube-system kube-proxy-docker05.mufloland.it 1/1 Running 1 (18h ago) 19h
kube-system kube-proxy-docker06.mufloland.it 1/1 Running 0 6m30s
kube-system kube-scheduler-docker04.mufloland.it 1/1 Running 0 20m
kube-system kube-scheduler-docker05.mufloland.it 1/1 Running 1 (18h ago) 19h
kube-system kube-scheduler-docker06.mufloland.it 1/1 Running 0 6m38s
kube-system metrics-server-6bc7854fb5-cmrfm 0/1 ContainerCreating 14 (40h ago) 199d
kube-system weave-net-4pv6g 3/3 Running 6 (13m ago) 134d
kube-system weave-net-9qnwq 3/3 Running 0 18m
kube-system weave-net-f5wfn 3/3 Running 10 (19m ago) 129d
kube-system weave-net-gmgnc 1/3 CrashLoopBackOff 27 (2m15s ago) 129d
kube-system weave-net-gsx9v 3/3 Running 3 (18h ago) 129d
kube-system weave-net-kf4f7 3/3 Running 3 14d
longhorn-system csi-attacher-5ddf9c48cf-6kt9x 0/1 ContainerCreating 9 (19h ago) 45h
longhorn-system csi-attacher-5ddf9c48cf-mdmkg 0/1 ContainerCreating 0 26m
longhorn-system csi-attacher-5ddf9c48cf-rjf6p 0/1 ContainerCreating 1 (40h ago) 40h
longhorn-system csi-provisioner-59b7b8b7b8-jbvkf 0/1 ContainerCreating 0 26m
longhorn-system csi-provisioner-59b7b8b7b8-jxg77 0/1 Error 12 (18h ago) 45h
longhorn-system csi-provisioner-59b7b8b7b8-v9grf 0/1 ContainerCreating 0 26m
longhorn-system csi-resizer-68ccff94-7fcqw 0/1 ContainerCreating 7 (19h ago) 45h
longhorn-system csi-resizer-68ccff94-kr4vl 0/1 ContainerCreating 0 26m
longhorn-system csi-resizer-68ccff94-krpft 0/1 ContainerCreating 0 26m
longhorn-system csi-snapshotter-6d7d679c98-hsmkf 0/1 ContainerCreating 9 (18h ago) 45h
longhorn-system csi-snapshotter-6d7d679c98-mbw4j 0/1 ContainerCreating 0 26m
longhorn-system csi-snapshotter-6d7d679c98-nxft7 0/1 ContainerCreating 0 26m
longhorn-system engine-image-ei-0422ab0c-8b47t 0/1 ContainerCreating 0 18m
longhorn-system engine-image-ei-0422ab0c-h5dr4 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-0422ab0c-w7wxn 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-47b59147-5mmkn 0/1 ContainerCreating 0 18m
longhorn-system engine-image-ei-47b59147-kqxkd 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-47b59147-qfx67 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-d474e07c-fzg9c 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-d474e07c-hfqxg 0/1 ContainerCreating 0 18m
longhorn-system engine-image-ei-d474e07c-v5d8d 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-edd4cae3-2s5c6 0/1 ContainerCreating 0 18m
longhorn-system engine-image-ei-edd4cae3-flsdj 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-edd4cae3-j948g 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system instance-manager-e-3ca11cf1 0/1 ContainerStatusUnknown 1 40h
longhorn-system instance-manager-r-29d5f090 0/1 ContainerStatusUnknown 1 40h
longhorn-system longhorn-admission-webhook-744ddbbbd8-7svhr 0/1 Init:0/1 0 26m
longhorn-system longhorn-admission-webhook-744ddbbbd8-s6hnc 0/1 PodInitializing 2 (40h ago) 2d3h
longhorn-system longhorn-conversion-webhook-9864564d8-4mggz 0/1 ContainerCreating 5 (40h ago) 47h
longhorn-system longhorn-conversion-webhook-9864564d8-r62rs 0/1 ContainerCreating 0 26m
longhorn-system longhorn-csi-plugin-9nk5s 0/2 ContainerCreating 7 (40h ago) 45h
longhorn-system longhorn-csi-plugin-fdpmn 0/2 ContainerCreating 0 18m
longhorn-system longhorn-csi-plugin-qbs5f 0/2 ContainerCreating 5 (40h ago) 45h
longhorn-system longhorn-driver-deployer-7d4d6d6cb-l7lh8 0/1 PodInitializing 3 (40h ago) 2d3h
longhorn-system longhorn-manager-58mxv 0/1 Init:0/1 0 18m
longhorn-system longhorn-manager-8vnxk 0/1 PodInitializing 2 (40h ago) 45h
longhorn-system longhorn-manager-l4t58 0/1 PodInitializing 2 (40h ago) 45h
longhorn-system longhorn-ui-75646c6c6f-9wdr6 0/1 ContainerCreating 4 (40h ago) 2d3h
registry registry-64c5db47fc-zn7fm 0/1 ContainerCreating 1 37h
Any suggestions?
You should probably run kubectl describe on those pods and see why they are stuck. Might give you an idea why they are not starting.
Almost there but still no luck. Canal gets deployed once I delete the weave-net plugin, of course! But nginx and coredns still won't start:
NAME READY STATUS RESTARTS AGE
cloud-controller-manager-docker04.mufloland.it 1/1 Running 10 (33m ago) 30h
cloud-controller-manager-docker05.mufloland.it 1/1 Running 12 (33m ago) 2d2h
cloud-controller-manager-docker06.mufloland.it 1/1 Running 10 (33m ago) 30h
etcd-docker04.mufloland.it 1/1 Running 4 (34m ago) 30h
etcd-docker05.mufloland.it 1/1 Running 5 (34m ago) 2d2h
etcd-docker06.mufloland.it 1/1 Running 4 (34m ago) 30h
helm-install-rke2-canal-tvxf4 0/1 Completed 0 4h31m
helm-install-rke2-coredns-r27hw 0/1 CrashLoopBackOff 3 (18s ago) 79s
helm-install-rke2-ingress-nginx-dnwv6 0/1 CrashLoopBackOff 5 (29s ago) 3m46s
helm-install-rke2-metrics-server-k4rk9 0/1 Completed 0 4h25m
kube-apiserver-docker04.mufloland.it 1/1 Running 5 (33m ago) 30h
kube-apiserver-docker05.mufloland.it 1/1 Running 6 (33m ago) 2d2h
kube-apiserver-docker06.mufloland.it 1/1 Running 6 (33m ago) 30h
kube-controller-manager-docker04.mufloland.it 1/1 Running 10 (33m ago) 30h
kube-controller-manager-docker05.mufloland.it 1/1 Running 11 (33m ago) 2d2h
kube-controller-manager-docker06.mufloland.it 1/1 Running 10 (33m ago) 30h
kube-proxy-docker01.mufloland.it 1/1 Running 2 (6m51s ago) 30h
kube-proxy-docker02.mufloland.it 1/1 Running 0 26m
kube-proxy-docker03.mufloland.it 1/1 Running 1 (30m ago) 3h16m
kube-proxy-docker04.mufloland.it 1/1 Running 4 (34m ago) 30h
kube-proxy-docker05.mufloland.it 1/1 Running 5 (34m ago) 2d2h
kube-proxy-docker06.mufloland.it 1/1 Running 4 (34m ago) 30h
kube-scheduler-docker04.mufloland.it 1/1 Running 5 (34m ago) 30h
kube-scheduler-docker05.mufloland.it 1/1 Running 5 (34m ago) 2d2h
kube-scheduler-docker06.mufloland.it 1/1 Running 4 (34m ago) 30h
rke2-canal-79dsh 2/2 Running 2 (6m51s ago) 42m
rke2-canal-8c94w 2/2 Running 4 (26m ago) 42m
rke2-canal-bg289 2/2 Running 2 (30m ago) 42m
rke2-canal-fk4qt 2/2 Running 2 (34m ago) 42m
rke2-canal-j5l5v 2/2 Running 2 (34m ago) 42m
rke2-canal-s6zqm 2/2 Running 2 (34m ago) 42m
rke2-coredns-rke2-coredns-545d64676-7p29k 0/1 Running 0 17s
rke2-coredns-rke2-coredns-545d64676-ls8hv 0/1 Running 0 15s
rke2-coredns-rke2-coredns-autoscaler-6bf4775c97-crszm 1/1 Running 0 17s
rke2-metrics-server-6564db4569-lms4p 1/1 Running 1 (30m ago) 40m
coredns
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m59s default-scheduler Successfully assigned kube-system/helm-install-rke2-coredns-r27hw to docker06.mufloland.it
Normal Pulling 2m59s kubelet Pulling image "rancher/klipper-helm:v0.7.3-build20220613"
Normal Pulled 2m51s kubelet Successfully pulled image "rancher/klipper-helm:v0.7.3-build20220613" in 8.398116625s
Normal Created 64s (x5 over 2m50s) kubelet Created container helm
Normal Started 64s (x5 over 2m50s) kubelet Started container helm
Normal Pulled 64s (x4 over 2m48s) kubelet Container image "rancher/klipper-helm:v0.7.3-build20220613" already present on machine
Warning BackOff 51s (x9 over 2m45s) kubelet Back-off restarting failed container
ingress-nginx
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m default-scheduler Successfully assigned kube-system/helm-install-rke2-ingress-nginx-dnwv6 to docker03.mufloland.it
Normal Pulling 5m59s kubelet Pulling image "rancher/klipper-helm:v0.7.3-build20220613"
Normal Pulled 5m52s kubelet Successfully pulled image "rancher/klipper-helm:v0.7.3-build20220613" in 7.638299786s
Normal Created 4m18s (x5 over 5m52s) kubelet Created container helm
Normal Started 4m18s (x5 over 5m51s) kubelet Started container helm
Normal Pulled 4m18s (x4 over 5m49s) kubelet Container image "rancher/klipper-helm:v0.7.3-build20220613" already present on machine
Warning BackOff 48s (x23 over 5m46s) kubelet Back-off restarting failed container
I think in order to get those pods going (the ones starting with helm-install) you'll need to remove the taints from your masters so they are schedulable. You can check the pod logs to make sure, but I ran into that before. Then once the pods run, some of the other parts should fall into place. Also don't forget to re-apply the taints afterwards to make the masters un-schedulable again. Something like this:
To get rke-canal and rke-coredns install job pods to run, remove these taints from a master node:
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/etcd-
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/controlplane-
Once the kube is 99% stable, re-apply the taints:
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/etcd=true:NoExecute
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/controlplane=true:NoSchedule
Thanks for the help! rke-canal gets installed just fine; the rke-coredns and rke2-ingress-nginx helm installs just continuously crash. I've tried removing all the taints from all the master nodes and even leaving the taints there: same behaviour. Should I remove the taints only from one of the three master nodes and leave the taints on the others?
Well, this is interesting:
Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: IngressClass "nginx" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "rke2-ingress-nginx"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "kube-system"
+ exit
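There are two common ways out of that Helm error: delete the pre-existing IngressClass and let the chart recreate it, or add the ownership metadata Helm is asking for so the existing resource gets adopted into the release (a sketch built from the error text above):
kubectl label ingressclass nginx app.kubernetes.io/managed-by=Helm --overwrite
kubectl annotate ingressclass nginx meta.helm.sh/release-name=rke2-ingress-nginx --overwrite
kubectl annotate ingressclass nginx meta.helm.sh/release-namespace=kube-system --overwrite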
Did you run the migration yaml to remove all the rke1 addons entirely first?
I found that I had to manually remove the old RKE1 components. In my instructions there's a section there titled: "Remove all this from the RKE1 Cluster". All of those things need to be removed and the final step is to remove the ingress-nginx namespace. Also note that within that section there's a subsection titled "CLUSTER SPECIFIC", the names of those pods will be indicative of your cluster...
Now all nodes are up & running just fine. But it's still applying the manifests for canal, metrics and coredns in an infinite loop. Should I manually remove the manifests in /var/lib/rancher/rke2/server/manifests/? Also, I'm getting a weird error from one of my pods about a failing volume mount (from longhorn):
MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
EDIT: that error was my fault; I had to bind-mount the docker socket to get the docker plugin in Jenkins working!
No way I can get rid of this "migration-agent-addons-remove". I removed the job, it just pops up again infinitely; removing it from /var/lib/rancher/rke2/server/manifests/ on the 3 master nodes and rebooting them has no effect. I removed it from the Rancher UI in the addons section. It still pops up and fails forever!
Ya, you're right. Even with the pods started properly, I still could not get rid of the migration-agent-addons-remove errors either...
Remove the file from the manifests directory on all controllers. Then kubectl delete the file from your workstation.
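Concretely, something like this (paths as used earlier in the thread; you need a local copy of the manifest for the delete):
# on each server/controller node:
rm /var/lib/rancher/rke2/server/manifests/migration-agent-addons-remove.yaml
# then, from a workstation with a copy of the manifest:
kubectl delete -f migration-agent-addons-remove.yaml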
Ouch. If CRDs are unable to be migrated, then you might as well do a clean reinstall.
Hi,
when searching for a way to migrate a Rancher RKE1 cluster to RKE2, this is what I found. Isn't there any kind of "official documentation" on this topic? What about the migration-agent (https://github.com/galal-hussein/migration-agent) used here? It seems there is no documentation for it either; where did you find info on how to use it? And it seems it has not been updated for a very long time; is it still "recommended" to use it?
Unfortunately this has also been a sore spot and disappointment for me and our team. Documentation is sketchy at best, and I feel Rancher has completely dropped the ball on the RKE2 migration. We learnt to upgrade to RKE2 simply by brute force, trying again and again. The migration-agent seems to only reconfigure some of your config files and does not actually perform a migration. Spinning up RKE2 stalls every time it comes across a resource that already exists, so it's up to you to remove all that stuff beforehand. The notes above are tedious, but they do seem to work. I hope the Rancher team will have a better approach for RKE3...
Thanks for your answer. I actually saw your howto here; it seems to be our best bet up to now. However, if you spin up a new rke2 cluster in Rancher there are many Rancher-side changes as well, e.g. NodeTemplates are dropped and things move towards the fleet-agent using "machine templates". I didn't read anything about this in your howto; will these changes be honored if we do a migration like you did?
@ArthurMcTool where/how did you get the migration-agent? It seems it doesn't compile (anymore?). Do you have the old version you used, or is there anywhere you can download the binary?
@Heiko-san - there are links above (Mar 21); search this page for "download". I don't maintain the binaries, so hopefully they still work...
Any update on this? @caroline-suse-rancher
Since my previous message seems to have been a bit too rough, here is another attempt at being constructive for those having issues in this migration.
Since we were stuck too, we decided to take another direction and move to Kubespray. A "how to" guide is available at https://github.com/cambierr/rke-to-kubespray/tree/main to do this migration without downtime.
At this point, for standalone RKE clusters, https://github.com/rancher/migration-agent achieves most of what this issue is asking for. Migrating rancher-provisioned RKE clusters to RKE2 is not really an RKE2 issue, and there is not any imminent need to migrate from RKE to RKE2, as RKE is still supported at this time.
For the record, we are not currently planning to support in-place conversion of RKE clusters to RKE2. The number of possible edge cases is too high given the wide variety of administrator customizations and user workloads. There is no way to roll back the migration should it cause problems, leading to the potential for critical-severity outages for users that attempt it.
Users should build new RKE2 clusters and migrate individual workloads over. This offers the possibility to find issues before moving into production, and fall back to the untouched existing cluster should problems arise.
This issue shall be used to track the task of researching our options for a migration or upgrade path from rancher/rke to rancher/rke2.