davidnuzik closed this issue 1 year ago.
This is really disconnected from our release cycle. At a later date, we need to decide how it should be tracked against milestones.
Apart from the meta issue of the release-cycle connection, the actual RKE->RKE2 migration path work is being done in the meantime, right? That is what mainly interests everybody; it would be great to have an update on how things are going.
Follow along: https://github.com/galal-hussein/migration-agent
Any update on the migration tool? I currently have a Rancher-provisioned rke1 cluster; would it be possible to upgrade the cluster to rke2?
Any news?
Any news?
Found this in the documentation. Hope it’s helpful. https://docs.rke2.io/migration/
The migration document is somewhat helpful. However, the image for the migration-agent tool referenced in the manifest file does not appear to exist. I opened the following issue: https://github.com/rancher/migration-agent/issues/15
Is there somewhere else to get the migration-agent binary?
Success for me (with some headaches... as expected). I took a slightly different approach and downloaded the binary directly to the nodes to run the agent.
Rough steps for me:
./migration-agent-amd64 <s3 options> --snapshot rke1snapshot.zip \
--node-name node01 \
--kubeconfig /etc/kubernetes/ssl/kubecfg-kube-controller-manager.yaml
./migration-agent-amd64 <s3 options> --snapshot rke1snapshot.zip \
--disable-etcd-restore \
--node-name node02 \
--kubeconfig /etc/kubernetes/ssl/kubecfg-kube-node.yaml
- Review the manifests directory and the /etc/rancher/rke2/config.yaml.d/10-migration.yaml file and edit as appropriate
- Create /etc/rancher/rke2/config.yaml on nodes per their roles
- Create /etc/rancher/rke2/registries.yaml on nodes (for registry mirrors)
- Install rke2 on all the nodes
- Start the rke2-server service on servers and the rke2-agent service on agents

I ran into the following issues:

- migration-agent-addons-remove.yaml had a couple issues:
  - apiVersion: rbac.authorization.k8s.io/v1beta1 is not valid (on newer k8s versions), so the yaml failed entirely (see the sketch after this comment)
  - the names were taken from rke1, so it referenced invalid configmaps (which prevented the job from running)

I have the following remaining questions:

- Is /etc/kubernetes/... still relevant?
- Is /var/lib/etcd/ still relevant?
- Is /var/lib/rancher/rke/ still relevant?
- What about old configmaps? Old jobs? Old secrets?
- Which roles are relevant now, and how do I clean them up (control-plane,controlplane,etcd,master,worker)?

For a relatively complex rke install (almost 4 years old, custom cni, custom ingress, custom kubelet args/env, etc), the process went quite smooth IMO. Especially for someone with no prior experience with rke2.
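For anyone hitting the same rbac failure mentioned above: rbac.authorization.k8s.io/v1beta1 was removed in Kubernetes 1.22, and bumping the manifest to rbac.authorization.k8s.io/v1 is the usual fix. A sketch (the kind and name here are illustrative, not the exact contents of migration-agent-addons-remove.yaml):
apiVersion: rbac.authorization.k8s.io/v1   # was rbac.authorization.k8s.io/v1beta1, removed in k8s 1.22
kind: ClusterRoleBinding                   # apply the same bump to any v1beta1 rbac object in the file
metadata:
  name: migration-agent-addons-remove      # hypothetical name for illustration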
That's great, thanks for that @travisghansen. But to my immediate question and your 2nd bullet point: if you could share with me where I can download 'migration-agent-amd64', I would really appreciate it...
@ArthurMcTool You can download the latest binary from https://github.com/rancher/migration-agent/releases/latest and the image husseingalal/migration-agent:dev from the manifest exists for me, see https://hub.docker.com/r/husseingalal/migration-agent/tags.
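For example (the asset name here is assumed from the binary name used above):
wget -O /usr/local/bin/migration-agent-amd64 https://github.com/rancher/migration-agent/releases/latest/download/migration-agent-amd64
chmod 755 /usr/local/bin/migration-agent-amd64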
Awesome, thanks!
I have an existing RKE1 cluster with three master servers and three worker (agent) nodes. I've followed @travisghansen's steps almost to a tee. The end result, once I upgrade all three master servers, is that depending on which master server I point my kubeconfig at, that server only sees itself as having been upgraded and shows the others as NotReady and at the RKE1 version.
Examples:
NAME STATUS ROLES AGE VERSION
server1 Ready control-plane,controlplane,etcd,master 72m v1.22.7+rke2r1
server2 NotReady controlplane,etcd 72m v1.21.5
server3 NotReady controlplane,etcd 72m v1.21.5
NAME STATUS ROLES AGE VERSION
server1 NotReady controlplane,etcd 72m v1.21.5
server2 Ready control-plane,controlplane,etcd,master 72m v1.22.7+rke2r1
server3 NotReady controlplane,etcd 72m v1.21.5
NAME STATUS ROLES AGE VERSION
server1 NotReady controlplane,etcd 72m v1.21.5
server2 NotReady controlplane,etcd 72m v1.21.5
server3 Ready control-plane,controlplane,etcd,master 72m v1.22.7+rke2r1
I have a different config.yaml for my primary server, and another one for the secondary servers.
Primary config.yaml:
token: "my-token"
cluster-domain: "my-rancher-domain.com"
tls-san: "kubeapi.my-rancher-domain.com"
Secondary config.yaml:
server: "https://server1:9345"
token: "my-token"
cluster-domain: "my-rancher-domain.com"
tls-san: "kubeapi.my-rancher-domain.com"
I've set up a VIP and load balancer for ports 6443/9345 (kubeapi.my-rancher-domain.com); I used that in my config.yaml file when the 'server1:9345' setting didn't seem to work.
It seems like when I run rke2-server on a master, it gets upgraded but knows little if anything about the rest of the cluster. What am I missing here?
I haven't attempted the migration with a multi-controller cluster yet, but I would probably only restore the db on the very first controller, and then join the additional controllers to that and let them sync the etcd db as if joining a brand-new node.
Maybe you did that, if so ignore the comment :)
Hey guys, so I was able to get a cluster (3 masters, 3 workers) sort of migrated from 1.21.7 to v1.22.7+rke2r1 by following @travisghansen's suggestion above to only restore the etcd database on the first master. But I've come across some other issues.
A couple of peculiarities with the workers:
I had to drain and remove the existing nodes one at a time and upgrade them, otherwise I wound up with two entries for each worker node: one with the domain name and one without. Not sure why the domain name is omitted; I guess I'll have to rename them (a sketch of the per-node cadence follows the node list below)...
Also noticed that the workers were not labelled as workers, so I had to manually label them (kubectl label node worker01 node-role.kubernetes.io/worker=true)
NAME STATUS ROLES AGE VERSION
master01.mydomain.com Ready control-plane,controlplane,etcd,master 127m v1.22.7+rke2r1
master02.mydomain.com Ready control-plane,controlplane,etcd,master 127m v1.22.7+rke2r1
master03.mydomain.com Ready control-plane,controlplane,etcd,master 127m v1.22.7+rke2r1
worker01 Ready 26m v1.22.7+rke2r1
worker02 Ready 25m v1.22.7+rke2r1
worker03 Ready 21m v1.22.7+rke2r1
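The per-node cadence referenced above, as a sketch (node names are examples; the rke2-agent install and start happens between the delete and the label):
kubectl drain worker01 --ignore-daemonsets --delete-emptydir-data   # evict workloads from the old rke1 node
kubectl delete node worker01                                        # drop the stale rke1 registration
# ...install and start rke2-agent on the node, let it rejoin...
kubectl label node worker01 node-role.kubernetes.io/worker=true     # restore the worker role label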
Then I also have a bunch of pods that are stuck pending:
k describe pods -n kube-system helm-install-rke2-canal--1-zm4hh:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12m default-scheduler 0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had taint {node-role.kubernetes.io/controlplane: true}, that the pod didn't tolerate.
And another one: k describe pods -n ingress-nginx nginx-ingress-controller-x2rtp:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 8m52s default-scheduler 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.
I'll work on uploading my notes for my manual migration here to this chat. They are an extension of what Travis provided, but quite different from those on the migration page.
If anyone has any suggestions - they would be appreciated...
Those errors seem pretty odd given the circumstances. It almost appears as if it's the old rke1 addons (which should have been removed). For example, the controlplane role was an rke1-ism... in rke2 it's control-plane.
On a fresh install of rke2, the ingress-nginx deploy also appears to be going to the kube-system ns, not ingress-nginx.
My rke1 installs had most of the addons disabled anyway, which eliminated some of these issues for sure. For what it's worth, on a fresh rke2 install the roles for the nodes are:
- server: control-plane,etcd,master
- agent: <none> (as you might expect, no labels are added in this case)
Thanks @travisghansen, we don't use the addons in our RKE1 clusters. For instance, k get addons -A
returns nothing. During the migration (rke2-server startup) I do see it executing the manifest to remove addons but it provides a warning or error that it can't find the addons file in /etc/rke_addon/... Is the migration tool still in development? Has anyone been able to successfully migrate a cluster - something with multiple masters and workers?
thanks..
(Curious) given the output in my previous post, what kind of state or condition is my cluster actually in? How can I manage these pending pods?
I'm referring to this (not some sort of resource called addons):
Hello again,
So I've added the following to my /etc/rancher/rke2/config.yaml in order to disable addons:
disable: rke2-canal
disable: rke2-canal-config
disable: rke2-coredns
disable: rke2-coredns-config
disable: rke2-ingress-nginx
disable: rke2-ingress-nginx-config
disable: rke2-metrics-server
disable: rke2-metrics-server-config
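One thing worth flagging about the snippet above: repeated keys in a YAML mapping override one another, so only the last disable line would normally take effect. rke2 accepts repeatable flags as a YAML list, which is probably what was intended; a sketch with the chart names from above (I'm not sure the -config entries are separate charts):
disable:
  - rke2-canal
  - rke2-coredns
  - rke2-ingress-nginx
  - rke2-metrics-server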
I did all the other things mentioned earlier in order to migrate, then I ran rke2-server on the first master node and still got stuck with all this:
k get pods -A | grep -vi running | grep -vi completed
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system helm-install-rke2-canal--1-vljg2 0/1 CrashLoopBackOff 13 (88s ago) 21m
kube-system helm-install-rke2-coredns--1-fmwnh 0/1 CrashLoopBackOff 13 (91s ago) 21m
kube-system helm-install-rke2-ingress-nginx--1-zz9qb 0/1 Terminating 0 21m
kube-system helm-install-rke2-metrics-server--1-p95jn 0/1 CrashLoopBackOff 6 (79s ago) 7m28s
kube-system rke2-ingress-nginx-controller-d6hrx 0/1 Pending 0 15m
I can't see how this migration tool is ready. Our organisation has a lot invested in Rancher and we need a migration path. Can any of the developers comment on the status of this migration tool project, what has been tested, what works/doesn't, etc?
thanks...
@galal-hussein anything additional we need to add to this before closing it?
The tool has too many rough edges IMO to be considered ready for business use. Perhaps it's as good as it's going to get, though.
I found a bug in the migration tool for kubelet args:
The following configuration from my rke1 cluster.yaml
services:
  kubelet:
    extra_args:
      resolv-conf: "/run/resolvconf/resolv.conf" # for systemd-resolved
      max-pods: 150
      enforce-node-allocatable: "pods"
      system-reserved: "cpu=150m,memory=150Mi,ephemeral-storage=1Gi"
      kube-reserved: "cpu=150m,memory=150Mi,ephemeral-storage=1Gi"
      eviction-hard: "memory.available<500Mi,nodefs.available<10%"
Turns into
"kubelet-arg":"enforce-node-allocatable=pods,eviction-hard=memory.available\u003c500Mi,nodefs.available\u003c10%,kube-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi,max-pods=150,resolv-conf=/run/resolvconf/resolv.conf,system-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi"
But I had to modify it to be:
"kubelet-arg":["enforce-node-allocatable=pods","eviction-hard=memory.available<500Mi,nodefs.available<10%","kube-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi","max-pods=150","resolv-conf=/run/resolvconf/resolv.conf","system-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi"]
It did not make the options an array, and it escaped eviction-hard=memory.available<500Mi,nodefs.available<10% to eviction-hard=memory.available\u003c500Mi,nodefs.available\u003c10%
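rke2 also accepts kubelet-arg as a YAML list, so the corrected value can be written in /etc/rancher/rke2/config.yaml like this (a sketch with the same values):
kubelet-arg:
  - enforce-node-allocatable=pods
  - eviction-hard=memory.available<500Mi,nodefs.available<10%
  - kube-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi
  - max-pods=150
  - resolv-conf=/run/resolvconf/resolv.conf
  - system-reserved=cpu=150m,memory=150Mi,ephemeral-storage=1Gi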
After fixing that, I'm getting the following error:
Jun 23 10:12:12 k8s-master1 rke2[8709]: time="2022-06-23T10:12:12-04:00" level=info msg="Reconciling bootstrap data between datastore and disk"
Jun 23 10:12:12 k8s-master1 rke2[8709]: time="2022-06-23T10:12:12-04:00" level=fatal msg="Failed to reconcile with temporary etcd: bootstrap data already found and encrypted with different token"
I have an 8-worker cluster with 3 controllers. I start by stopping docker on the 3 controllers. Then on the first controller I run the migration tool, fix the config error above, and start rke2. I run into this issue every time.
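For what it's worth, that fatal usually indicates the embedded etcd already holds bootstrap data written with a different token, e.g. from an earlier failed start. One approach that has worked elsewhere (an assumption on my part, not verified against this exact setup) is to wipe the rke2 server datastore on that controller and re-run the restore so it bootstraps with the current token:
systemctl stop rke2-server
rke2-killall.sh                          # installed alongside rke2; stops leftover processes
rm -rf /var/lib/rancher/rke2/server/db   # removes this node's embedded etcd datastore only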
I had meant to post this earlier, hopefully it will help someone...
This ticket will be used to upgrade our three master, 7 worker (Agent) cluster from:
It is unfortunate that the migration tool only copies a few config files. The bulk of the heavy lifting has to be done manually. I've run through this 100 or more times in order to find all the little things that we needed to do to successfully migrate our clusters.
There are many things (CRDs, service accounts, certain deployments) that need to be removed from your RKE1 cluster first, otherwise the migration fails.
Remove all this from the RKE1 Cluster:
k delete daemonset -n kube-system canal
k delete crd bgpconfigurations.crd.projectcalico.org
k delete crd bgppeers.crd.projectcalico.org
k delete crd blockaffinities.crd.projectcalico.org
k delete crd clusterinformations.crd.projectcalico.org
k delete crd felixconfigurations.crd.projectcalico.org
k delete crd globalnetworkpolicies.crd.projectcalico.org
k delete crd globalnetworksets.crd.projectcalico.org
k delete crd hostendpoints.crd.projectcalico.org
k delete crd ipamblocks.crd.projectcalico.org
k delete crd ipamconfigs.crd.projectcalico.org
k delete crd ipamhandles.crd.projectcalico.org
k delete crd ippools.crd.projectcalico.org
k delete crd kubecontrollersconfigurations.crd.projectcalico.org
k delete crd networkpolicies.crd.projectcalico.org
k delete crd networksets.crd.projectcalico.org
k delete deployments -n kube-system calico-kube-controllers
k delete deployments -n kube-system coredns
k delete deployments -n kube-system coredns-autoscaler
k delete deployments -n kube-system metrics-server
### CLUSTER SPECIFIC:
k delete pods -n kube-system rke-coredns-addon-deploy-job-4chs4
k delete pods -n kube-system rke-ingress-controller-deploy-job-xqw8f
k delete pods -n kube-system rke-metrics-addon-deploy-job-rf78v
k delete pods -n kube-system rke-network-plugin-deploy-job-9hjtz
k delete serviceaccounts -n kube-system canal
k delete serviceaccounts -n kube-system coredns
k delete clusterrole flannel
k delete clusterrolebinding canal-flannel
k delete clusterrolebinding canal-calico
k delete apiservice v1beta1.metrics.k8s.io
k delete ns ingress-nginx
From your workstation and using the correct version of RKE1 and kube config for this cluster, create the etcd backup:
../rke_1.3.1 etcd snapshot-save --config my-cluster.yml --name my-cluster
Download migration-agent to all cluster nodes directly:
wget -O /usr/local/bin/migration-agent-amd64 http://internal-server/artifacts/RKE/migration-agent-amd64
chmod 755 /usr/local/bin/migration-agent-amd64
Run migration agent on primary master node only:
/usr/local/bin/migration-agent-amd64 --snapshot /opt/rke/etcd-snapshots/my-cluster.zip \
--node-name master1.mydomain.com \
--kubeconfig /etc/kubernetes/ssl/kubecfg-kube-controller-manager.yaml
Run migration agent on all secondary master nodes individually (change the node name as you go):
/usr/local/bin/migration-agent-amd64 --snapshot /opt/rke/etcd-snapshots/my-cluster.zip \
--disable-etcd-restore \
--node-name master[2,3].mydomain.com \
--kubeconfig /etc/kubernetes/ssl/kubecfg-kube-controller-manager.yaml
If there are any workers, on the primary master node temporarily modify the permissions on the backup file:
chmod 644 /opt/rke/etcd-snapshots/my-cluster.zip
If there are any workers, create the snapshot directory on them and then scp the backup file from the primary master to each of the workers:
mkdir -p /opt/rke/etcd-snapshots
scp ${user}@master1:/opt/rke/etcd-snapshots/my-cluster.zip /opt/rke/etcd-snapshots/
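With several workers, a small loop run from the primary master does the same thing (assumes ${user} is set and ssh keys are in place):
for w in worker1 worker2 worker3; do
  ssh "${user}@${w}" mkdir -p /opt/rke/etcd-snapshots
  scp /opt/rke/etcd-snapshots/my-cluster.zip "${user}@${w}:/opt/rke/etcd-snapshots/"
done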
On ALL nodes:
chmod 600 /opt/rke/etcd-snapshots/*
ls -lah /opt/rke/etcd-snapshots/
Run the migration agent on each of the workers individually (change the node name as you go):
/usr/local/bin/migration-agent-amd64 --snapshot /opt/rke/etcd-snapshots/my-cluster.zip \
--disable-etcd-restore \
--node-name worker[1,2,3,4,5,6,7].mydomain.com \
--kubeconfig /etc/kubernetes/ssl/kubecfg-kube-node.yaml
Add additional firewall rules to all nodes if the node is using iptables and not firewalld:
iptables -D INPUT -j LOG_DROP
iptables -A INPUT -p tcp -m tcp --dport 179 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 5473 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 9098 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 9345 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 30000:32767 -j ACCEPT
iptables -A INPUT -p udp -m udp --dport 51820 -j ACCEPT
iptables -A INPUT -p udp -m udp --dport 51821 -j ACCEPT
iptables -A INPUT -j LOG_DROP
wget -O /etc/iptables/rules.v4 http://internal-server/artifacts/rules.v4
Create the rancher config directory on all nodes:
mkdir -p /etc/rancher/rke2/
Get the RKE2 Binary for all nodes:
TMP_TARBALL="/tmp/rke2-1.22.7+rke2r1.tar.gz"
wget -O "${TMP_TARBALL}" http://internal-server/artifacts/RKE/rke2-1.22.7+rke2r1.tar.gz
tar xzf "${TMP_TARBALL}" -C /usr/local
mv -f /usr/local/lib/systemd/system/rke2-*.service /etc/systemd/system/
systemctl daemon-reload
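A quick sanity check that the unpack worked (the tarball places the binary under /usr/local/bin):
/usr/local/bin/rke2 --version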
Edit the rancher config.yaml on the primary master node. (NOTE ${TOKEN} - create your own secret token):
echo -e "token: \"${TOKEN}\"\ncluster-domain: \"my-cluster-domain.com\"\ntls-san: \"kubeapi.my-cluster-domain.com\"" > /etc/rancher/rke2/config.yaml
Edit the rancher config.yaml on all secondary master nodes:
echo -e "server: \"https://kubeapi.my-cluster-domain.com:9345\"\ntoken: \"${TOKEN}\"\ncluster-domain: \"my-cluster-domain.com\"\ntls-san: \"kubeapi.my-cluster-domain.com\"" > /etc/rancher/rke2/config.yaml
Edit the rancher config.yaml on all worker/agent nodes:
echo -e "server: \"https://kubeapi.my-cluster-domain.com:9345\"\ntoken: \"${TOKEN}\"\ncluster-domain: \"my-cluster-domain.com\"\ntls-san: \"kubeapi.my-cluster-domain.com\"\nnode-name: \"`hostname -A | cut -d " " -f1`\"" > /etc/rancher/rke2/config.yaml
Check the config files:
cat /etc/rancher/rke2/config.yaml
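As a reference point, the rendered file on a secondary master should look like this (values taken from the echo commands above; the token is whatever you set ${TOKEN} to):
server: "https://kubeapi.my-cluster-domain.com:9345"
token: "my-secret-token"
cluster-domain: "my-cluster-domain.com"
tls-san: "kubeapi.my-cluster-domain.com"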
On the master nodes, edit /etc/rancher/rke2/config.yaml.d/10-migration.yaml and change the kube-dns IP as follows. Yours might be slightly different; the point is to change the host portion of the IP to something different from the original:
sed -i 's/10.233.30.10/10.233.30.11/g' /etc/rancher/rke2/config.yaml.d/10-migration.yaml
Start the upgrade on the primary master:
systemctl disable docker
systemctl disable docker.socket
systemctl stop docker
systemctl stop docker.socket
reboot
rm -Rf /var/lib/docker
systemctl enable rke2-server.service
systemctl start rke2-server.service
You might need to change the server line in your kube config to point to the IP address of the primary master. Once the upgrade is complete you can change it to the VIP IP.
Start the upgrade on each secondary master individually, waiting for it to finish before moving on to the next node:
systemctl disable docker
systemctl disable docker.socket
systemctl stop docker
systemctl stop docker.socket
reboot
rm -Rf /var/lib/docker
systemctl enable rke2-server.service
systemctl start rke2-server.service
Remove the old controlplane label from the master nodes:
kubectl label node master1.mydomain.com node-role.kubernetes.io/controlplane-
kubectl label node master2.mydomain.com node-role.kubernetes.io/controlplane-
kubectl label node master3.mydomain.com node-role.kubernetes.io/controlplane-
let it smoke for a while...
To get rke-canal and rke-coredns install job pods to run, remove these taints from a master node:
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/etcd-
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/controlplane-
Once the kube is 99% stable, re-apply the taints:
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/etcd=true:NoExecute
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/controlplane=true:NoSchedule
Once the cluster is in a good state, start the upgrade on the worker nodes individually. If your agent nodes are also your masters, then skip this step:
systemctl disable docker
systemctl disable docker.socket
systemctl stop docker
systemctl stop docker.socket
reboot
rm -Rf /var/lib/docker
systemctl enable rke2-agent.service
systemctl start rke2-agent.service
Get a list of completed jobs to delete (if any):
k get pods -A | grep 0/ | awk '{$1=$1};1' | cut -d " " -f 1,2
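To actually delete them, something like this works (a sketch; review the list first, since 0/ also matches pods that are merely not ready):
k get pods -A | grep 0/ | awk '{print $1, $2}' | while read -r ns pod; do
  k delete pod -n "$ns" "$pod"
done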
Create a backup tgz file of the /etc/rancher/rke2 directory from the primary master:
CLUSTERBASE=my-cluster
KUBEAPIURL="kubeapi.my-cluster-domain.com"
cp /etc/rancher/rke2/rke2.yaml /etc/rancher/rke2/kube_config
sed -i "s/127.0.0.1/${KUBEAPIURL}/g" /etc/rancher/rke2/kube_config
tar -cvzf /etc/rancher/rke2/$CLUSTERBASE-cluster.tgz -C /etc/rancher/rke2 config.yaml kube_config config.yaml.d
SCP that file to your local workstation and store it safely.
@ArthurMcTool Thanks for your detail! I'll give this a try.
Almost seems as though I should just remove the rke1 items like you talked about, take a velero backup, and restore to a rke2 cluster.
I guess I got the "(NOTE ${TOKEN} - create your own secret token)" part wrong, as I have the master node Ready and both secondary masters NotReady:
NAME STATUS ROLES AGE VERSION
docker01.xxx.it Ready worker 329d v1.22.9
docker02.xxx.it Ready worker 329d v1.22.9
docker03.xxx.it Ready worker 329d v1.22.9
docker04.xxx.it NotReady etcd 129d v1.22.9
docker05.xxx.it Ready control-plane,etcd,master 129d v1.23.8+rke2r1
docker06.xxx.it NotReady etcd 129d v1.22.9
Ok, I figured it out: I'd missed an argument in the migration command (--disable-etcd-restore). Now this is the situation:
NAME STATUS ROLES AGE VERSION
docker01.xxx.it Ready worker 20m v1.23.8+rke2r1
docker02.xxx.it Ready worker 330d v1.23.8+rke2r1
docker03.xxx.it Ready worker 330d v1.23.8+rke2r1
docker04.xxx.it Ready control-plane,etcd,master 129d v1.23.8+rke2r1
docker05.xxx.it Ready control-plane,etcd,master 129d v1.23.8+rke2r1
docker06.xxx.it Ready control-plane,etcd,master 129d v1.23.8+rke2r1
But now the problem is that almost nothing is starting properly:
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-fleet-local-system fleet-agent-6fc847b9dd-kclhl 0/1 ContainerCreating 0 26m
cattle-fleet-system fleet-controller-7bbcb965f9-mwpwx 0/1 ContainerCreating 0 26m
cattle-fleet-system gitjob-55448cdfd7-8wnp2 0/1 ContainerCreating 0 26m
cattle-resources-system rancher-backup-6968b9cb8f-bk56w 0/1 ContainerCreating 0 18h
cattle-system helm-operation-l6mks 0/2 Completed 1 22h
cattle-system rancher-5677f59677-4x4b6 0/1 ContainerCreating 0 26m
cattle-system rancher-5677f59677-ddzrh 0/1 Error 2 45h
cattle-system rancher-5677f59677-qvmt5 0/1 ContainerCreating 2 40h
cattle-system rancher-webhook-675bccfc59-cp78h 0/1 ContainerCreating 0 26m
cert-manager cert-manager-6d6bb4f487-xkdd9 0/1 ContainerCreating 0 18h
cert-manager cert-manager-cainjector-7d55bf8f78-4tjxt 0/1 ContainerCreating 0 18h
cert-manager cert-manager-webhook-577f77586f-2pgdm 0/1 ContainerCreating 0 18h
collabora collabora-muflo-5c6bc75746-d77mg 0/1 ContainerCreating 0 62m
default dc-5cb75d894b-lvzgn 0/1 ContainerCreating 1 37h
guacamole db-685c8bd744-mw9tc 0/1 ContainerCreating 1 37h
guacamole guacamole-594d4cb8d8-9n6nd 0/1 ContainerCreating 1 37h
guacamole guacd-869cdd66c4-dcdwx 0/1 ContainerCreating 1 37h
ingress-nginx nginx-ingress-controller-vj4l6 0/1 ContainerCreating 0 17m
ingress-nginx nginx-ingress-controller-xbn6p 0/1 ContainerCreating 7 (40h ago) 134d
ingress-nginx nginx-ingress-controller-z9tcb 0/1 ContainerCreating 9 (40h ago) 134d
jenkins jenkins-79cd6dc9c6-tfvhg 0/1 ContainerCreating 1 37h
kube-system cloud-controller-manager-docker04.mufloland.it 1/1 Running 0 20m
kube-system cloud-controller-manager-docker05.mufloland.it 1/1 Running 2 (18h ago) 19h
kube-system cloud-controller-manager-docker06.mufloland.it 1/1 Running 0 6m38s
kube-system coredns-8578b6dbdd-bc8rl 0/1 ContainerCreating 6 (40h ago) 134d
kube-system coredns-8578b6dbdd-wtl5p 0/1 ContainerCreating 1 37h
kube-system coredns-autoscaler-79dcc864f5-2wggr 0/1 ContainerCreating 6 (40h ago) 134d
kube-system etcd-docker04.mufloland.it 1/1 Running 0 19m
kube-system etcd-docker05.mufloland.it 1/1 Running 1 (18h ago) 19h
kube-system etcd-docker06.mufloland.it 1/1 Running 0 5m55s
kube-system helm-install-rke2-canal-qfwxq 0/1 Pending 0 19h
kube-system helm-install-rke2-coredns-rfbzl 0/1 Pending 0 19h
kube-system helm-install-rke2-ingress-nginx-vvw56 0/1 ContainerCreating 0 18h
kube-system helm-install-rke2-metrics-server-5lkg9 0/1 ContainerCreating 0 19m
kube-system kube-apiserver-docker04.mufloland.it 1/1 Running 0 19m
kube-system kube-apiserver-docker05.mufloland.it 1/1 Running 1 (18h ago) 19h
kube-system kube-apiserver-docker06.mufloland.it 1/1 Running 0 6m40s
kube-system kube-controller-manager-docker04.mufloland.it 1/1 Running 0 20m
kube-system kube-controller-manager-docker05.mufloland.it 1/1 Running 2 (18h ago) 19h
kube-system kube-controller-manager-docker06.mufloland.it 1/1 Running 0 6m38s
kube-system kube-proxy-docker01.mufloland.it 1/1 Running 0 18m
kube-system kube-proxy-docker02.mufloland.it 1/1 Running 0 60m
kube-system kube-proxy-docker03.mufloland.it 1/1 Running 0 49m
kube-system kube-proxy-docker04.mufloland.it 1/1 Running 0 19m
kube-system kube-proxy-docker05.mufloland.it 1/1 Running 1 (18h ago) 19h
kube-system kube-proxy-docker06.mufloland.it 1/1 Running 0 6m30s
kube-system kube-scheduler-docker04.mufloland.it 1/1 Running 0 20m
kube-system kube-scheduler-docker05.mufloland.it 1/1 Running 1 (18h ago) 19h
kube-system kube-scheduler-docker06.mufloland.it 1/1 Running 0 6m38s
kube-system metrics-server-6bc7854fb5-cmrfm 0/1 ContainerCreating 14 (40h ago) 199d
kube-system weave-net-4pv6g 3/3 Running 6 (13m ago) 134d
kube-system weave-net-9qnwq 3/3 Running 0 18m
kube-system weave-net-f5wfn 3/3 Running 10 (19m ago) 129d
kube-system weave-net-gmgnc 1/3 CrashLoopBackOff 27 (2m15s ago) 129d
kube-system weave-net-gsx9v 3/3 Running 3 (18h ago) 129d
kube-system weave-net-kf4f7 3/3 Running 3 14d
longhorn-system csi-attacher-5ddf9c48cf-6kt9x 0/1 ContainerCreating 9 (19h ago) 45h
longhorn-system csi-attacher-5ddf9c48cf-mdmkg 0/1 ContainerCreating 0 26m
longhorn-system csi-attacher-5ddf9c48cf-rjf6p 0/1 ContainerCreating 1 (40h ago) 40h
longhorn-system csi-provisioner-59b7b8b7b8-jbvkf 0/1 ContainerCreating 0 26m
longhorn-system csi-provisioner-59b7b8b7b8-jxg77 0/1 Error 12 (18h ago) 45h
longhorn-system csi-provisioner-59b7b8b7b8-v9grf 0/1 ContainerCreating 0 26m
longhorn-system csi-resizer-68ccff94-7fcqw 0/1 ContainerCreating 7 (19h ago) 45h
longhorn-system csi-resizer-68ccff94-kr4vl 0/1 ContainerCreating 0 26m
longhorn-system csi-resizer-68ccff94-krpft 0/1 ContainerCreating 0 26m
longhorn-system csi-snapshotter-6d7d679c98-hsmkf 0/1 ContainerCreating 9 (18h ago) 45h
longhorn-system csi-snapshotter-6d7d679c98-mbw4j 0/1 ContainerCreating 0 26m
longhorn-system csi-snapshotter-6d7d679c98-nxft7 0/1 ContainerCreating 0 26m
longhorn-system engine-image-ei-0422ab0c-8b47t 0/1 ContainerCreating 0 18m
longhorn-system engine-image-ei-0422ab0c-h5dr4 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-0422ab0c-w7wxn 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-47b59147-5mmkn 0/1 ContainerCreating 0 18m
longhorn-system engine-image-ei-47b59147-kqxkd 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-47b59147-qfx67 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-d474e07c-fzg9c 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-d474e07c-hfqxg 0/1 ContainerCreating 0 18m
longhorn-system engine-image-ei-d474e07c-v5d8d 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-edd4cae3-2s5c6 0/1 ContainerCreating 0 18m
longhorn-system engine-image-ei-edd4cae3-flsdj 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system engine-image-ei-edd4cae3-j948g 0/1 ContainerCreating 2 (40h ago) 45h
longhorn-system instance-manager-e-3ca11cf1 0/1 ContainerStatusUnknown 1 40h
longhorn-system instance-manager-r-29d5f090 0/1 ContainerStatusUnknown 1 40h
longhorn-system longhorn-admission-webhook-744ddbbbd8-7svhr 0/1 Init:0/1 0 26m
longhorn-system longhorn-admission-webhook-744ddbbbd8-s6hnc 0/1 PodInitializing 2 (40h ago) 2d3h
longhorn-system longhorn-conversion-webhook-9864564d8-4mggz 0/1 ContainerCreating 5 (40h ago) 47h
longhorn-system longhorn-conversion-webhook-9864564d8-r62rs 0/1 ContainerCreating 0 26m
longhorn-system longhorn-csi-plugin-9nk5s 0/2 ContainerCreating 7 (40h ago) 45h
longhorn-system longhorn-csi-plugin-fdpmn 0/2 ContainerCreating 0 18m
longhorn-system longhorn-csi-plugin-qbs5f 0/2 ContainerCreating 5 (40h ago) 45h
longhorn-system longhorn-driver-deployer-7d4d6d6cb-l7lh8 0/1 PodInitializing 3 (40h ago) 2d3h
longhorn-system longhorn-manager-58mxv 0/1 Init:0/1 0 18m
longhorn-system longhorn-manager-8vnxk 0/1 PodInitializing 2 (40h ago) 45h
longhorn-system longhorn-manager-l4t58 0/1 PodInitializing 2 (40h ago) 45h
longhorn-system longhorn-ui-75646c6c6f-9wdr6 0/1 ContainerCreating 4 (40h ago) 2d3h
registry registry-64c5db47fc-zn7fm 0/1 ContainerCreating 1 37h
Any suggestions?
You should probably run kubectl describe on those pods and see why they are stuck. Might give you an idea why they are not starting.
Almost there but still no luck. Canal gets deployed once I delete the weave-net plugin, of course! But nginx and coredns still won't start:
NAME READY STATUS RESTARTS AGE
cloud-controller-manager-docker04.mufloland.it 1/1 Running 10 (33m ago) 30h
cloud-controller-manager-docker05.mufloland.it 1/1 Running 12 (33m ago) 2d2h
cloud-controller-manager-docker06.mufloland.it 1/1 Running 10 (33m ago) 30h
etcd-docker04.mufloland.it 1/1 Running 4 (34m ago) 30h
etcd-docker05.mufloland.it 1/1 Running 5 (34m ago) 2d2h
etcd-docker06.mufloland.it 1/1 Running 4 (34m ago) 30h
helm-install-rke2-canal-tvxf4 0/1 Completed 0 4h31m
helm-install-rke2-coredns-r27hw 0/1 CrashLoopBackOff 3 (18s ago) 79s
helm-install-rke2-ingress-nginx-dnwv6 0/1 CrashLoopBackOff 5 (29s ago) 3m46s
helm-install-rke2-metrics-server-k4rk9 0/1 Completed 0 4h25m
kube-apiserver-docker04.mufloland.it 1/1 Running 5 (33m ago) 30h
kube-apiserver-docker05.mufloland.it 1/1 Running 6 (33m ago) 2d2h
kube-apiserver-docker06.mufloland.it 1/1 Running 6 (33m ago) 30h
kube-controller-manager-docker04.mufloland.it 1/1 Running 10 (33m ago) 30h
kube-controller-manager-docker05.mufloland.it 1/1 Running 11 (33m ago) 2d2h
kube-controller-manager-docker06.mufloland.it 1/1 Running 10 (33m ago) 30h
kube-proxy-docker01.mufloland.it 1/1 Running 2 (6m51s ago) 30h
kube-proxy-docker02.mufloland.it 1/1 Running 0 26m
kube-proxy-docker03.mufloland.it 1/1 Running 1 (30m ago) 3h16m
kube-proxy-docker04.mufloland.it 1/1 Running 4 (34m ago) 30h
kube-proxy-docker05.mufloland.it 1/1 Running 5 (34m ago) 2d2h
kube-proxy-docker06.mufloland.it 1/1 Running 4 (34m ago) 30h
kube-scheduler-docker04.mufloland.it 1/1 Running 5 (34m ago) 30h
kube-scheduler-docker05.mufloland.it 1/1 Running 5 (34m ago) 2d2h
kube-scheduler-docker06.mufloland.it 1/1 Running 4 (34m ago) 30h
rke2-canal-79dsh 2/2 Running 2 (6m51s ago) 42m
rke2-canal-8c94w 2/2 Running 4 (26m ago) 42m
rke2-canal-bg289 2/2 Running 2 (30m ago) 42m
rke2-canal-fk4qt 2/2 Running 2 (34m ago) 42m
rke2-canal-j5l5v 2/2 Running 2 (34m ago) 42m
rke2-canal-s6zqm 2/2 Running 2 (34m ago) 42m
rke2-coredns-rke2-coredns-545d64676-7p29k 0/1 Running 0 17s
rke2-coredns-rke2-coredns-545d64676-ls8hv 0/1 Running 0 15s
rke2-coredns-rke2-coredns-autoscaler-6bf4775c97-crszm 1/1 Running 0 17s
rke2-metrics-server-6564db4569-lms4p 1/1 Running 1 (30m ago) 40m
coredns
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m59s default-scheduler Successfully assigned kube-system/helm-install-rke2-coredns-r27hw to docker06.mufloland.it
Normal Pulling 2m59s kubelet Pulling image "rancher/klipper-helm:v0.7.3-build20220613"
Normal Pulled 2m51s kubelet Successfully pulled image "rancher/klipper-helm:v0.7.3-build20220613" in 8.398116625s
Normal Created 64s (x5 over 2m50s) kubelet Created container helm
Normal Started 64s (x5 over 2m50s) kubelet Started container helm
Normal Pulled 64s (x4 over 2m48s) kubelet Container image "rancher/klipper-helm:v0.7.3-build20220613" already present on machine
Warning BackOff 51s (x9 over 2m45s) kubelet Back-off restarting failed container
ingress-nginx
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m default-scheduler Successfully assigned kube-system/helm-install-rke2-ingress-nginx-dnwv6 to docker03.mufloland.it
Normal Pulling 5m59s kubelet Pulling image "rancher/klipper-helm:v0.7.3-build20220613"
Normal Pulled 5m52s kubelet Successfully pulled image "rancher/klipper-helm:v0.7.3-build20220613" in 7.638299786s
Normal Created 4m18s (x5 over 5m52s) kubelet Created container helm
Normal Started 4m18s (x5 over 5m51s) kubelet Started container helm
Normal Pulled 4m18s (x4 over 5m49s) kubelet Container image "rancher/klipper-helm:v0.7.3-build20220613" already present on machine
Warning BackOff 48s (x23 over 5m46s) kubelet Back-off restarting failed container
I think in order to get those pods going (the ones starting with helm-install) you'll need to remove the taints from your masters so they are schedulable. You can check the pod logs to make sure, but I ran into that before. Then once the pods run, some of the other parts should fall into place. Also don't forget to re-apply the taints afterwards to make the masters un-schedulable again. Something like this:
To get rke-canal and rke-coredns install job pods to run, remove these taints from a master node:
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/etcd-
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/controlplane-
Once the kube is 99% stable, re-apply the taints:
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/etcd=true:NoExecute
kubectl taint nodes master1.mydomain.com node-role.kubernetes.io/controlplane=true:NoSchedule
Thanks for the help! rke-canal gets installed just fine; the rke-coredns and rke2-ingress-nginx helm installs just continuously crash. I've tried removing all the taints from all the master nodes and even leaving the taints there: same behaviour. Should I remove the taints only from one of the three master nodes and leave the taints on the others?
Well, this is interesting:
Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: IngressClass "nginx" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "rke2-ingress-nginx"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "kube-system"
+ exit
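There are two common ways out of that Helm error: delete the pre-existing IngressClass and let the chart recreate it, or add the ownership metadata Helm is asking for so the existing resource gets adopted into the release (a sketch built from the error text above):
kubectl label ingressclass nginx app.kubernetes.io/managed-by=Helm --overwrite
kubectl annotate ingressclass nginx meta.helm.sh/release-name=rke2-ingress-nginx --overwrite
kubectl annotate ingressclass nginx meta.helm.sh/release-namespace=kube-system --overwrite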
Did you run the migration yaml to remove all the rke1 addons entirely first?
I found that I had to manually remove the old RKE1 components. In my instructions there's a section there titled: "Remove all this from the RKE1 Cluster". All of those things need to be removed and the final step is to remove the ingress-nginx namespace. Also note that within that section there's a subsection titled "CLUSTER SPECIFIC", the names of those pods will be indicative of your cluster...
Now all nodes are up & running just fine. But it's still applying the manifests for canal, metrics and coredns in an infinite loop. Should I manually remove the manifests in /var/lib/rancher/rke2/server/manifests/? Also, I'm getting a weird error from one of my pods about a failing volume mount (from longhorn):
MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
EDIT: that error was my fault; I had to bind-mount the docker socket to get the docker plugin in Jenkins working!
No way I can get rid of this "migration-agent-addons-remove". I removed the job, it just pops up again infinitely; removing it from /var/lib/rancher/rke2/server/manifests/ on the 3 master nodes and rebooting them has no effect. I removed it from the Rancher UI in the addons section. It still pops up and fails forever!
Ya, you're right. Even with the pods started properly, I still could not get rid of the migration-agent-addons-remove errors either...
Remove the file from the manifests directory on all controllers. Then kubectl delete the file from your workstation.
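Concretely, something like this (paths as used earlier in the thread; you need a local copy of the manifest for the delete):
# on each server/controller node:
rm /var/lib/rancher/rke2/server/manifests/migration-agent-addons-remove.yaml
# then, from a workstation with a copy of the manifest:
kubectl delete -f migration-agent-addons-remove.yaml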
Ouch. If CRDs are unable to be migrated, then you might as well do a clean reinstall.
Hi,
when searching for a way to migrate a Rancher RKE1 cluster to RKE2, this is what I found. Isn't there any kind of "official documentation" on this topic? What about the migration-agent (https://github.com/galal-hussein/migration-agent) used here? It seems there is no documentation for it either; where did you find info on how to use it? And it seems it has not been updated for a very long time; is it still "recommended" to use it?
Unfortunately this has also been a sore spot and disappointment for me and our team. Documentation is sketchy at best, and I feel Rancher has completely dropped the ball on the RKE2 migration. We learnt to upgrade to RKE2 simply by brute force, trying again and again. The migration-agent seems to only reconfigure some of your config files and does not actually perform a migration. Spinning up RKE2 stalls every time it comes across a resource that already exists, so it's up to you to remove all that stuff beforehand. The notes above are tedious, but they do seem to work. I hope the Rancher team will have a better approach for RKE3...
Thanks for your answer. I actually saw your howto here; it seems to be our best bet up to now. However, if you spin up a new rke2 cluster in Rancher there are many Rancher-side changes as well, e.g. NodeTemplates are dropped and things move towards the fleet-agent using "machine templates". I didn't read anything about this in your howto; will these changes be honored if we do a migration like you did?
@ArthurMcTool where/how did you get the migration-agent? It seems it doesn't compile (anymore?). Do you have the old version you used, or is there anywhere you can download the binary?
@Heiko-san - there are links above (Mar 21); search this page for "download". I don't maintain the binaries, so hopefully they still work...
Any update on this? @caroline-suse-rancher
Since my previous message seems to have been a bit too rough, here is another attempt at being constructive for those having issues in this migration.
Since we were stuck too, we decided to take another direction and move to Kubespray. A "how to" guide is available at https://github.com/cambierr/rke-to-kubespray/tree/main to do this migration without downtime.
At this point, for standalone RKE clusters, https://github.com/rancher/migration-agent achieves most of what this issue is asking for. Migrating rancher-provisioned RKE clusters to RKE2 is not really an RKE2 issue, and there is not any imminent need to migrate from RKE to RKE2, as RKE is still supported at this time.
For the record, we are not currently planning to support in-place conversion of RKE clusters to RKE2. The number of possible edge cases is too high given the wide variety of administrator customizations and user workloads. There is no way to roll back the migration should it cause problems, leading to the potential for critical-severity outages for users that attempt it.
Users should build new RKE2 clusters and migrate individual workloads over. This offers the possibility to find issues before moving into production, and fall back to the untouched existing cluster should problems arise.
This issue shall be used to track the task of researching our options for a migration or upgrade path from rancher/rke to rancher/rke2.