Closed by agilob 3 years ago
Removing a Kubernetes node using kubectl is not supposed to clean up the files generated by k3s; to fully uninstall k3s from a node you might want to use the /usr/local/bin/k3s-uninstall.sh script that should be installed on the system.
To improve the user experience, should kubectl remove the hostname:password entry from /var/lib/rancher/k3s/server/cred/node-passwd when a node is deleted? As it was my first time with k3s, it took me a while to figure out where the password is stored and why it's not removed. I'm happy to close this if you disagree; at least it will be of some help to other users.
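For anyone hitting this, a quick way to confirm the symptom is to check whether the deleted node's hostname still appears in the credential file. A hedged sketch, assuming the pre-1.20 file-based node-passwd store described in this thread; NODE is a placeholder hostname, and on a real server this needs root:

```shell
# Sketch: check whether a deleted node still has a stale entry in the
# file-based node-passwd store (path from this thread).
NODE="${NODE:-black-pearl}"   # placeholder: the hostname you deleted
PASSWD="${PASSWD:-/var/lib/rancher/k3s/server/cred/node-passwd}"
if grep -q "$NODE" "$PASSWD" 2>/dev/null; then
  echo "stale node-passwd entry for $NODE still present"
fi
```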
It probably should, we are cleaning up CoreDNS hosts entry here: https://github.com/rancher/k3s/blob/36ca6060733725953b7a4cd2b53a295d11aea684/pkg/node/controller.go#L36
The issue isn't with cleaning up the node, it is with cleaning up node-passwd on the server.
In my case, I uninstalled (via script) and removed the node via kubectl. Then upon reinstall this issue popped up.
Uninstalling again and then removing the entry from {data-dir}/server/cred/node-passwd (default /var/lib/rancher/k3s/server/cred/node-passwd) worked for me.
@ibuildthecloud I ran into this issue too, and it was really confusing.
Uninstalling k3s-agent and reinstalling had no effect. Eventually the logs of k3s-agent on the node got me here to this error.
Just to add a comment in support of doing this cleanup.
I set up a clean install of k3s on 5 Raspberry Pi 4s.
Unfortunately, I had to completely reimage the OS on my last node (hostname: hive-node-4). After I got it all set up again and tried to get the node to join via k3sup, I noticed it was never actually joining even though the install was fresh. So the above instructions to run the uninstall script don't apply here, since the node was freshly imaged.
I'm running kubectl from my laptop with a KUBECONFIG set and trying to get the new hive-node-4 into the cluster. But the duplicate hostname causes this issue. There definitely needs to be a better way to clean this up. I don't think reusing a hostname is an uncommon thing.
This cleanup would allow my infrastructure to be far more immutable. My first intention was having my nodes automatically join the cluster on first boot, but this causes issues after a reimage.
Can something be done via kubectl delete node?
I think the node-controller API might be OK for removing the entry from the file.
The docs suggest that this is done via a cloud controller.
https://kubernetes.io/docs/concepts/architecture/controller/#direct-control
Note: Possible RKE2 impact? Was disabled in RKE2? Investigate as this gets addressed.
Facing the issue mentioned by @thebouv here: deleted nodes from the cluster and reused the hostnames with fresh VMs. But all I got is
time="2020-10-30T04:28:32.280855039Z" level=error msg="Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
I solved it by removing the old node entries:
ssh into the master node
sudo vi /var/lib/rancher/k3s/server/cred/node-passwd
delete the deprecated node entry, then save; a few seconds later you will get your new node in the cluster
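A non-interactive version of the same fix, as a hedged sketch: it assumes each line of node-passwd contains the stale node's hostname (as described in this thread), keeps a backup, and should be run as root on the server. NODE is a placeholder:

```shell
# Sketch: drop the stale entry for a deleted node from node-passwd.
# Assumes one line per node containing the hostname; keeps a .bak copy.
NODE="${NODE:-black-pearl}"   # placeholder: hostname of the deleted node
PASSWD="${PASSWD:-/var/lib/rancher/k3s/server/cred/node-passwd}"
if [ -f "$PASSWD" ]; then
  cp "$PASSWD" "$PASSWD.bak"      # backup before editing
  sed -i "/$NODE/d" "$PASSWD"     # delete any line mentioning the hostname
fi
```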
@erikwilson I don't see a backport PR for this into the 1.19 branch. Is this just for 1.20? We're already working on shipping 1.19.5, so I'm bumping this out.
@erikwilson An issue to cover RKE2 was opened here as well: https://github.com/rancher/rke2/issues/616 Not sure what this entails - whether it's just a pull-through PR or requires more work. However, we'd like to get this fixed in RKE2 for our next patch release there (planned for 1/13/21).
Reproduced the issue using k3s version v1.19.5+k3s1: a new node with the same hostname cannot be joined after the node with that hostname was deleted.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-16-236 Ready <none> 9m30s v1.19.5+k3s1
ip-172-31-29-156 Ready master 14m v1.19.5+k3s1
ubuntu@ip-172-31-29-156:~$ kubectl delete node ip-172-31-16-236
node "ip-172-31-16-236" deleted
k3s-master
Dec 17 07:49:42 ip-172-31-29-156 k3s[2150]: time="2020-12-17T07:49:42.676275874Z" level=error msg="Node password validation failed for 'ip-172-31-16-236', using passwd file '/var/lib/rancher/k3s/server/cred/node-passwd'"
k3s-agent
Dec 17 07:03:50 ip-172-31-16-236 k3s[2782]: time="2020-12-17T07:03:50.833861337Z" level=error msg="Failed to retrieve agent config: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Validated that a node with the same hostname can be joined after being deleted, on k3s version v1.20.0-rc4+k3s1.
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-172-31-16-217 Ready control-plane,master 27m v1.20.0-rc4+k3s1 172.31.16.217 <none> Ubuntu 20.04.1 LTS 5.4.0-1029-aws containerd://1.4.3-k3s1
ip-172-31-17-227 Ready <none> 66s v1.20.0-rc4+k3s1 172.31.17.227 <none> Ubuntu 20.04.1 LTS 5.4.0-1029-aws containerd://1.4.3-k3s1
kubectl delete node ip-172-31-17-227
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-172-31-16-217 Ready control-plane,master 28m v1.20.0-rc4+k3s1 172.31.16.217 <none> Ubuntu 20.04.1 LTS 5.4.0-1029-aws containerd://1.4.3-k3s1
ip-172-31-17-227 Ready <none> 38s v1.20.0-rc4+k3s1 172.31.25.36 <none> Ubuntu 20.04.1 LTS 5.4.0-1029-aws containerd://1.4.3-k3s1
Noticed that on the newly added node, the logs show the error below every 10 seconds.
Dec 21 16:34:50 ip-172-31-12-56 k3s[2171]: E1221 16:34:50.936837 2171 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "ip-172-31-12-56": the object has been modified; please apply your changes to the latest version and try again
Dec 21 16:35:00 ip-172-31-12-56 k3s[2171]: E1221 16:35:00.964220 2171 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "ip-172-31-12-56": the object has been modified; please apply your changes to the latest version and try again
Dec 21 16:35:11 ip-172-31-12-56 k3s[2171]: E1221 16:35:11.037132 2171 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "ip-172-31-12-56": the object has been modified; please apply your changes to the latest version and try again
The above error was seen because the node that was deleted in Kubernetes was still running. Shutting the old node down resolved the messages seen in the logs.
Any backport on k3s 1.19? (k3s 1.20 is not yet compatible with Rancher..)
If all goes as planned, upcoming Rancher 2.5.6 will support 1.20
Hi, I have a similar issue, with the message "Failed to retrieve agent config: Node password ...". My problem is caused by a reused hostname :( When I add --with-node-id to the service, the node can join the cluster:
in /etc/systemd/system/k3s-agent.service: .... ExecStart=/usr/local/bin/k3s \ agent --with-node-id \
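Instead of editing k3s-agent.service in place, the flag can also be set via a systemd drop-in, which survives the unit file being rewritten on reinstall. A sketch assuming the default unit name and binary path from the install script; DROPIN defaults to a local placeholder directory for illustration and would be /etc/systemd/system/k3s-agent.service.d on a real host:

```shell
# Sketch: enable --with-node-id through a systemd drop-in.
# On a real host, set DROPIN=/etc/systemd/system/k3s-agent.service.d, run as
# root, then: systemctl daemon-reload && systemctl restart k3s-agent
DROPIN="${DROPIN:-./k3s-agent.service.d}"   # placeholder dir for illustration
mkdir -p "$DROPIN"
cat > "$DROPIN/override.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/local/bin/k3s agent --with-node-id
EOF
```

The empty `ExecStart=` line is required so the drop-in replaces, rather than appends to, the original command.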
I think I have the same issue:
"Failed to retrieve agent config: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling...
Fresh K3S installation on both nodes.
Yes, it's still reproducible @wsdt
Do the hosts have unique hostnames? If reinstalling on a host, are you deleting the node from the cluster before reinstalling?
Do the hosts have unique hostnames?
Yes.
If reinstalling on a host, are you deleting the node from the cluster before reinstalling?
Yes.
Can you share the steps you are running to reproduce this? I'm not sure how you would have an existing node password entry on a fresh install - there must be some data left behind from a previous installation.
Can you share the steps you are running to reproduce this?
Setup cluster with
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--tls-san rpi4.lan" sh -
After a few days the node can't reconnect to the master node; I get an error described in issue 1452 (not linking as they aren't related). So I remove the worker node with k delete node rpi3, log in to rpi3 (the failing node), run the killall k3s script, run the uninstall k3s-agent script, and reboot rpi3.
I re-add the rpi3 node with curl -sfL https://get.k3s.io | K3S_URL=https://rpi4.lan:6443 K3S_TOKEN=<<token>> sh - and get the error above.
Good news: you can keep deleting the node rpi3 and rebooting the master and worker nodes, and at some point it will re-add the node without that error.
Can you provide logs from the k3s server around the timeframe that you are deleting and attempting to rejoin the node? If rebooting fixes the issue, I suspect that the server may just be performing poorly and not recognizing the node deletion quickly enough.
@bradtopol I just reformatted both nodes and installed everything again. Now it works. Thus, the issue should have been on my side :-)
@brandond I might be wrong, but the server was saying that the node was attempting to connect with an incorrect token (or password). I verified the password manually in /etc/rancher/node/password, got a new registration token from /var/lib/rancher/k3s/server/node-token, and it still couldn't connect.
@wsdt On my RPis the problem comes back every now and then, or after a few reboots. Can you check the k3s-server logs to see if it's back?
The node password is not the same as the registration token. I think it's linked above, but please take a look at https://rancher.com/docs/k3s/latest/en/architecture/#how-agent-node-registration-works
@agilob Oh hopefully it doesn't come back 🤪 but yeah will keep you updated in that case.
I experienced a very similar issue:
kubectl -n kube-system delete secrets <agent-node-name>.node-password.k3s
Many thanks, this solved the problem for me 👍🏻
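For newer k3s versions where node passwords are stored as kube-system secrets (as the command above shows), the cleanup can be generalized. This is a hedged sketch, not from this thread: it assumes admin kubectl access and the `<node>.node-password.k3s` secret naming, so review what it would delete before running it:

```shell
# Sketch: delete <node>.node-password.k3s secrets whose node no longer exists.
# Requires admin kubectl access; review the echoed names before trusting it.
for s in $(kubectl -n kube-system get secrets -o name | grep '\.node-password\.k3s$'); do
  node="$(basename "$s" .node-password.k3s)"   # strip "secret/" prefix and suffix
  if ! kubectl get node "$node" >/dev/null 2>&1; then
    echo "deleting stale node-password secret for $node"
    kubectl -n kube-system delete "$s"
  fi
done
```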
Just to add my solution to this issue, make sure that you don't have same hostnames for different machines. It was my case, so changing the hostname and reinstalling agent fixed the problem.
Just to add my solution to this issue, make sure that you don't have same hostnames for different machines. It was my case, so changing the hostname and reinstalling agent fixed the problem.
Oh man, just found your comment as I wanted to post exactly that problem/solution. ;)
Solution: Different hostnames hostnamectl set-hostname node1
* kubectl -n kube-system delete secrets <agent-node-name>.node-password.k3s
Many thanks, this solved the problem for me 👍🏻
Thanks for posting this. I had to reinstall the OS on my agent Pi; I had deleted the node from the cluster, but I had some issue with the master Pi which prevented it from cleaning everything up, I guess... I manually deleted the second node from everywhere I could find and was not aware that there was a secret with a password lol
I'm not sure if this is the right place for the bug report, because the error message I got has only one Google result and it points to the commit message that added password validation, so here it is.
I have a few RPis: a 1 and a 0W. I installed HypriotOS on them and after installing k3s changed some of their hostnames. I changed the hostname of black-pearl to rpi1, removed the black-pearl node from k3s-server, created another black-pearl on the RPi 0W, and here comes the problem: k3s on the rpi0 (black-pearl) couldn't join the cluster because the password didn't match (k3s-agent).
I spent some time trying to fix it and noticed that the old password for black-pearl, which is now rpi1, is still in /var/lib/rancher/k3s/server/cred/node-passwd despite running kubectl delete node black-pearl.
Seems that removing a node should also remove the password for that node, in case another node with the same hostname (OS is reinstalled?) re-joins the cluster.