k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Removing node doesn't remove node password #802

Closed: agilob closed this issue 3 years ago

agilob commented 4 years ago

I'm not sure if this is the right place for the bug report, because the error message I got has only one Google result, and it points to the commit that added the password validation shown below, so here it is.

I have a few Raspberry Pis: a 1 and a 0W. I installed Hypriot on them and, after installing k3s, changed some of their hostnames. I changed the hostname of black-pearl to rpi1, removed the black-pearl node from the k3s server, and created another black-pearl on the RPi 0W. Here comes the problem: k3s on the RPi 0W (black-pearl) couldn't join the cluster because the password didn't match:

k3s-agent:

level=info msg="Running load balancer 127.0.0.1:41241 ->[k3s.local:6443]"
level=error msg="Node password rejected, contents of '/var/lib/rancher/k3s/agent/node-password.txt' may not match server passwd entry"
level=error msg="Node password rejected, contents of '/var/lib/rancher/k3s/agent/node-password.txt' may not match server passwd entry"
level=error msg="Node password rejected, contents of '/var/lib/rancher/k3s/agent/node-password.txt' may not match server passwd entry"
level=error msg="Node password rejected, contents of '/var/lib/rancher/k3s/agent/node-password.txt' may not match server passwd entry"
level=error msg="Node password rejected, contents of '/var/lib/rancher/k3s/agent/node-password.txt' may not match server passwd entry"

I spent some time trying to fix it and noticed that the old password for black-pearl (which is now rpi1) is still in /var/lib/rancher/k3s/server/cred/node-passwd, despite running kubectl delete node black-pearl.

It seems that removing a node should also remove the password for that node, in case another node with the same hostname (e.g. after an OS reinstall) re-joins the cluster.

galal-hussein commented 4 years ago

Removing a Kubernetes node using kubectl is not supposed to clean up the files generated by k3s. To fully uninstall k3s from a node you might want to use the /usr/local/bin/k3s-uninstall.sh script that should be installed on the system.
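
For reference, these are the uninstall scripts dropped by the installer (a minimal sketch; paths assume the default install location):

# on a server node
sudo /usr/local/bin/k3s-uninstall.sh

# on an agent node
sudo /usr/local/bin/k3s-agent-uninstall.sh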

agilob commented 4 years ago

To improve the user experience, shouldn't kubectl remove the hostname:password entry from /var/lib/rancher/k3s/server/cred/node-passwd when the node is deleted? As it was my first time with KxS, it took me a while to figure out where the password is stored and why it isn't removed. I'm happy to close this if you disagree; at the least it will be some help to other users.

erikwilson commented 4 years ago

It probably should; we are already cleaning up the CoreDNS hosts entry here: https://github.com/rancher/k3s/blob/36ca6060733725953b7a4cd2b53a295d11aea684/pkg/node/controller.go#L36

The issue isn't with cleaning up the node; it is with cleaning up node-passwd on the server.
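
For anyone hitting this, a quick way to confirm a stale entry on the server is to grep the file (black-pearl here is just the hostname from this issue):

sudo grep black-pearl /var/lib/rancher/k3s/server/cred/node-passwd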

maxirus commented 4 years ago

In my case, I uninstalled (via script) and removed the node via kubectl. Then upon reinstall this issue popped up.

Uninstalling again and then removing the entry from {data-dir}/server/cred/node-passwd (default /var/lib/rancher/k3s/server/cred/node-passwd) worked for me.
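
As a concrete sketch of that manual cleanup (the node name is a placeholder; adjust the path if a custom --data-dir is used):

# back up the file, then drop the stale line for the reused hostname
sudo cp /var/lib/rancher/k3s/server/cred/node-passwd /var/lib/rancher/k3s/server/cred/node-passwd.bak
sudo sed -i '/<agent-node-name>/d' /var/lib/rancher/k3s/server/cred/node-passwd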

alexellis commented 4 years ago

@ibuildthecloud I ran into this issue too, and it was really confusing.

Uninstalling k3s-agent and reinstalling had no effect. Eventually the logs of k3s-agent on the node led me here, to this error.

thebouv commented 3 years ago

Just to add a comment in support of doing this cleanup.

I set up a clean install of k3s on 5 raspberry pi 4s.

Unfortunately, I had to completely reimage the OS on my last node (hostname: hive-node-4). After I got it all set up again and tried to join the node via k3sup, I noticed it never actually joined even though the install was fresh. So the above instructions to run the uninstall script don't help in my case.

I'm running kubectl from my laptop with a KUBECONFIG set and trying to get the new hive-node-4 into the cluster. But the duplicate hostname causes this issue. There definitely needs to be a better way to clean this up. I don't think reusing a hostname is an uncommon thing.

David-Igou commented 3 years ago

This cleanup would allow my infrastructure to be far more immutable. My first intention was having my nodes automatically join the cluster on first boot, but this causes issues after a reimage.

ieugen commented 3 years ago

Can something be done via kubectl delete node? I think the node-controller API might be OK for removing the entry from the file. The docs suggest that this is done via a cloud controller: https://kubernetes.io/docs/concepts/architecture/controller/#direct-control

davidnuzik commented 3 years ago

Note: possible RKE2 impact? Was this disabled in RKE2? Investigate as this gets addressed.

ulm0 commented 3 years ago

Facing the issue mentioned by @thebouv here: I deleted nodes from the cluster and reused the hostnames on fresh VMs, but all I got is

time="2020-10-30T04:28:32.280855039Z" level=error msg="Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

alex77g commented 3 years ago

I solved it by removing the old node entries: ssh into the master node, run sudo vi /var/lib/rancher/k3s/server/cred/node-passwd, delete the deprecated node entry, then save. A few seconds later you will get your new node in the cluster.

davidnuzik commented 3 years ago

@erikwilson I don't see a backport PR for this into the 1.19 branch. Is this just for 1.20? We're already working on shipping 1.19.5, so I'm bumping this out.

davidnuzik commented 3 years ago

@erikwilson An issue to cover RKE2 was opened here as well: https://github.com/rancher/rke2/issues/616. Not sure what this entails, i.e. whether it's just a pull-through PR or requires more work. However, we'd like to get this fixed in RKE2 too for our next patch release there (planned for 1/13/21).

ShylajaDevadiga commented 3 years ago

Reproduced the issue using k3s version v1.19.5+k3s1: a new node with the same hostname cannot join after the node with that hostname was deleted:

kubectl get nodes 
NAME               STATUS   ROLES    AGE     VERSION
ip-172-31-16-236   Ready    <none>   9m30s   v1.19.5+k3s1
ip-172-31-29-156   Ready    master   14m     v1.19.5+k3s1
ubuntu@ip-172-31-29-156:~$ kubectl delete node ip-172-31-16-236 
node "ip-172-31-16-236" deleted

k3s-master:
Dec 17 07:49:42 ip-172-31-29-156 k3s[2150]: time="2020-12-17T07:49:42.676275874Z" level=error msg="Node password validation failed for 'ip-172-31-16-236', using passwd file '/var/lib/rancher/k3s/server/cred/node-passwd'"

k3s-agent:

Dec 17 07:03:50 ip-172-31-16-236 k3s[2782]: time="2020-12-17T07:03:50.833861337Z" level=error msg="Failed to retrieve agent config: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

Validated that a node with the same hostname can be joined after being deleted, on k3s version v1.20.0-rc4+k3s1:

kubectl get nodes -o wide
NAME               STATUS   ROLES                  AGE   VERSION            INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
ip-172-31-16-217   Ready    control-plane,master   27m   v1.20.0-rc4+k3s1   172.31.16.217   <none>        Ubuntu 20.04.1 LTS   5.4.0-1029-aws   containerd://1.4.3-k3s1
ip-172-31-17-227   Ready    <none>                 66s   v1.20.0-rc4+k3s1   172.31.17.227   <none>        Ubuntu 20.04.1 LTS   5.4.0-1029-aws   containerd://1.4.3-k3s1

kubectl delete node ip-172-31-17-227

kubectl get nodes -o wide
NAME               STATUS   ROLES                  AGE   VERSION            INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
ip-172-31-16-217   Ready    control-plane,master   28m   v1.20.0-rc4+k3s1   172.31.16.217   <none>        Ubuntu 20.04.1 LTS   5.4.0-1029-aws   containerd://1.4.3-k3s1
ip-172-31-17-227   Ready    <none>                 38s   v1.20.0-rc4+k3s1   172.31.25.36    <none>        Ubuntu 20.04.1 LTS   5.4.0-1029-aws   containerd://1.4.3-k3s1

ShylajaDevadiga commented 3 years ago

Noticed that on the node that was re-added, the logs show the error message below every 10 seconds.

Dec 21 16:34:50 ip-172-31-12-56 k3s[2171]: E1221 16:34:50.936837    2171 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "ip-172-31-12-56": the object has been modified; please apply your changes to the latest version and try again
Dec 21 16:35:00 ip-172-31-12-56 k3s[2171]: E1221 16:35:00.964220    2171 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "ip-172-31-12-56": the object has been modified; please apply your changes to the latest version and try again
Dec 21 16:35:11 ip-172-31-12-56 k3s[2171]: E1221 16:35:11.037132    2171 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "ip-172-31-12-56": the object has been modified; please apply your changes to the latest version and try again
ShylajaDevadiga commented 3 years ago

The above error is seen because the node that was deleted in Kubernetes was still running. Shutting the old node down resolved the message seen in the logs.

Steps followed.

poblin-orange commented 3 years ago

Any backport to k3s 1.19? (k3s 1.20 is not yet compatible with Rancher...)

davidnuzik commented 3 years ago

If all goes as planned, the upcoming Rancher 2.5.6 will support 1.20.

gmoulard commented 3 years ago

Hi, I have a similar issue, with the message "Failed to retrieve agent config: Node password ...". My problem is due to a reused hostname :( When I add --with-node-id to the service, the node can join the cluster.

In /etc/systemd/system/k3s-agent.service:

ExecStart=/usr/local/bin/k3s \
    agent --with-node-id \
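
After editing the unit file, the change only takes effect once systemd reloads it and the agent restarts (assuming the default k3s-agent unit name):

sudo systemctl daemon-reload
sudo systemctl restart k3s-agent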

wsdt commented 3 years ago

I think I have the same issue: "Failed to retrieve agent config: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling...

Fresh K3S installation on both nodes.

agilob commented 3 years ago

Yes, it's still reproducible @wsdt

brandond commented 3 years ago

Do the hosts have unique hostnames? If reinstalling on a host, are you deleting the node from the cluster before reinstalling?

agilob commented 3 years ago

Do the hosts have unique hostnames?

Yes.

If reinstalling on a host, are you deleting the node from the cluster before reinstalling?

Yes.

brandond commented 3 years ago

Can you share the steps you are running to reproduce this? I'm not sure how you would have an existing node password entry on a fresh install - there must be some data left behind from a previous installation.

wsdt commented 3 years ago

I followed this setup: https://kauri.io/#collections/Build%20your%20very%20own%20self-hosting%20platform%20with%20Raspberry%20Pi%20and%20Kubernetes/%2838%29-install-and-configure-a-kubernetes-cluster-w/

agilob commented 3 years ago

Can you share the steps you are running to reproduce this?

Set up the cluster with

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--tls-san rpi4.lan" sh -

After a few days the node can't reconnect to the master node; I get an error described in issue 1452 (not linking, as they aren't related). So I remove the worker node with kubectl delete node rpi3, log in to rpi3 (the failing node), run the k3s killall script, run the k3s-agent uninstall script, and reboot rpi3.

I re-add the rpi3 node with curl -sfL https://get.k3s.io | K3S_URL=https://rpi4.lan:6443 K3S_TOKEN=<<token>> sh - and get the error above.

Good news: you can keep deleting the rpi3 node and rebooting the master and worker nodes, and at some point it will re-add the node without that error.
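
For reference, a rough consolidation of the steps described above (the hostnames rpi3/rpi4.lan and the token placeholder come from this particular setup):

# on the server
kubectl delete node rpi3

# on rpi3, the failing agent
sudo /usr/local/bin/k3s-killall.sh
sudo /usr/local/bin/k3s-agent-uninstall.sh
sudo reboot

# re-join rpi3 to the cluster
curl -sfL https://get.k3s.io | K3S_URL=https://rpi4.lan:6443 K3S_TOKEN=<token> sh -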

brandond commented 3 years ago

Can you provide logs from the k3s server around the timeframe that you are deleting and attempting to rejoin the node? If rebooting fixes the issue, I suspect that the server may just be performing poorly and not recognizing the node deletion quickly enough.

wsdt commented 3 years ago

@brandond I just reformatted both nodes and installed everything again. Now it works. So the issue must have been on my side :-)

agilob commented 3 years ago

@brandond I might be wrong, but the server was saying that the node was attempting to connect with an incorrect token (or password). I verified the password manually in /etc/rancher/node/password, got a new registration token from /var/lib/rancher/k3s/server/node-token, and it still couldn't connect.

@wsdt on my RPis the problem comes back every now and then, or after a few reboots. Can you check the k3s server logs if it comes back?
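
On a systemd-based install the relevant logs can be followed with journalctl (unit names assume the default install):

# on the server
sudo journalctl -u k3s -f

# on an agent
sudo journalctl -u k3s-agent -f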

brandond commented 3 years ago

The node password is not the same as the registration token. I think it's linked above, but please take a look at https://rancher.com/docs/k3s/latest/en/architecture/#how-agent-node-registration-works
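
As a quick orientation, these are the pieces involved (default paths; newer releases store the per-node password in a kube-system secret rather than only in the cred/node-passwd file):

# cluster registration token, on the server
sudo cat /var/lib/rancher/k3s/server/node-token

# per-node password, generated on each agent at first join
sudo cat /etc/rancher/node/password

# server-side record the agent is validated against (older releases)
sudo cat /var/lib/rancher/k3s/server/cred/node-passwd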

wsdt commented 3 years ago

@agilob Oh hopefully it doesn't come back 🤪 but yeah will keep you updated in that case.

tempestrock commented 3 years ago

I experienced a very similar issue:

mannp commented 3 years ago

* kubectl -n kube-system delete secrets <agent-node-name>.node-password.k3s

Many thanks, this solved the problem for me 👍🏻
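
For anyone landing here on a newer k3s release, where the node password lives as a secret in kube-system, the stale entry can be found and removed like this (the node name is a placeholder):

kubectl -n kube-system get secrets | grep node-password.k3s
kubectl -n kube-system delete secret <agent-node-name>.node-password.k3s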

AlexisTonneau commented 2 years ago

Just to add my solution to this issue: make sure that you don't have the same hostname on different machines. That was my case, so changing the hostname and reinstalling the agent fixed the problem.

nickma82 commented 1 year ago

Just to add my solution to this issue: make sure that you don't have the same hostname on different machines. That was my case, so changing the hostname and reinstalling the agent fixed the problem.

Oh man, just found your comment as I wanted to post exactly that problem/solution. ;) Solution: use different hostnames, e.g. hostnamectl set-hostname node1.

migs35323 commented 7 months ago

* kubectl -n kube-system delete secrets <agent-node-name>.node-password.k3s

Many thanks, this solved the problem for me 👍🏻

Thanks for posting this. I had to reinstall the OS on my agent Pi. I had deleted the node from the cluster, but I had some issue with the master Pi which prevented it from cleaning everything up, I guess... I manually deleted the second node from everywhere I could find, and was not aware that there was a secret with a password, lol.