rancher / k3os

Purpose-built OS for Kubernetes, fully managed by Kubernetes.
https://k3os.io
Apache License 2.0

[v0.20.7+k3s1] TLS handshake error #737

Open · SteffenBlake opened this issue 3 years ago

SteffenBlake commented 3 years ago

Version (k3OS / kernel)

k3os --version
Server: v0.20.7-k3s1r0
Node: v0.20.7-k3s1r0

uname --kernel-release --kernel-version
Server: 5.4.51-v8+ #1333 SMP PREEMPT Mon Aug 10 16:58:35 BST 2020
Node: 4.14.5-92 #1 SMP PREEMPT Mon Dec 11 15:48:15 UTC 2017

Architecture

uname --machine
Server: aarch64
Agent: armv7l

kubectl version

Server:
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7+k3s1", GitCommit:"aa768cbdabdb44c95c5c1d9562ea7f5ded073bc0", GitTreeState:"clean", BuildDate:"2021-05-20T00:57:15Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/arm64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7+k3s1", GitCommit:"aa768cbdabdb44c95c5c1d9562ea7f5ded073bc0", GitTreeState:"clean", BuildDate:"2021-05-20T00:57:15Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/arm64"}

Agent:
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7+k3s1", GitCommit:"aa768cbdabdb44c95c5c1d9562ea7f5ded073bc0", GitTreeState:"clean", BuildDate:"2021-05-20T01:01:56Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/arm"}

Hardware
Server: Raspberry Pi 4
Agent: Odroid HC2

Describe the bug

A Raspberry Pi 4 is configured as the Control Plane server for my cluster, and an Odroid HC2 is configured as an agent.

When the HC2 Agent boots up and starts attempting to connect, I get the following error messages in /var/log/k3s-service.log on the Control Plane:

time="2021-08-16T21:24:08.866616712Z" level=info msg="Cluster-Http-Server 2021/08/16 21:24:08 http: TLS handshake error from 192.168.0.32:50122: remote error: tls: bad certificate"
time="2021-08-16T21:24:08.913572392Z" level=info msg="Cluster-Http-Server 2021/08/16 21:24:08 http: TLS handshake error from 192.168.0.32:50134: remote error: tls: bad certificate"
time="2021-08-16T21:24:09.772299431Z" level=error msg="unable to verify hash for node 'haruka': hash does not match"

These errors repeat every few seconds.
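(For reference, the entries can be watched live on the Control Plane with something like the following, using the same log path quoted above; the grep filter is only for convenience.)

tail -f /var/log/k3s-service.log | grep -i 'tls handshake'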

Expected behavior

The HC2 Agent successfully establishes a connection with the Control Plane.

Actual behavior

The Control Plane reports bad-certificate errors and the connection is never established.

Additional context

I have several other Raspberry Pi 4 and B+ models running on the cluster as agents, all of which successfully establish connections with the Control Plane.

Both the Control Plane and the HC2 Agent are able to ping each other's IPs successfully.

The HC2 Agent has the exact same configuration as the other Raspberry Pi agents.

SteffenBlake commented 3 years ago

Tested same setup on various release versions:

v0.21.1-k3s1r (Prerelease) - Same errors
v0.20.7-k3s1r0 - Same errors
v0.20.6-k3s1r0 - Same errors
v0.19.11-k3s1r0 - Working
v0.19.8-k3s1r0 - Working
v0.11.0 - Working

dweomer commented 3 years ago

It is not atypical for ARM boards to lack hardware clocks, which means they must get their time during start-up from an external source. It is possible that one or both of your boards are not getting the time from the network (or that they simply disagree by 5 or more minutes on the time that the swclock persists between boots). Connman is supposed to take care of this but sometimes fails miserably. I find that this snippet works reliably on my NUC (amd64) and RPi4 (aarch64) test hardware:

run_cmd:
- rc-service ntpd start
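
Assuming that run_cmd snippet lives in the k3os config (e.g. /var/lib/rancher/k3os/config.yaml), a quick sanity check on both nodes would look something like this (a sketch only):

rc-service ntpd status   # ntpd should be started (OpenRC)
date -u                  # run on both nodes; the timestamps should agree to within a few minutes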

Because clock skew is corrected asynchronously, race conditions are still possible when setting up a cluster (k3s can start up before the skew has been fully corrected). If you are only setting up a single server plus a worker, you should find that once the clocks agree the worker will be able to join the cluster successfully. A reboot after wiping /var/lib/rancher/k3s on the worker may be necessary.
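
On the worker that would look roughly like the following (a sketch only; "k3s-service" is my assumption for the OpenRC service name based on the log path above, so double-check it on your install):

rc-service k3s-service stop    # stop the k3s agent (assumed service name)
rm -rf /var/lib/rancher/k3s    # wipe the agent state so it re-registers with the server
reboot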

dweomer commented 3 years ago

Tested same setup on various release versions:

v0.21.1-k3s1r (Prerelease) - Same errors
v0.20.7-k3s1r0 - Same errors
v0.20.6-k3s1r0 - Same errors
v0.19.11-k3s1r0 - Working
v0.19.8-k3s1r0 - Working
v0.11.0 - Working

Oh, shoot. I skimmed this and missed some detail. Doesn't blow my theory out of the water but does indicate that maybe something else is going on. Hmm. Still worth making sure that the clocks on both machines agree.