You don't need to set K3S_URL (--server) when using an external datastore; this is only for use when joining agents or using embedded etcd.
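Roughly, the split looks like this (just a sketch with placeholder values in angle brackets, not your exact flags):
# server against an external datastore - no K3S_URL needed
K3S_DATASTORE_ENDPOINT='https://etcd1:2379,https://etcd2:2379,https://etcd3:2379' K3S_TOKEN=<shared-secret> k3s server
# agent joining that server - this is the only place K3S_URL / --server belongs
K3S_URL=https://<server-hostname>:6443 K3S_TOKEN=<shared-secret> k3s agent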
I am curious how you came to have two nodes with the control-plane role label. This wasn't added until 1.20, yet your nodes are all still on 1.19. Did you upgrade temporarily, and then downgrade again?
In the past I have seen behavior like this when servers were all brought up at the same time and raced to bootstrap the cluster CA certs, or when nodes were started up with existing certs from a different cluster that they then try to use instead of the ones recognized by the rest of the cluster.
It sounds like these nodes have been through some odd things. I run my personal cluster with an external etcd and haven't had any problems with it; I suspect something in the way you started up, upgraded, or grew this cluster has left it very confused about what certificates to use.
You don't need to set K3S_URL
Added K3S_URL to see if it would make any difference. I started second-guessing myself and wondering if the 2nd/3rd nodes were fighting with the first, and whether K3S_URL was the missing piece, but it didn't seem to make a difference or change any behavior. Nice to get some clarification that it's only agent nodes that need it.
curious how you came to have two nodes with the control-plane role label
Was wondering about that as well. The control-plane label comes and goes. At one point I saw nodes reporting 1.20 in the get nodes output even though the INSTALL_K3S_VERSION=v1.19.7+k3s1 env var was set, which felt weird. There was an initial upgrade run without that env var set a couple of days ago and the nodes all ended up on v1.20+, but then Rancher refused to install, so I uninstalled each node and added the env var to downgrade so that Rancher would install.
That got everything running again - but then the next day: Unable to connect to the server: x509: certificate signed by unknown authority
seen behavior like this when servers were all brought up at the same time
Makes sense. I'm starting/upgrading/uninstalling nodes one by one through this process, so I doubt that was happening on these re-installs. Although the initial failure was due to the physical server (a KVM Linux box with each node being a VM) being shut down gracefully and restarted, so the nodes would have been restarting at pretty much the same time when the box was powered back up.
sounds like these nodes have been through some odd things.
Feels the same. Initial installation was last March, followed by a week of testing (k3s + Rancher) - then the nodes just sat there idling until the box was powered down and back up, and the cluster was broken.
FYI, the original nodes running this cluster have all been deleted and replaced with brand new, fresh 'n clean VMs hoping to purge any weirdness.
It's disconcerting that there doesn't seem to be a path to recover this sick cluster. There's no way I would feel confident going into production if a reboot (graceful or otherwise) could throw things into an unrecoverable state.
Where exactly is the configuration for the certificate(s) on each node located?
The control-plane label comes and goes. At one point I saw nodes reporting 1.20 in the get nodes output even though INSTALL_K3S_VERSION=v1.19.7+k3s1 env var was set which felt weird.
That variable is only used by the install script. Do you somehow have something running that is reinstalling and restarting K3s? It doesn't self-update, although the system-upgrade-controller (available through Rancher) will create jobs to do rolling updates to the cluster. Is that something you were playing with at some point? Do you perhaps have multiple nodes (old VMs or something) with duplicate hostnames pointed at the etcd datastore?
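A quick way to spot stale or duplicate registrations and catch the label/version flapping in the act (sketch, adjust to taste):
kubectl get nodes -o wide --show-labels
kubectl get nodes --watch -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion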
It's disconcerting that there doesn't seem to be a path to recover this sick cluster.
At the very least you should be able to stabilize the cluster by going down to a single server node so that they're not all arguing about certificates. Just uninstall all but one, delete the nodes, make sure the local disk is clean, then reinstall one at a time. This assumes you figure out what it is that's changing your versions...
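Something like this on each server you're retiring (a sketch - double-check the paths before deleting anything):
kubectl delete node <node-name>   # run from a node you're keeping
k3s-uninstall.sh                  # run on the node being removed
ls /etc/rancher/k3s /var/lib/rancher/k3s 2>/dev/null   # both should be gone before you reinstall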
Where exactly is the configuration for the certificate(s) on each node located?
/var/lib/rancher/k3s/server/tls/
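To see what's in there at a glance, something like this works (sketch, assuming the stock file names):
for f in /var/lib/rancher/k3s/server/tls/*.crt; do echo "== $f"; openssl x509 -in "$f" -noout -subject -enddate; done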
Do you somehow have something running that is reinstalling and restarting K3s?
Hmm, the VMs are deployed with Chef... I manually walked through the Chef deployment and, best I can tell, I'm not seeing anywhere that it's running or restarting k3s. To be certain, I commented out the k3s install section where it downloads the k3s.install script, to make sure it's not the culprit.
system-upgrade-controller
Haven't played with that at all. Overall it's very plain-jane, with not much experimentation beyond installing Rancher on top of k3s.
you should be able to stabilize the cluster by going down to a single server node
Great idea, have done exactly that. Now running a freshly re-installed single node and keeping an eye on the timestamps in /var/lib/rancher/k3s/server/tls/, hoping that will at least tell me if those certs are getting changed by anything.
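To catch anything rewriting them, a simple baseline/compare may be easier than watching timestamps (sketch):
sha256sum /var/lib/rancher/k3s/server/tls/*.crt /var/lib/rancher/k3s/server/tls/*.key > /root/tls-baseline.txt
sha256sum -c /root/tls-baseline.txt   # re-run later; any FAILED line means a file changed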
The kube-system pods aren't happy, and the now-single-node cluster is limping along in poor shape.
NAME READY STATUS RESTARTS AGE
coredns-66c464876b-cm94q 0/1 Running 0 68m
helm-install-traefik-cxbm2 0/1 CrashLoopBackOff 18 70m
svclb-traefik-4t79s 2/2 Running 0 83m
traefik-6f9cbd9bd4-f9r92 1/1 Running 0 81m
helm-install-traefik is crashlooping and coredns is stuck with this in the logs:
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0209 20:54:55.692578 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Endpoints: Get "https://10.43.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E0209 20:54:55.694071 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Namespace: Get "https://10.43.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E0209 20:54:55.694896 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Service: Get "https://10.43.0.1:443/api/v1/services?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E0209 20:54:56.694804 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Endpoints: Get "https://10.43.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E0209 20:54:56.695503 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Namespace: Get "https://10.43.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E0209 20:54:56.696994 1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Service: Get "https://10.43.0.1:443/api/v1/services?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Have run kubectl delete -f /var/lib/rancher/k3s/server/manifests/coredns.yaml and re-applied it, hoping it would clean itself up, but no luck there so far.
Not sure why it's continuing to internally whinge about the certs now that it's a single node.
You might try deleting the crashlooping pod so that it can be recreated with the correct cluster CA certs.
:) I have deleted those misbehaving pods dozens of times now, as well as deleted the manifests entirely and then re-applied them once I confirmed the pods were indeed destroyed.
# kg -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-66c464876b-cm94q 0/1 Running 0 130m
helm-install-traefik-cxbm2 0/1 CrashLoopBackOff 30 133m
svclb-traefik-4t79s 2/2 Running 0 145m
traefik-6f9cbd9bd4-f9r92 1/1 Running 0 144m
[root@k3s-ya-1 server]# k -n kube-system delete pod helm-install-traefik-cxbm2
pod "helm-install-traefik-cxbm2" deleted
[root@k3s-ya-1 server]# kg -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-66c464876b-cm94q 0/1 Running 0 131m
helm-install-traefik-6c9jx 0/1 CrashLoopBackOff 2 39s
svclb-traefik-4t79s 2/2 Running 0 146m
traefik-6f9cbd9bd4-f9r92 1/1 Running 0 145m
Did you delete the Helm Job that the pod is coming from?
Poking at that crashlooping pod some more:
# k -n kube-system logs helm-install-traefik-6c9jx
CHART=$(sed -e "s/%{KUBERNETES_API}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/g" <<< "${CHART}")
set +v -x
+ cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates/
+ update-ca-certificates
WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping
--snip--
[storage] 2021/02/09 21:57:16 listing all releases with filter
[storage/driver] 2021/02/09 21:57:16 list: failed to list: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system/secrets?labelSelector=OWNER%3DTILLER": x509: certificate signed by unknown authority
Error: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system/secrets?labelSelector=OWNER%!D(MISSING)TILLER": x509: certificate signed by unknown authority
--snip--
chart path is a url, skipping repo update
Error: no repositories configured
--snip--
Error: Kubernetes cluster unreachable: Get "https://10.43.0.1:443/version?timeout=32s": x509: certificate signed by unknown authority
--snip--
+ helm_v3 install traefik https://10.43.0.1:443/static/charts/traefik-1.81.0.tgz --values /config/values-01_HelmChart.yaml
Error: failed to download "https://10.43.0.1:443/static/charts/traefik-1.81.0.tgz" (hint: running `helm repo update` may help)
Did you delete the Helm Job that the pod is coming from?
Pretty sure it's been deleted. I'm assuming that kubectl delete -f traefik.yaml (in the /var/lib/rancher/k3s/server/manifests folder) would delete the Helm job. That does seem to completely kill off that pod.
# kdf traefik.yaml
helmchart.helm.cattle.io "traefik" deleted
[root@k3s-ya-1 manifests]# kg -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-66c464876b-cm94q 0/1 Running 0 139m
svclb-traefik-4t79s 2/2 Running 0 154m
traefik-6f9cbd9bd4-f9r92 1/1 Running 0 153m
[root@k3s-ya-1 manifests]# kaf traefik.yaml
helmchart.helm.cattle.io/traefik created
[root@k3s-ya-1 manifests]# kg -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-66c464876b-cm94q 0/1 Running 0 139m
helm-install-traefik-4x847 0/1 CrashLoopBackOff 1 11s
svclb-traefik-4t79s 2/2 Running 0 154m
traefik-6f9cbd9bd4-f9r92 1/1 Running 0 153m
OK this bit is interesting:
+ cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates/
+ update-ca-certificates
WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping
Somehow the cluster CA has multiple certs in it? Can you cat /var/lib/rancher/k3s/server/tls/server-ca.crt on the server?
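Or, to avoid eyeballing base64, something like (sketch, assuming the usual k3s file names):
grep -c 'BEGIN CERTIFICATE' /var/lib/rancher/k3s/server/tls/server-ca.crt
openssl verify -CAfile /var/lib/rancher/k3s/server/tls/server-ca.crt /var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt
The first should print 1; the second should report OK if the serving cert was actually signed by that CA.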
Yup - have been eyeing that same log message - multiple certs!? What the...
# cat /var/lib/rancher/k3s/server/tls/server-ca.crt
-----BEGIN CERTIFICATE-----
MIIBdzCCAR2gAwIBAgIBADAKBggqhkjOPQQDAjAjMSEwHwYDVQQDDBhrM3Mtc2Vy
dmVyLWNhQDE2MTI2NDE0OTcwHhcNMjEwMjA2MTk1ODE3WhcNMzEwMjA0MTk1ODE3
WjAjMSEwHwYDVQQDDBhrM3Mtc2VydmVyLWNhQDE2MTI2NDE0OTcwWTATBgcqhkjO
PQIBBggqhkjOPQMBBwNCAATgE2WSc1B+7yNB3IOxahlI80B+uDNqtQ2OG+shRQtd
uuN3ehchBXgZ/7EzmT5QzKD/OWxgDs6D7GGrHfCRzH+so0IwQDAOBgNVHQ8BAf8E
BAMCAqQwDwYDVR0TAQH/BAUwAwEB/zAdBgNVHQ4EFgQUNguwKhD0HcYEZGwVvs3K
d1XfuGkwCgYIKoZIzj0EAwIDSAAwRQIgW3K54s1DChzOJllhZMBhrBv+zFsmGjg+
/TthN/1Z6U0CIQDu0BZo11CYar1F5h9gyfRLspMLxglCKtXrCwMgHYq2yQ==
-----END CERTIFICATE-----
Looks like a single cert to me. This is so weird - but it certainly is a crash course in troubleshooting k3s!
Oh, it turns out that WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping is an upstream issue from Alpine, unrelated to the CA cert we are dropping. I still think that the error is related to the service account somehow. Can you try:
kubectl delete serviceaccount -n kube-system helm-traefik
kubectl delete job -n helm-system helm-install-traefik
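The thinking is that recreating the serviceaccount gets a fresh token secret minted with whatever CA the controller-manager is currently handing out. To see what the pod is actually mounting versus the on-disk CA, something like this should show any mismatch (sketch, assuming the helm-traefik serviceaccount still exists):
kubectl -n kube-system get secret $(kubectl -n kube-system get sa helm-traefik -o jsonpath='{.secrets[0].name}') -o jsonpath='{.data.ca\.crt}' | base64 -d | diff - /var/lib/rancher/k3s/server/tls/server-ca.crt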
# kubectl delete serviceaccount -n kube-system helm-traefik
serviceaccount "helm-traefik" deleted
# kubectl delete job -n helm-system helm-install-traefik
Error from server (NotFound): jobs.batch "helm-install-traefik" not found
# kubectl delete -f traefik.yaml
helmchart.helm.cattle.io "traefik" deleted
Still seeing some traefik pods running though which makes me wonder why they're still there.
# kg -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-66c464876b-cm94q 0/1 Running 0 3h53m
svclb-traefik-4t79s 2/2 Running 0 4h8m
traefik-6f9cbd9bd4-f9r92 1/1 Running 0 4h7m
And same crashloop when re-applying traefik.yaml...
So I also deleted the traefik deployment and the svclb-traefik daemonset, re-deleted the traefik serviceaccount, the traefik service, traefik-prometheus, and coredns just for good measure.
Ended up here:
# k -n kube-system get pods,svc,ds,rs,deploy,jobs,ingress
Warning: extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
No resources found in kube-system namespace.
Alrighty, looks like kube-system is more or less fully nuked.
# kaf traefik.yaml
helmchart.helm.cattle.io/traefik created
# kg -n kube-system
NAME READY STATUS RESTARTS AGE
helm-install-traefik-njn8m 0/1 CrashLoopBackOff 2 35s
Darn.
Here's output of describe on that pod.
# k -n kube-system describe pod helm-install-traefik-njn8m
Name: helm-install-traefik-njn8m
Namespace: kube-system
Priority: 0
Node: k3s-ya-1/10.1.0.83
Start Time: Tue, 09 Feb 2021 18:52:17 -0500
Labels: controller-uid=9d44a166-f197-4800-8d9d-f6f81113ccdd
helmcharts.helm.cattle.io/chart=traefik
job-name=helm-install-traefik
Annotations: helmcharts.helm.cattle.io/configHash: SHA256=54DADC5C41A9E92996BEB90979244F7E4F0D86B23C3F54AAF5BBC497C412496E
Status: Running
IP: 10.42.1.205
IPs:
IP: 10.42.1.205
Controlled By: Job/helm-install-traefik
Containers:
helm:
Container ID: containerd://cca390ef69eb6a7b2bc3a7caca0d6b159ffd558e8d840d0814421d3fac6e720c
Image: rancher/klipper-helm:v0.4.3
Image ID: docker.io/rancher/klipper-helm@sha256:b319bce4802b8e42d46e251c7f9911011a16b4395a84fa58f1cf4c788df17139
Port: <none>
Host Port: <none>
Args:
install
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 09 Feb 2021 18:53:46 -0500
Finished: Tue, 09 Feb 2021 18:53:46 -0500
Ready: False
Restart Count: 4
Environment:
NAME: traefik
VERSION:
REPO:
HELM_DRIVER: secret
CHART_NAMESPACE: kube-system
CHART: https://%{KUBERNETES_API}%/static/charts/traefik-1.81.0.tgz
HELM_VERSION:
TARGET_NAMESPACE: kube-system
NO_PROXY: .svc,.cluster.local,10.42.0.0/16,10.43.0.0/16
Mounts:
/chart from content (rw)
/config from values (rw)
/var/run/secrets/kubernetes.io/serviceaccount from helm-traefik-token-jwpr6 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
values:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: chart-values-traefik
Optional: false
content:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: chart-content-traefik
Optional: false
helm-traefik-token-jwpr6:
Type: Secret (a volume populated by a Secret)
SecretName: helm-traefik-token-jwpr6
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 108s default-scheduler Successfully assigned kube-system/helm-install-traefik-njn8m to k3s-ya-1
Normal Pulled 19s (x5 over 108s) kubelet Container image "rancher/klipper-helm:v0.4.3" already present on machine
Normal Created 19s (x5 over 108s) kubelet Created container helm
Normal Started 19s (x5 over 107s) kubelet Started container helm
Warning BackOff 6s (x10 over 105s) kubelet Back-off restarting failed container
And here's the logs from that pod.
# kl -n kube-system helm-install-traefik-njn8m
CHART=$(sed -e "s/%{KUBERNETES_API}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/g" <<< "${CHART}")
set +v -x
+ cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates/
+ update-ca-certificates
WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping
+ '[' '' '!=' true ']'
+ export HELM_HOST=127.0.0.1:44134
+ HELM_HOST=127.0.0.1:44134
+ tiller --listen=127.0.0.1:44134 --storage=secret
+ helm_v2 init --skip-refresh --client-only --stable-repo-url https://charts.helm.sh/stable/
[main] 2021/02/09 23:55:07 Starting Tiller v2.16.10 (tls=false)
[main] 2021/02/09 23:55:07 GRPC listening on 127.0.0.1:44134
[main] 2021/02/09 23:55:07 Probes listening on :44135
[main] 2021/02/09 23:55:07 Storage driver is Secret
[main] 2021/02/09 23:55:07 Max history per release is 0
Creating /root/.helm
Creating /root/.helm/repository
Creating /root/.helm/repository/cache
Creating /root/.helm/repository/local
Creating /root/.helm/plugins
Creating /root/.helm/starters
Creating /root/.helm/cache/archive
Creating /root/.helm/repository/repositories.yaml
Adding stable repo with URL: https://charts.helm.sh/stable/
Adding local repo with URL: http://127.0.0.1:8879/charts
$HELM_HOME has been configured at /root/.helm.
Not installing Tiller due to 'client-only' flag having been set
++ helm_v2 ls --all '^traefik$' --output json
++ jq -r '.Releases | length'
[storage] 2021/02/09 23:55:07 listing all releases with filter
[storage/driver] 2021/02/09 23:55:07 list: failed to list: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system/secrets?labelSelector=OWNER%3DTILLER": x509: certificate signed by unknown authority
Error: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system/secrets?labelSelector=OWNER%!D(MISSING)TILLER": x509: certificate signed by unknown authority
+ EXIST=
+ '[' '' == 1 ']'
+ '[' '' == v2 ']'
+ shopt -s nullglob
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/traefik.tgz.base64
+ CHART_PATH=/traefik.tgz
+ '[' '!' -f /chart/traefik.tgz.base64 ']'
+ return
+ '[' install '!=' delete ']'
+ helm_repo_init
+ grep -q -e 'https\?://'
chart path is a url, skipping repo update
+ echo 'chart path is a url, skipping repo update'
+ helm_v3 repo remove stable
Error: no repositories configured
+ true
+ return
+ helm_update install
+ '[' helm_v3 == helm_v3 ']'
++ helm_v3 ls -f '^traefik$' --namespace kube-system --output json
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
Error: Kubernetes cluster unreachable: Get "https://10.43.0.1:443/version?timeout=32s": x509: certificate signed by unknown authority
+ LINE=
++ echo
++ cut -f1 -d,
+ INSTALLED_VERSION=
++ echo
++ cut -f2 -d,
+ STATUS=
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-01_HelmChart.yaml'
+ '[' install = delete ']'
+ '[' -z '' ']'
+ '[' -z '' ']'
+ helm_v3 install traefik https://10.43.0.1:443/static/charts/traefik-1.81.0.tgz --values /config/values-01_HelmChart.yaml
Error: failed to download "https://10.43.0.1:443/static/charts/traefik-1.81.0.tgz" (hint: running `helm repo update` may help)
OK, last thing to try before I declare your certs well and proper hosed:
rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json
kubectl delete secret -n kube-system k3s-serving
systemctl restart k3s
journalctl -u k3s | grep 'k3s-serving\|CN=k3s,O=k3s'
After that, delete the Job again and see if the pod runs successfully.
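Once it's back up, you can also check that what the apiserver is serving chains to the on-disk CA (sketch):
echo | openssl s_client -connect 127.0.0.1:6443 -CAfile /var/lib/rancher/k3s/server/tls/server-ca.crt 2>/dev/null | grep -i 'verify return'
A "Verify return code: 0 (ok)" means the serving cert and the CA on disk agree.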
Firehose on...
# rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json
rm: remove regular file ‘/var/lib/rancher/k3s/server/tls/dynamic-cert.json’? y
# kubectl delete secret -n kube-system k3s-serving
secret "k3s-serving" deleted
# systemctl restart k3s
# journalctl -u k3s | grep 'k3s-serving\|CN=k3s,O=k3s'
Feb 09 14:28:07 k3s-ya-1. k3s[12408]: time="2021-02-09T14:28:07.600794282-05:00" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1612641497: notBefore=2021-02-06 19:58:17 +0000 UTC notAfter=2022-02-09 19:28:07 +0000 UTC"
Feb 09 14:28:13 k3s-ya-1. k3s[12408]: time="2021-02-09T14:28:13.714099802-05:00" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1612641497: notBefore=2021-02-06 19:58:17 +0000 UTC notAfter=2022-02-09 19:28:13 +0000 UTC"
Feb 09 14:28:13 k3s-ya-1. k3s[12408]: time="2021-02-09T14:28:13.717815726-05:00" level=info msg="Updating TLS secret for k3s-serving (count: 16): map[listener.cattle.io/cn-10.1.0.139:10.1.0.139 listener.cattle.io/cn-10.1.0.140:10.1.0.140 listener.cattle.io/cn-10.1.0.164:10.1.0.164 listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.82:10.1.0.82 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.1.0.84:10.1.0.84 listener.cattle.io/cn-10.1.0.85:10.1.0.85 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-k3s-ya-2.:k3s-ya-2. listener.cattle.io/cn-k3s.:k3s. listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=0E94D246BD6203C7A0353117828477705A5F51B0]"
Feb 09 14:28:13 k3s-ya-1. k3s[12408]: time="2021-02-09T14:28:13.721833744-05:00" level=info msg="Active TLS secret k3s-serving (ver=108089263) (count 16): map[listener.cattle.io/cn-10.1.0.139:10.1.0.139 listener.cattle.io/cn-10.1.0.140:10.1.0.140 listener.cattle.io/cn-10.1.0.164:10.1.0.164 listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.82:10.1.0.82 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.1.0.84:10.1.0.84 listener.cattle.io/cn-10.1.0.85:10.1.0.85 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-k3s-ya-2.:k3s-ya-2. listener.cattle.io/cn-k3s.:k3s. listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=0E94D246BD6203C7A0353117828477705A5F51B0]"
Feb 09 19:55:42 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:42.034412984-05:00" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1612641497: notBefore=2021-02-06 19:58:17 +0000 UTC notAfter=2022-02-10 00:55:42 +0000 UTC"
Feb 09 19:55:48 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:48.147574740-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148589) (count 7): map[listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=CCE5CF00DA34DD811579869EC9196549F627F4A6]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.072638937-05:00" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1612641497: notBefore=2021-02-06 19:58:17 +0000 UTC notAfter=2022-02-10 00:55:49 +0000 UTC"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.075930185-05:00" level=info msg="Updating TLS secret for k3s-serving (count: 8): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=71E6018A05F0851D7B893FB1F8B8B83DFAAF848F]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.080154345-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148601) (count 8): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=71E6018A05F0851D7B893FB1F8B8B83DFAAF848F]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.119200588-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148602) (count 8): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=CBD53CE8401D1E09D2D8C50BFED40810D03E35C7]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.124302788-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148603) (count 8): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=71E6018A05F0851D7B893FB1F8B8B83DFAAF848F]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.129007974-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148604) (count 8): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=CBD53CE8401D1E09D2D8C50BFED40810D03E35C7]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.135130514-05:00" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1612641497: notBefore=2021-02-06 19:58:17 +0000 UTC notAfter=2022-02-10 00:55:49 +0000 UTC"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.137025026-05:00" level=info msg="Updating TLS secret for k3s-serving (count: 9): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.82:10.1.0.82 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=CD609826BFA62B9F651AC51AFE546E95C3D50562]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.140836592-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148606) (count 9): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.82:10.1.0.82 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=CD609826BFA62B9F651AC51AFE546E95C3D50562]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.635968629-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148621) (count 9): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.82:10.1.0.82 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=A6D448BF0AD30679603D049B08676028C4CEB141]"
(BTW, I did a search/replace to remove the domain from the hostnames, so there are some misplaced dots.)
# k get nodes
NAME STATUS ROLES AGE VERSION
k3s-ya-1 Ready master 5h34m v1.19.7+k3s1
# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-66c464876b-24dh7 0/1 Running 0 4m53s
helm-install-traefik-njn8m 0/1 CrashLoopBackOff 22 68m
local-path-provisioner-7ff9579c6-745hm 0/1 CrashLoopBackOff 5 4m53s
metrics-server-7b4f8b595-w6wsp 0/1 CrashLoopBackOff 5 4m52s
Damn.
Ok, so I then deleted all of the above matching manifests to end up here:
# kubectl get pods -n kube-system
No resources found in kube-system namespace.
And.
# kaf traefik.yaml
helmchart.helm.cattle.io/traefik created
# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
helm-install-traefik-279jm 0/1 CrashLoopBackOff 2 28s
Normally that's a completely harmless thing to do. The fact that it's worse after that suggests that there is some severe disagreement within the cluster about what CA certificates are trusted. I suspect that you had multiple nodes with different cluster CAs signing different certificates within the cluster. At this point I'm not sure it's worth the work to clean up.
Yup, I'm going to call it. The only reason I spun up k3s in the first place was to take a look at Rancher. Now that Rancher can run on any k8s, I'll spin up a bare-metal k8s cluster instead. However, I'm also reconsidering the Rancher approach now; I'm more interested in stability from an SRE POV than a pretty face at this point. Although the Longhorn project is very appealing...
The bare-metal k8s cluster I set up a few years ago is still stable without any significant interventions/upgrades needed so far, and it survives reboots, hard or soft, which seem to be the root cause of the issues with k3s.
@brandond, your support, awesome responsiveness, and willingness to jump right in and help are hugely appreciated!
I had the same error message after uninstalling and re-installing K3s. It turns out the problem was that my ~/.kube/config was still referring to the old cluster. Delete that and then cp /etc/rancher/k3s/k3s.yaml ~/.kube/config to get the new context.
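A quick way to confirm which CA a kubeconfig is actually carrying (sketch):
kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d | openssl x509 -noout -subject
The subject should match the CN=k3s-server-ca@... you see in /var/lib/rancher/k3s/server/tls/server-ca.crt.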
I had the same error message because I wanted to use 443 for my kubeconfig, so I was port-forwarding 443 to 6443 through firewalld. It worked when Traefik was down; once Traefik started up, it didn't.
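For anyone trying the same thing, the forward was along these lines (sketch), and it stops working once the Traefik service LB starts listening on 443 on the node:
firewall-cmd --permanent --add-forward-port=port=443:proto=tcp:toport=6443
firewall-cmd --reload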
I had the same error message now after uninstalling and re-installing K3s. Turns out the problem was my ~/.kube/config was still referring to the old cluster. Delete that and then cp /etc/rancher/k3s/k3s.yaml ~/.kube/config to get the new context.
And also consider this command:
cp /etc/rancher/rke2/rke2.yaml .kube/config
:)
Environmental Info:
K3s Version:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
3 masters
Describe the bug:
However, the result is inconsistent. Sometimes the first master node will work but the 2nd and 3rd nodes get Unable to connect to the server: x509: certificate signed by unknown authority
Steps To Reproduce:
etcd certs are copied into /root
First node - k3s-ya-1
^^^ this result is inconsistent - sometimes works, sometimes not
cat /var/lib/rancher/k3s/server/node-token to get the token for use with additional nodes.
2nd node - k3s-ya-2
^^^ this time it worked - in the last 3 attempts the 2nd node didn't work but the 1st node did - go figure.
3rd node
k3s-uninstall.sh
export INSTALL_K3S_VERSION=v1.19.7+k3s1
export K3S_DATASTORE_ENDPOINT=https://etcd1.k8s:2379,https://etcd2.k8s,https://etcd3.k8.:2379
export K3S_DATASTORE_CAFILE=/root/ca.crt
export K3S_DATASTORE_CERTFILE=/root/apiserver-etcd-client.crt
export K3S_DATASTORE_KEYFILE=/root/apiserver-etcd-client.key
export K3S_TOKEN=--from first node--
export K3S_URL=https://k3s:6443
export K3S_KUBECONFIG_OUTPUT=/root/kube.confg
k3s.install server
^^^ more as expected - half the time this yields Unable to connect to the server: x509: certificate signed by unknown authority
Expected behavior:
Consistent behavior after k3s sever is installed.
kubectl should work without certificate errors across all nodes.
Actual behavior:
Inconsistent. Some nodes get Unable to connect to the server: x509: certificate signed by unknown authority while others can connect. Uninstall and repeat - different results.
Yesterday the entire cluster was working as expected, with no errors across all nodes, Rancher installed, and another cluster running as expected. Today, Unable to connect to the server: x509: certificate signed by unknown authority on every k3s node. It's almost like the certificates are playing musical chairs.
Additional context / logs:
Samples from /var/log/messages