k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Unable to connect to the server: x509: certificate signed by unknown authority - inconsistent behavior #2914

Closed recipedude closed 3 years ago

recipedude commented 3 years ago

Environmental Info:
K3s Version:

# k3s -v
k3s version v1.19.7+k3s1 (5a00e38d)

Node(s) CPU architecture, OS, and Version:

Cluster Configuration:

3 masters

Describe the bug:

# kubectl get nodes
Unable to connect to the server: x509: certificate signed by unknown authority
[root@k3s-ya-1 ~]# k3s kubectl get nodes
Unable to connect to the server: x509: certificate signed by unknown authority
[root@k3s-ya-1 ~]#

However, the result is inconsistent. Sometimes the first master node works, but the 2nd and 3rd nodes report Unable to connect to the server: x509: certificate signed by unknown authority.

Steps To Reproduce:

etcd certs are copied into /root

First node - k3s-ya-1

k3s-uninstall.sh
export INSTALL_K3S_VERSION=v1.19.7+k3s1 
export K3S_DATASTORE_CAFILE=/root/ca.crt
export K3S_DATASTORE_CERTFILE=/root/apiserver-etcd-client.crt
export K3S_DATASTORE_KEYFILE=/root/apiserver-etcd-client.key
export K3S_KUBECONFIG_OUTPUT=/root/kube.confg
export K3S_DATASTORE_ENDPOINT=https://etcd1.k8s:2379,https://etcd2.k8s,https://etcd3.k8s:2379
k3s.install server 
# kubectl get nodes
Unable to connect to the server: x509: certificate signed by unknown authority

^^^ this result is inconsistent - sometimes works, sometimes not
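As an aside, one way to rule out the external datastore itself is to hit the etcd endpoints with the same CA and client cert that k3s is given. A minimal sketch, assuming etcdctl v3 is installed and using the cert paths from the exports above (ports normalized to 2379):

# Every endpoint should report "is healthy"; a TLS failure here would point at the etcd certs, not k3s
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd1.k8s:2379,https://etcd2.k8s:2379,https://etcd3.k8s:2379 \
  --cacert=/root/ca.crt \
  --cert=/root/apiserver-etcd-client.crt \
  --key=/root/apiserver-etcd-client.key \
  endpoint health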

cat /var/lib/rancher/k3s/server/node-token to get token for use with additional nodes.

2nd node - k3s-ya-2

k3s-uninstall.sh
export INSTALL_K3S_VERSION=v1.19.7+k3s1 
export K3S_DATASTORE_ENDPOINT=https://etcd1.k8s:2379,https://etcd2.k8s,https://etcd3.k8s:2379
export K3S_DATASTORE_CAFILE=/root/ca.crt
export K3S_DATASTORE_CERTFILE=/root/apiserver-etcd-client.crt
export K3S_DATASTORE_KEYFILE=/root/apiserver-etcd-client.key
export K3S_TOKEN=--from first node--
export K3S_URL=https://k3s:6443
export K3S_KUBECONFIG_OUTPUT=/root/kube.confg
k3s.install server
# kubectl get nodes
NAME                  STATUS   ROLES                  AGE    VERSION
k3s-ya-1   Ready    control-plane,master   2d5h   v1.19.7+k3s1
k3s-ya-2   Ready    control-plane,master   36h    v1.19.7+k3s1

^^^ this time it worked - last 3 attempts 2nd node didn't work but the 1st node did - go figure.

3rd node

k3s-uninstall.sh
export INSTALL_K3S_VERSION=v1.19.7+k3s1
export K3S_DATASTORE_ENDPOINT=https://etcd1.k8s:2379,https://etcd2.k8s,https://etcd3.k8.:2379
export K3S_DATASTORE_CAFILE=/root/ca.crt
export K3S_DATASTORE_CERTFILE=/root/apiserver-etcd-client.crt
export K3S_DATASTORE_KEYFILE=/root/apiserver-etcd-client.key
export K3S_TOKEN=--from first node--
export K3S_URL=https://k3s:6443
export K3S_KUBECONFIG_OUTPUT=/root/kube.confg
k3s.install server

# kubectl get nodes
NAME                  STATUS   ROLES                  AGE    VERSION
k3s-ya-1   Ready    control-plane,master   2d5h   v1.19.7+k3s1
k3s-ya-2   Ready    control-plane,master   36h    v1.19.7+k3s1
k3s-ya-3   Ready    master                 19s    v1.19.7+k3s1

^^^ more as expected - about half the time this yields Unable to connect to the server: x509: certificate signed by unknown authority

Expected behavior:

Consistent behavior after the k3s server is installed: kubectl should work without certificate errors across all nodes.

Actual behavior:

Inconsistent. Some nodes report Unable to connect to the server: x509: certificate signed by unknown authority while others can connect. Uninstall and repeat, and the results differ each time.

Yesterday the entire cluster was working as expected: no errors across any nodes, with Rancher installed and managing another cluster normally. Today, every k3s node reports Unable to connect to the server: x509: certificate signed by unknown authority.

It's almost like the certificates are playing musical chairs.

Additional context / logs:

Samples from /var/log/messages

Feb  8 20:55:42 k3s-ya-1 k3s: time="2021-02-08T20:55:42.901658387-05:00" level=info msg="Cluster-Http-Server 2021/02/08 20:55:42 http: TLS handshake error from 10.1.0.84:43082: remote error: tls: bad certificate"
Feb  8 20:55:43 k3s-ya-1 k3s: time="2021-02-08T20:55:43.012864767-05:00" level=info msg="Cluster-Http-Server 2021/02/08 20:55:43 http: TLS handshake error from 10.42.2.175:46490: remote error: tls: bad certificate"
Feb  8 20:56:37 k3s-ya-2 k3s: time="2021-02-08T20:56:37.629125982-05:00" level=info msg="Cluster-Http-Server 2021/02/08 20:56:37 http: TLS handshake error from 10.1.0.85:35180: remote error: tls: bad certificate"
Feb  8 20:56:37 k3s-ya-2 k3s: time="2021-02-08T20:56:37.840388714-05:00" level=info msg="Cluster-Http-Server 2021/02/08 20:56:37 http: TLS handshake error from 10.1.0.83:42518: remote error: tls: bad certificate"
Feb  8 20:57:49 k3s-ya-3 k3s: E0208 20:57:49.215716     829 event.go:273] Unable to write event: 'Patch "https://127.0.0.1:6443/api/v1/namespaces/kube-system/events/helm-install-traefik-4lncd.1661f1476f8d4e12": x509: certificate signed by unknown authority' (may retry after sleeping)
Feb  8 20:57:49 k3s-ya-3 k3s: time="2021-02-08T20:57:49.361122818-05:00" level=info msg="Connecting to proxy" url="wss://10.1.0.81:6443/v1-k3s/connect"
Feb  8 20:57:49 k3s-ya-3 k3s: time="2021-02-08T20:57:49.362442817-05:00" level=error msg="Failed to connect to proxy" error="x509: certificate signed by unknown authority"
Feb  8 20:57:49 k3s-ya-3 k3s: time="2021-02-08T20:57:49.362463456-05:00" level=error msg="Remotedialer proxy error" error="x509: certificate signed by unknown authority"
Feb  8 20:57:49 k3s-ya-3 k3s: time="2021-02-08T20:57:49.367212594-05:00" level=info msg="Connecting to proxy" url="wss://10.1.0.82:6443/v1-k3s/connect"
brandond commented 3 years ago

You don't need to set K3S_URL (--server) when using an external datastore; this is only for use when joining agents or using embedded etcd.
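For illustration, the two cases look roughly like this. A sketch using the stock install script from get.k3s.io; the hostname and token are placeholders:

# Server backed by an external datastore: no K3S_URL needed.
curl -sfL https://get.k3s.io | K3S_DATASTORE_ENDPOINT='https://etcd1.k8s:2379' sh -s - server

# Agent joining an existing server: this is where K3S_URL and K3S_TOKEN belong.
curl -sfL https://get.k3s.io | K3S_URL=https://<server>:6443 K3S_TOKEN=<node-token> sh -s - agent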

I am curious how you came to have two nodes with the control-plane role label. This wasn't added until 1.20, yet your nodes are all still on 1.19. Did you upgrade temporarily, and then downgrade again?

In the past I have seen behavior like this when servers were all brought up at the same time and raced to bootstrap the cluster CA certs, or when nodes were started up with existing certs from a different cluster that they then try to use instead of the ones recognized by the rest of the cluster.
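A quick way to check for that condition is to compare the CA files across the server nodes; if the servers raced each other at bootstrap, the fingerprints won't match. A sketch against the standard k3s TLS directory:

# Run on every server node; the hashes should be identical everywhere.
sha256sum /var/lib/rancher/k3s/server/tls/server-ca.crt \
          /var/lib/rancher/k3s/server/tls/client-ca.crt \
          /var/lib/rancher/k3s/server/tls/request-header-ca.crt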

It sounds like these nodes have been through some odd things. I run my personal cluster with an external etcd and haven't had any problems with it; I suspect something in the way you started up, upgraded, or grew this cluster has left it very confused about what certificates to use.

recipedude commented 3 years ago

You don't need to set K3S_URL

Added K3S_URL to see if it would make any difference. I started second-guessing myself and wondering if the 2nd/3rd nodes were fighting with the first, so I added K3S_URL in case it was the missing piece, but it didn't seem to change any behavior. Nice to get some clarification that only agent nodes need it.

curious how you came to have two nodes with the control-plane role label

Was wondering about that as well. The control-plane label comes and goes. At one point I saw nodes reporting 1.20 in the get nodes output even though the INSTALL_K3S_VERSION=v1.19.7+k3s1 env var was set, which felt weird. There was an initial upgrade run without that env var set a couple of days ago and the nodes all ended up on v1.20+, but then Rancher refused to install, so I uninstalled each node and added the env var to downgrade so that Rancher would install.

That got everything running again - but then the next day every node was back to Unable to connect to the server: x509: certificate signed by unknown authority.

seen behavior like this when servers were all brought up at the same time

Makes sense. I'm starting/upgrading/uninstalling nodes one by one through this process, so I doubt that was happening on these re-installs. However, the initial failure came after the physical server (a KVM Linux box, with each node being a VM) was shut down gracefully and restarted, so the nodes would have been restarting at pretty much the same time when the box was powered back up.

sounds like these nodes have been through some odd things.

Feels the same. Initial installation was last March, followed by a week of testing (k3s + Rancher) - then the nodes just sat there idling until I powered the box down and back up, and the cluster was broken.

FYI, the original nodes running this cluster have all been deleted and replaced with brand new, fresh 'n clean VMs in the hope of purging any weirdness.

It's disconcerting that there doesn't seem to be a path to recover this sick cluster. There's no way I would feel confident going into production if a reboot (graceful or otherwise) could throw things into an unrecoverable state.

Where exactly is the configuration for the certificate(s) on each node located?

brandond commented 3 years ago

The control-plane label comes and goes. At one point I saw nodes reporting 1.20 in the get nodes output even though INSTALL_K3S_VERSION=v1.19.7+k3s1 env var was set which felt weird.

That variable is only used by the install script. Do you somehow have something running that is reinstalling and restarting K3s? It doesn't self-update, although the system-upgrade-controller (available through Rancher) will create jobs to do rolling updates to the cluster. Is that something you were playing with at some point? Do you perhaps have multiple nodes (old VMs or something) with duplicate hostnames pointed at the etcd datastore?

It's disconcerting that there doesn't seem to be a path to recover this sick cluster.

At the very least you should be able to stabilize the cluster by going down to a single server node so that they're not all arguing about certificates. Just uninstall all but one, delete the nodes, make sure the local disk is clean, then reinstall one at a time. This assumes you figure out what it is that's changing your versions...
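A rough sketch of that sequence, using the node names from this cluster:

# On the servers being removed (k3s-ya-2, k3s-ya-3):
k3s-uninstall.sh

# On the remaining server, drop the stale Node objects:
kubectl delete node k3s-ya-2 k3s-ya-3

# Then re-add servers one at a time, waiting for each to show Ready:
kubectl get nodes -w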

Where exactly is the configuration for the certificate(s) on each node located?

/var/lib/rancher/k3s/server/tls/
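To see what is actually in there, something like this works (a sketch using openssl):

# List the certificate material k3s generated for this node
ls -l /var/lib/rancher/k3s/server/tls/

# Show the subject, issuer, and validity window of each cert
for f in /var/lib/rancher/k3s/server/tls/*.crt; do
  echo "== $f"
  openssl x509 -in "$f" -noout -subject -issuer -dates
done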

recipedude commented 3 years ago

Do you somehow have something running that is reinstalling and restarting K3s?

Hmm, the VMs are deployed with Chef... I manually walked through the Chef deployment and, best I can tell, it isn't running or restarting k3s anywhere - to be certain I commented out the k3s install section where it downloads the k3s.install script, just to make sure it's not the culprit.

system-upgrade-controller

Haven't played with that at all. Overall the setup is very plain-jane, with not much experimentation beyond installing Rancher on top of k3s.

you should be able to stabilize the cluster by going down to a single server node

Great idea, have done exactly that. Now running a freshly re-installed single node and keeping an eye on the timestamps in /var/lib/rancher/k3s/server/tls/ hoping that will at least tell me if those certs are getting changed by anything.
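Timestamps only tell part of the story; a baseline of fingerprints that can be diffed later catches content changes too. A sketch (the snapshot path is arbitrary):

# Record a baseline of the cert fingerprints...
sha256sum /var/lib/rancher/k3s/server/tls/*.crt > /root/tls-baseline.txt

# ...and later, check whether anything changed underneath k3s
sha256sum -c /root/tls-baseline.txt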

The kube-system pods aren't happy, and the now single-node cluster is limping along in poor shape.

NAME                         READY   STATUS             RESTARTS   AGE
coredns-66c464876b-cm94q     0/1     Running            0          68m
helm-install-traefik-cxbm2   0/1     CrashLoopBackOff   18         70m
svclb-traefik-4t79s          2/2     Running            0          83m
traefik-6f9cbd9bd4-f9r92     1/1     Running            0          81m

helm-install-traefik is crashlooping and coredns is stuck with this in the logs:

[INFO] plugin/ready: Still waiting on: "kubernetes"
E0209 20:54:55.692578       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Endpoints: Get "https://10.43.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E0209 20:54:55.694071       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Namespace: Get "https://10.43.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E0209 20:54:55.694896       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Service: Get "https://10.43.0.1:443/api/v1/services?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E0209 20:54:56.694804       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Endpoints: Get "https://10.43.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E0209 20:54:56.695503       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Namespace: Get "https://10.43.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E0209 20:54:56.696994       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Service: Get "https://10.43.0.1:443/api/v1/services?limit=500&resourceVersion=0": x509: certificate signed by unknown authority

I've run kubectl delete -f /var/lib/rancher/k3s/server/manifests/coredns.yaml and re-applied it, hoping it would clean itself up, but no luck there so far.

Not sure why it's continuing to internally whinge about the certs now that it's a single node.

brandond commented 3 years ago

You might try deleting the crashlooping pod so that it can be recreated with the correct cluster CA certs.

recipedude commented 3 years ago

:) I have deleted those misbehaving pods dozens of times now, as well as deleted the manifests entirely and then re-applied them once I confirmed the pods were indeed destroyed.

# kg -n kube-system
NAME                         READY   STATUS             RESTARTS   AGE
coredns-66c464876b-cm94q     0/1     Running            0          130m
helm-install-traefik-cxbm2   0/1     CrashLoopBackOff   30         133m
svclb-traefik-4t79s          2/2     Running            0          145m
traefik-6f9cbd9bd4-f9r92     1/1     Running            0          144m
[root@k3s-ya-1 server]# k -n kube-system delete pod helm-install-traefik-cxbm2
pod "helm-install-traefik-cxbm2" deleted
[root@k3s-ya-1 server]# kg -n kube-system
NAME                         READY   STATUS             RESTARTS   AGE
coredns-66c464876b-cm94q     0/1     Running            0          131m
helm-install-traefik-6c9jx   0/1     CrashLoopBackOff   2          39s
svclb-traefik-4t79s          2/2     Running            0          146m
traefik-6f9cbd9bd4-f9r92     1/1     Running            0          145m
brandond commented 3 years ago

Did you delete the Helm Job that the pod is coming from?
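For reference, the Job can be checked and removed directly rather than through the manifest. A sketch; in k3s the helm-install Jobs live in kube-system alongside their pods:

kubectl get jobs -n kube-system
kubectl delete job -n kube-system helm-install-traefik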

recipedude commented 3 years ago

Poking at that crashlooping pod some more:

# k -n kube-system logs helm-install-traefik-6c9jx
CHART=$(sed -e "s/%{KUBERNETES_API}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/g" <<< "${CHART}")
set +v -x
+ cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates/
+ update-ca-certificates
WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping
--snip--
[storage] 2021/02/09 21:57:16 listing all releases with filter
[storage/driver] 2021/02/09 21:57:16 list: failed to list: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system/secrets?labelSelector=OWNER%3DTILLER": x509: certificate signed by unknown authority
Error: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system/secrets?labelSelector=OWNER%!D(MISSING)TILLER": x509: certificate signed by unknown authority
--snip--
chart path is a url, skipping repo update
Error: no repositories configured
--snip--
Error: Kubernetes cluster unreachable: Get "https://10.43.0.1:443/version?timeout=32s": x509: certificate signed by unknown authority
--snip--
+ helm_v3 install traefik https://10.43.0.1:443/static/charts/traefik-1.81.0.tgz --values /config/values-01_HelmChart.yaml
Error: failed to download "https://10.43.0.1:443/static/charts/traefik-1.81.0.tgz" (hint: running `helm repo update` may help)
recipedude commented 3 years ago

Did you delete the Helm Job that the pod is coming from?

Pretty sure it's been deleted. Am assuming that kubectl delete -f traefik.yaml (in the /var/lib/rancher/k3s/server/manifests folder) would delete the helm job. That does seem to completely kill off that pod.

# kdf traefik.yaml
helmchart.helm.cattle.io "traefik" deleted

[root@k3s-ya-1 manifests]# kg -n kube-system
NAME                       READY   STATUS    RESTARTS   AGE
coredns-66c464876b-cm94q   0/1     Running   0          139m
svclb-traefik-4t79s        2/2     Running   0          154m
traefik-6f9cbd9bd4-f9r92   1/1     Running   0          153m

[root@k3s-ya-1 manifests]# kaf traefik.yaml
helmchart.helm.cattle.io/traefik created

[root@k3s-ya-1 manifests]# kg -n kube-system
NAME                         READY   STATUS             RESTARTS   AGE
coredns-66c464876b-cm94q     0/1     Running            0          139m
helm-install-traefik-4x847   0/1     CrashLoopBackOff   1          11s
svclb-traefik-4t79s          2/2     Running            0          154m
traefik-6f9cbd9bd4-f9r92     1/1     Running            0          153m
brandond commented 3 years ago

OK this bit is interesting:

  • cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates/
  • update-ca-certificates
    WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping

Somehow the cluster CA has multiple certs in it? Can you cat /var/lib/rancher/k3s/server/tls/server-ca.crt on the server?

recipedude commented 3 years ago

Yup - have been eyeing that same log message - multiple certs!? What the...

# cat /var/lib/rancher/k3s/server/tls/server-ca.crt
-----BEGIN CERTIFICATE-----
MIIBdzCCAR2gAwIBAgIBADAKBggqhkjOPQQDAjAjMSEwHwYDVQQDDBhrM3Mtc2Vy
dmVyLWNhQDE2MTI2NDE0OTcwHhcNMjEwMjA2MTk1ODE3WhcNMzEwMjA0MTk1ODE3
WjAjMSEwHwYDVQQDDBhrM3Mtc2VydmVyLWNhQDE2MTI2NDE0OTcwWTATBgcqhkjO
PQIBBggqhkjOPQMBBwNCAATgE2WSc1B+7yNB3IOxahlI80B+uDNqtQ2OG+shRQtd
uuN3ehchBXgZ/7EzmT5QzKD/OWxgDs6D7GGrHfCRzH+so0IwQDAOBgNVHQ8BAf8E
BAMCAqQwDwYDVR0TAQH/BAUwAwEB/zAdBgNVHQ4EFgQUNguwKhD0HcYEZGwVvs3K
d1XfuGkwCgYIKoZIzj0EAwIDSAAwRQIgW3K54s1DChzOJllhZMBhrBv+zFsmGjg+
/TthN/1Z6U0CIQDu0BZo11CYar1F5h9gyfRLspMLxglCKtXrCwMgHYq2yQ==
-----END CERTIFICATE-----

Looks like a single cert to me. This is so weird - but it certainly is a crash course in troubleshooting k3s!
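If eyeballing is not enough, counting the PEM blocks and verifying that a serving cert actually chains to this CA is straightforward. A sketch; the serving cert path is assumed:

# Should print 1 if the CA bundle really holds a single certificate
grep -c 'BEGIN CERTIFICATE' /var/lib/rancher/k3s/server/tls/server-ca.crt

# Verify one of the serving certs against the on-disk CA
openssl verify -CAfile /var/lib/rancher/k3s/server/tls/server-ca.crt \
  /var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt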

brandond commented 3 years ago

Oh, it turns out that WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping is an upstream issue from Alpine, unrelated to the CA cert we are dropping. I still think that the error is related to the service account somehow. Can you try:

kubectl delete serviceaccount -n kube-system helm-traefik
kubectl delete job -n helm-system helm-install-traefik
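One way to test that theory is to compare the CA bundle mounted into the pod through its service account token secret with the cluster CA on disk. A sketch; the secret name is a placeholder since it is generated per service account:

# Dump the CA the pod is handed via its service account token...
kubectl get secret -n kube-system <helm-traefik-token-xxxxx> \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/sa-ca.crt

# ...and compare it with the CA k3s has on disk
diff /tmp/sa-ca.crt /var/lib/rancher/k3s/server/tls/server-ca.crt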
recipedude commented 3 years ago
# kubectl delete serviceaccount -n kube-system helm-traefik
serviceaccount "helm-traefik" deleted
# kubectl delete job -n helm-system helm-install-traefik
Error from server (NotFound): jobs.batch "helm-install-traefik" not found
# kubectl delete -f traefik.yaml
helmchart.helm.cattle.io "traefik" deleted

Still seeing some traefik pods running though which makes me wonder why they're still there.

# kg -n kube-system
NAME                       READY   STATUS    RESTARTS   AGE
coredns-66c464876b-cm94q   0/1     Running   0          3h53m
svclb-traefik-4t79s        2/2     Running   0          4h8m
traefik-6f9cbd9bd4-f9r92   1/1     Running   0          4h7m

And same crashloop when re-applying traefik.yaml...

So I also deleted the traefik deployment and the svclb-traefik daemonset, re-deleted the traefik serviceaccount, the traefik service, and traefik-prometheus, plus coredns just for good measure.

Ended up here:

# k -n kube-system get pods,svc,ds,rs,deploy,jobs,ingress
Warning: extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
No resources found in kube-system namespace.

Alrighty, looks like kube-system is more or less fully nuked.

# kaf traefik.yaml
helmchart.helm.cattle.io/traefik created
# kg -n kube-system
NAME                         READY   STATUS             RESTARTS   AGE
helm-install-traefik-njn8m   0/1     CrashLoopBackOff   2          35s

Darn.

Here's output of describe on that pod.

# k -n kube-system describe pod helm-install-traefik-njn8m
Name:         helm-install-traefik-njn8m
Namespace:    kube-system
Priority:     0
Node:         k3s-ya-1/10.1.0.83
Start Time:   Tue, 09 Feb 2021 18:52:17 -0500
Labels:       controller-uid=9d44a166-f197-4800-8d9d-f6f81113ccdd
              helmcharts.helm.cattle.io/chart=traefik
              job-name=helm-install-traefik
Annotations:  helmcharts.helm.cattle.io/configHash: SHA256=54DADC5C41A9E92996BEB90979244F7E4F0D86B23C3F54AAF5BBC497C412496E
Status:       Running
IP:           10.42.1.205
IPs:
  IP:           10.42.1.205
Controlled By:  Job/helm-install-traefik
Containers:
  helm:
    Container ID:  containerd://cca390ef69eb6a7b2bc3a7caca0d6b159ffd558e8d840d0814421d3fac6e720c
    Image:         rancher/klipper-helm:v0.4.3
    Image ID:      docker.io/rancher/klipper-helm@sha256:b319bce4802b8e42d46e251c7f9911011a16b4395a84fa58f1cf4c788df17139
    Port:          <none>
    Host Port:     <none>
    Args:
      install
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 09 Feb 2021 18:53:46 -0500
      Finished:     Tue, 09 Feb 2021 18:53:46 -0500
    Ready:          False
    Restart Count:  4
    Environment:
      NAME:              traefik
      VERSION:
      REPO:
      HELM_DRIVER:       secret
      CHART_NAMESPACE:   kube-system
      CHART:             https://%{KUBERNETES_API}%/static/charts/traefik-1.81.0.tgz
      HELM_VERSION:
      TARGET_NAMESPACE:  kube-system
      NO_PROXY:          .svc,.cluster.local,10.42.0.0/16,10.43.0.0/16
    Mounts:
      /chart from content (rw)
      /config from values (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from helm-traefik-token-jwpr6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  values:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chart-values-traefik
    Optional:  false
  content:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chart-content-traefik
    Optional:  false
  helm-traefik-token-jwpr6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  helm-traefik-token-jwpr6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  108s                default-scheduler  Successfully assigned kube-system/helm-install-traefik-njn8m to k3s-ya-1
  Normal   Pulled     19s (x5 over 108s)  kubelet            Container image "rancher/klipper-helm:v0.4.3" already present on machine
  Normal   Created    19s (x5 over 108s)  kubelet            Created container helm
  Normal   Started    19s (x5 over 107s)  kubelet            Started container helm
  Warning  BackOff    6s (x10 over 105s)  kubelet            Back-off restarting failed container

And here's the logs from that pod.

# kl -n kube-system helm-install-traefik-njn8m
CHART=$(sed -e "s/%{KUBERNETES_API}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/g" <<< "${CHART}")
set +v -x
+ cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates/
+ update-ca-certificates
WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping
+ '[' '' '!=' true ']'
+ export HELM_HOST=127.0.0.1:44134
+ HELM_HOST=127.0.0.1:44134
+ tiller --listen=127.0.0.1:44134 + --storage=secret
helm_v2 init --skip-refresh --client-only --stable-repo-url https://charts.helm.sh/stable/
[main] 2021/02/09 23:55:07 Starting Tiller v2.16.10 (tls=false)
[main] 2021/02/09 23:55:07 GRPC listening on 127.0.0.1:44134
[main] 2021/02/09 23:55:07 Probes listening on :44135
[main] 2021/02/09 23:55:07 Storage driver is Secret
[main] 2021/02/09 23:55:07 Max history per release is 0
Creating /root/.helm
Creating /root/.helm/repository
Creating /root/.helm/repository/cache
Creating /root/.helm/repository/local
Creating /root/.helm/plugins
Creating /root/.helm/starters
Creating /root/.helm/cache/archive
Creating /root/.helm/repository/repositories.yaml
Adding stable repo with URL: https://charts.helm.sh/stable/
Adding local repo with URL: http://127.0.0.1:8879/charts
$HELM_HOME has been configured at /root/.helm.
Not installing Tiller due to 'client-only' flag having been set
++ helm_v2 ls --all '^traefik$' ++ jq -r '.Releases | length'
--output json
[storage] 2021/02/09 23:55:07 listing all releases with filter
[storage/driver] 2021/02/09 23:55:07 list: failed to list: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system/secrets?labelSelector=OWNER%3DTILLER": x509: certificate signed by unknown authority
Error: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system/secrets?labelSelector=OWNER%!D(MISSING)TILLER": x509: certificate signed by unknown authority
+ EXIST=
+ '[' '' == 1 ']'
+ '[' '' == v2 ']'
+ shopt -s nullglob
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/traefik.tgz.base64
+ CHART_PATH=/traefik.tgz
+ '[' '!' -f /chart/traefik.tgz.base64 ']'
+ return
+ '[' install '!=' delete ']'
+ helm_repo_init
+ grep -q -e 'https\?://'
chart path is a url, skipping repo update
+ echo 'chart path is a url, skipping repo update'
+ helm_v3 repo remove stable
Error: no repositories configured
+ true
+ return
+ helm_update install
+ '[' helm_v3 == helm_v3 ']'
++ ++ helm_v3 ls -f '^traefik$' --namespace jq kube-system -r --output '"\(.[0].app_version),\(.[0].status)"'json

++ tr '[:upper:]' '[:lower:]'
Error: Kubernetes cluster unreachable: Get "https://10.43.0.1:443/version?timeout=32s": x509: certificate signed by unknown authority
+ LINE=
++ echo
++ cut -f1 -d,
+ INSTALLED_VERSION=
++ echo
++ cut -f2 -d,
+ STATUS=
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-01_HelmChart.yaml'
+ '[' install = delete ']'
+ '[' -z '' ']'
+ '[' -z '' ']'
+ helm_v3 install traefik https://10.43.0.1:443/static/charts/traefik-1.81.0.tgz --values /config/values-01_HelmChart.yaml
Error: failed to download "https://10.43.0.1:443/static/charts/traefik-1.81.0.tgz" (hint: running `helm repo update` may help)
brandond commented 3 years ago

OK, last thing to try before I declare your certs well and proper hosed:

rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json
kubectl delete secret -n kube-system k3s-serving
systemctl restart k3s
journalctl -u k3s | grep 'k3s-serving\|CN=k3s,O=k3s'

After that, delete the Job again and see if the pod runs successfully.
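If that works, the regenerated serving cert should chain cleanly to the on-disk CA. A quick check, sketched with openssl against the local apiserver:

# Show the issuer and validity of the live serving cert
openssl s_client -connect 127.0.0.1:6443 </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates

# Confirm the chain verifies against the cluster CA on disk
openssl s_client -connect 127.0.0.1:6443 \
  -CAfile /var/lib/rancher/k3s/server/tls/server-ca.crt </dev/null 2>/dev/null \
  | grep 'Verify return code'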

recipedude commented 3 years ago

Firehose on...

# rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json
rm: remove regular file ‘/var/lib/rancher/k3s/server/tls/dynamic-cert.json’? y
# kubectl delete secret -n kube-system k3s-serving
secret "k3s-serving" deleted
# systemctl restart k3s
# journalctl -u k3s | grep 'k3s-serving\|CN=k3s,O=k3s'
Feb 09 14:28:07 k3s-ya-1. k3s[12408]: time="2021-02-09T14:28:07.600794282-05:00" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1612641497: notBefore=2021-02-06 19:58:17 +0000 UTC notAfter=2022-02-09 19:28:07 +0000 UTC"
Feb 09 14:28:13 k3s-ya-1. k3s[12408]: time="2021-02-09T14:28:13.714099802-05:00" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1612641497: notBefore=2021-02-06 19:58:17 +0000 UTC notAfter=2022-02-09 19:28:13 +0000 UTC"
Feb 09 14:28:13 k3s-ya-1. k3s[12408]: time="2021-02-09T14:28:13.717815726-05:00" level=info msg="Updating TLS secret for k3s-serving (count: 16): map[listener.cattle.io/cn-10.1.0.139:10.1.0.139 listener.cattle.io/cn-10.1.0.140:10.1.0.140 listener.cattle.io/cn-10.1.0.164:10.1.0.164 listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.82:10.1.0.82 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.1.0.84:10.1.0.84 listener.cattle.io/cn-10.1.0.85:10.1.0.85 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-k3s-ya-2.:k3s-ya-2. listener.cattle.io/cn-k3s.:k3s. listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=0E94D246BD6203C7A0353117828477705A5F51B0]"
Feb 09 14:28:13 k3s-ya-1. k3s[12408]: time="2021-02-09T14:28:13.721833744-05:00" level=info msg="Active TLS secret k3s-serving (ver=108089263) (count 16): map[listener.cattle.io/cn-10.1.0.139:10.1.0.139 listener.cattle.io/cn-10.1.0.140:10.1.0.140 listener.cattle.io/cn-10.1.0.164:10.1.0.164 listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.82:10.1.0.82 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.1.0.84:10.1.0.84 listener.cattle.io/cn-10.1.0.85:10.1.0.85 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-k3s-ya-2.:k3s-ya-2. listener.cattle.io/cn-k3s.:k3s. listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=0E94D246BD6203C7A0353117828477705A5F51B0]"
Feb 09 19:55:42 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:42.034412984-05:00" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1612641497: notBefore=2021-02-06 19:58:17 +0000 UTC notAfter=2022-02-10 00:55:42 +0000 UTC"
Feb 09 19:55:48 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:48.147574740-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148589) (count 7): map[listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=CCE5CF00DA34DD811579869EC9196549F627F4A6]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.072638937-05:00" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1612641497: notBefore=2021-02-06 19:58:17 +0000 UTC notAfter=2022-02-10 00:55:49 +0000 UTC"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.075930185-05:00" level=info msg="Updating TLS secret for k3s-serving (count: 8): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=71E6018A05F0851D7B893FB1F8B8B83DFAAF848F]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.080154345-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148601) (count 8): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=71E6018A05F0851D7B893FB1F8B8B83DFAAF848F]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.119200588-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148602) (count 8): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=CBD53CE8401D1E09D2D8C50BFED40810D03E35C7]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.124302788-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148603) (count 8): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=71E6018A05F0851D7B893FB1F8B8B83DFAAF848F]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.129007974-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148604) (count 8): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=CBD53CE8401D1E09D2D8C50BFED40810D03E35C7]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.135130514-05:00" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1612641497: notBefore=2021-02-06 19:58:17 +0000 UTC notAfter=2022-02-10 00:55:49 +0000 UTC"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.137025026-05:00" level=info msg="Updating TLS secret for k3s-serving (count: 9): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.82:10.1.0.82 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=CD609826BFA62B9F651AC51AFE546E95C3D50562]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.140836592-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148606) (count 9): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.82:10.1.0.82 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=CD609826BFA62B9F651AC51AFE546E95C3D50562]"
Feb 09 19:55:49 k3s-ya-1. k3s[4559]: time="2021-02-09T19:55:49.635968629-05:00" level=info msg="Active TLS secret k3s-serving (ver=108148621) (count 9): map[listener.cattle.io/cn-10.1.0.81:10.1.0.81 listener.cattle.io/cn-10.1.0.82:10.1.0.82 listener.cattle.io/cn-10.1.0.83:10.1.0.83 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=A6D448BF0AD30679603D049B08676028C4CEB141]"

(BTW, I did a search/replace to remove the domain from the hostnames, so there are some misplaced .'s)

# k get nodes
NAME       STATUS   ROLES    AGE     VERSION
k3s-ya-1   Ready    master   5h34m   v1.19.7+k3s1
# kubectl get pods -n kube-system
NAME                                     READY   STATUS             RESTARTS   AGE
coredns-66c464876b-24dh7                 0/1     Running            0          4m53s
helm-install-traefik-njn8m               0/1     CrashLoopBackOff   22         68m
local-path-provisioner-7ff9579c6-745hm   0/1     CrashLoopBackOff   5          4m53s
metrics-server-7b4f8b595-w6wsp           0/1     CrashLoopBackOff   5          4m52s

Damn.

Ok, so I then deleted all of the above matching manifests to end up here:

# kubectl get pods -n kube-system
No resources found in kube-system namespace.

And.

# kaf traefik.yaml
helmchart.helm.cattle.io/traefik created
# kubectl get pods -n kube-system
NAME                         READY   STATUS             RESTARTS   AGE
helm-install-traefik-279jm   0/1     CrashLoopBackOff   2          28s
brandond commented 3 years ago

Normally that's a completely harmless thing to do. The fact that it's worse after that suggests that there is some severe disagreement within the cluster about what CA certificates are trusted. I suspect that you had multiple nodes with different cluster CAs signing different certificates within the cluster. At this point I'm not sure it's worth the work to clean up.

recipedude commented 3 years ago

Yup, I'm going to call it. The only reason I spun up k3s in the first place was to take a look at Rancher. Now that Rancher can run on any k8s, I'll spin up a bare-metal k8s cluster instead. However, I'm also reconsidering the Rancher approach now. I'm more interested in stability from an SRE POV than a pretty face at this point. Although the Longhorn project is very appealing...

The bare-metal k8s cluster I set up a few years ago is still stable, without any significant interventions or upgrades needed so far, and it survives reboots, hard or soft - which seems to be the root cause of the issues with k3s here.

@brandond your support, awesome responsiveness, and willingness to jump right in and help are hugely appreciated!

cawoodm commented 2 years ago

I had the same error message after uninstalling and re-installing K3s. It turns out the problem was that my ~/.kube/config was still referring to the old cluster. Delete that, then cp /etc/rancher/k3s/k3s.yaml ~/.kube/config to get the new context.
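An alternative that avoids the copy going stale is to point KUBECONFIG straight at the file k3s maintains. A sketch:

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get nodes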

erockwood commented 2 years ago

I had the same error message because I wanted to use port 443 in my kubeconfig, so I was port-forwarding 443 to 6443 through firewalld. While Traefik was down it worked; once Traefik started up, it didn't.
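For context, the kind of firewalld rule in question looks like this (a sketch; default zone assumed):

# Forward incoming 443 to the apiserver's 6443 on the same host
firewall-cmd --permanent --add-forward-port=port=443:proto=tcp:toport=6443
firewall-cmd --reload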

Saigut commented 11 months ago

I had the same error message now after uninstalling and re-installing K3S. Turns out the problem was my ~/.kube/config was still referring to the old cluster. Delete that and then cp /etc/rancher/k3s/k3s.yaml ~/.kube/config to get the new context.

And for RKE2, consider the equivalent command: cp /etc/rancher/rke2/rke2.yaml .kube/config :)