k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Failure to read certificates and key files during k3s certificate rotate-ca #10689

Closed TheOnlyWei closed 1 month ago

TheOnlyWei commented 3 months ago

Environmental Info:

K3s Version: 
k3s version v1.27.6+k3s- ()
go version go1.20.10

Node(s) CPU architecture, OS, and Version:

Linux aksee2-ledge 5.15.145.2-1.cm2 #1 SMP Wed Jan 17 15:39:07 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: I have 1 node deployed, so I assume that means I have 1 server and 1 agent?

Describe the bug: After running the command k3s certificate rotate-ca --path=/home/keys/, I get the following errors:

Output

WARN[0000] failed to read /home/keys/tls/etcd/peer-ca.key
WARN[0000] failed to read /home/keys/tls/server-ca.crt
WARN[0000] failed to read /home/keys/tls/request-header-ca.crt
WARN[0000] failed to read /home/keys/tls/request-header-ca.key
WARN[0000] failed to read /home/keys/tls/etcd/server-ca.key
WARN[0000] failed to read /home/keys/tls/etcd/peer-ca.crt
WARN[0000] failed to read /home/keys/tls/server-ca.key
WARN[0000] failed to read /home/keys/tls/client-ca.key
WARN[0000] failed to read /home/keys/tls/etcd/server-ca.crt
WARN[0000] failed to read /home/keys/tls/client-ca.crt
FATA[0000] see server log for details: https://127.0.0.1:6443/v1-k3s/cert/cacerts?force=false: 500 Internal Server Error certificate error ID 65180

K3s.service log (reformatted for readability)

certificate error ID 65180: failed to validate new CA certificates and keys: 
ETCDServerCA: new CA is self-signed, 
ETCDServerCAKey: new CA cert and key cannot be loaded as X590KeyPair: open /tmp/cacerts2202349941/tls/etcd/server-ca.key: no such file or directory, 
ETCDPeerCA: new CA is self-signed, 
ETCDPeerCAKey: new CA cert and key cannot be loaded as X590KeyPair: open /tmp/cacerts2202349941/tls/etcd/peer-ca.key: no such file or directory, 
ServerCA: new CA is self-signed, 
ServerCAKey: new CA cert and key cannot be loaded as X590KeyPair: open /tmp/cacerts2202349941/tls/server-ca.key: no such file or directory, 
ClientCA: new CA is self-signed, 
ClientCAKey: new CA cert and key cannot be loaded as X590KeyPair: open /tmp/cacerts2202349941/tls/client-ca.key: no such file or directory, 
RequestHeaderCA: new CA is self-signed, 
RequestHeaderCAKey: new CA cert and key cannot be loaded as X590KeyPair: open /tmp/cacerts2202349941/tls/request-header-ca.key: no such file or directory

/home/keys/tls directory

root@aksee2-ledge [ /home/keys/tls ]# ls
service.key

It seems that if the user doesn't supply all of the files listed above, as defined by ControlRuntimeBootstrap, the server call fails. https://github.com/k3s-io/k3s/blob/0ee714d62b257c90f2f6ace9caa2fb0e78b06e96/pkg/daemons/config/types.go#L290

But the documentation implies that you can rotate just a single certificate or key: https://docs.k3s.io/cli/certificate#service-account-issuer-key-rotation

On the server side, judging by the "open /tmp/cacerts2202349941/tls/etcd/server-ca.key: no such file or directory" errors, this happens because the bootstrapData was not copied to the tmpServer's temporary DataDir here: https://github.com/k3s-io/k3s/blob/0ee714d62b257c90f2f6ace9caa2fb0e78b06e96/pkg/server/cert.go#L74

However, I did not see any "failed to write to" error from the WriteToDiskFromStorage function, so I suspect the cause is the continue statement here: https://github.com/k3s-io/k3s/blob/0ee714d62b257c90f2f6ace9caa2fb0e78b06e96/pkg/bootstrap/bootstrap.go#L78

On the CLI side, judging by the "WARN[0000] failed to read" warnings, the path values in the unmarshaled ControlRuntimeBootstrap map, such as the one for ETCDServerCA, are not actually empty strings, even though server-ca.key doesn't exist in my folder /home/keys/tls. Otherwise, the code would have hit the if path == "" block above the block that printed "WARN[0000] failed to read". https://github.com/k3s-io/k3s/blob/0ee714d62b257c90f2f6ace9caa2fb0e78b06e96/pkg/bootstrap/bootstrap.go#L38

Steps To Reproduce: These are the args:

'["server","--kubelet-arg","volume-plugin-dir=/var/lib/libexec/kubernetes/kubelet-plugins/volume/exec/","--write-kubeconfig-mode","644","--disable","local-storage","--disable","traefik","--disable","metrics-server","--disable","servicelb","--cluster-cidr","10.42.0.0/16","--service-cidr","10.43.0.0/16","--service-node-port-range","30000-32767","--disable-network-policy","true","--system-default-registry","aksiotdevacr.azurecr.io","--kube-controller-arg","flex-volume-plugin-dir=/var/lib/libexec/kubernetes/kubelet-plugins/volume/exec/","--kube-controller-arg","service-account-private-key-file=/var/lib/rancher/k3s/server/tls/service.key","--kube-apiserver-arg","service-account-issuer=<issuer omitted for length>","--kube-apiserver-arg","service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key","--kube-apiserver-arg","service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key","--flannel-conf","/opt/.aksedge/config/k3s/flannel-conf.json","--tls-san","192.168.0.2","--cluster-init","true","--kube-apiserver-arg","service-account-extend-token-expiration=false","--kube-apiserver-arg","service-account-max-token-expiration=1h00m0s","--config","/var/.eflow/config/k3s/k3s-config.yml"]'

For the exact steps, I used a Microsoft quick-start tool:

  1. Create an Azure VM with the following configurations: size: Standard D4ads v5 (4 vcpus, 16 GiB memory) host requirements: https://learn.microsoft.com/en-us/azure/aks/hybrid/aks-edge-system-requirements
  2. Follow the instructions in this link to set up the host.
  3. Follow the instructions in this link to create the K3s deployment (skip Windows worker nodes).

Expected behavior: Should not throw an error.

Actual behavior: Throws an error. See above.

brandond commented 2 months ago

k3s version v1.27.6+k3s- ()

This is an old release of Kubernetes, and the version string doesn't match our release tag format... is this a custom build?

Did you follow the steps in the docs to append the old keys to the file before rotating it?

FATA[0000] see server log for details: https://127.0.0.1:6443/v1-k3s/cert/cacerts?force=false: 500 Internal Server Error certificate error ID 71168 certificate error ID 18624: failed to validate new CA certificates and keys:

The CLI output and error log are not from the same event. Please read the logs more carefully, and find the error message in the logs with an ID that corresponds to the ID reported by the CLI. I suspect that it's complaining that the existing primary key isn't present in the new file.

TheOnlyWei commented 2 months ago

@brandond Sorry, I ran the commands multiple times and copied errors from different runs. They are related. I reproduced the error with the same messages and updated the error logs so that both show the same error ID 65180. The K3s builds here are simply quick-start builds that pass specific parameter configurations to the k3s server command. Can you clarify what you mean by "custom builds"? Here are the MSI files from Microsoft: https://learn.microsoft.com/en-us/azure/aks/hybrid/aks-edge-howto-setup-machine#download-aks-edge-essentials

I also tried key rotation with a service.key containing the same keys as the existing /var/lib/rancher/k3s/server/tls/service.key, and I got the same error. I forgot to mention that the --force parameter fixes the error.

brandond commented 2 months ago

The K3s builds here are simply quick-start builds that pass specific parameter configurations to the k3s server command. Can you clarify what you mean by "custom builds"? Here are the MSI files from Microsoft: https://learn.microsoft.com/en-us/azure/aks/hybrid/aks-edge-howto-setup-machine#download-aks-edge-essentials

We don't support those packages; we only support the binaries available here in our GH releases. Also, our releases have version strings that look like this: k3s version v1.27.6+k3s1 (bd04941a), not k3s version v1.27.6+k3s- ()
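
The difference between the two version strings can be checked with a simple pattern. This is a rough sketch: the regular expression is inferred only from the single example in this comment and is an assumption, not an official definition of the K3s release tag format, so it may reject other legitimate releases.

```go
package main

import (
	"fmt"
	"regexp"
)

// releaseVersionPattern is an assumed format derived from the example
// "k3s version v1.27.6+k3s1 (bd04941a)": a semantic version, a "+k3s"
// suffix with a numeric revision, and a hexadecimal commit in parentheses.
var releaseVersionPattern = regexp.MustCompile(`^k3s version v\d+\.\d+\.\d+\+k3s\d+ \([0-9a-f]+\)$`)

// looksLikeReleaseVersion reports whether line matches the assumed format.
func looksLikeReleaseVersion(line string) bool {
	return releaseVersionPattern.MatchString(line)
}

func main() {
	for _, line := range []string{
		"k3s version v1.27.6+k3s1 (bd04941a)",
		"k3s version v1.27.6+k3s- ()",
	} {
		fmt.Printf("%q matches: %v\n", line, looksLikeReleaseVersion(line))
	}
}
```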

I will have to figure out what is going on. I do see the expected messages about various files not being provided, which is correct. However, something seems to be incorrectly indicating that the empty files contain self-signed certs. I suspect that a library function somewhere has changed out from under us.

Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: ETCDServerCA not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: ETCDServerCAKey not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: ETCDPeerCA not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: ETCDPeerCAKey not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: ServerCA not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: ServerCAKey not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: ClientCA not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: ClientCAKey not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: PasswdFile not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: RequestHeaderCA not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: RequestHeaderCAKey not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: IPSECKey not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: EncryptionConfig not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=info msg="certificate: EncryptionHash not provided; using current value"
Aug 13 00:47:07 systemd-node-1 k3s[3532]: time="2024-08-13T00:47:07Z" level=error msg="certificate error ID 74792: failed to validate new CA certificates and keys: ETCDServerCA: new CA is self-signed, ETCDServerCAKey: new CA cert and key cannot be loaded as X590KeyPair: open /tmp/cacerts4288513683/tls/etcd/server-ca.key: no such file or directory, ETCDPeerCA: new CA is self-signed, ETCDPeerCAKey: new CA cert and key cannot be loaded as X590KeyPair: open /tmp/cacerts4288513683/tls/etcd/peer-ca.key: no such file or directory, ServerCA: new CA is self-signed, ServerCAKey: new CA cert and key cannot be loaded as X590KeyPair: open /tmp/cacerts4288513683/tls/server-ca.key: no such file or directory, ClientCA: new CA is self-signed, ClientCAKey: new CA cert and key cannot be loaded as X590KeyPair: open /tmp/cacerts4288513683/tls/client-ca.key: no such file or directory, RequestHeaderCA: new CA is self-signed, RequestHeaderCAKey: new CA cert and key cannot be loaded as X590KeyPair: open /tmp/cacerts4288513683/tls/request-header-ca.key: no such file or directory"

If you run the command with --force it should do what you want despite the warnings.

brandond commented 2 months ago

I see the issue - the docs assume that you're already using custom CAs when attempting to rotate just the service account signing key. If you're using the default self-signed CAs and try to rotate just the service account signing key, it takes issue with the current CA values - which it shouldn't be validating.

I'll have it skip validation if reusing the current files.

VestigeJ commented 1 month ago

https://github.com/k3s-io/k3s/issues/10741#issuecomment-2347343723