k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

K3S startup stuck in a deadlock when a KMS provider is configured and the node is rebooted #10058

Open · jirenugo opened this issue 2 weeks ago

jirenugo commented 2 weeks ago

Environmental Info: K3s Version:

k3s version v1.29.4+k3s1 (94e29e2e)
go version go1.21.9

Node(s) CPU architecture, OS, and Version:

Linux TDC1792640621 5.15.0-1061-azure #70~20.04.1-Ubuntu SMP Mon Apr 8 15:38:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

Describe the bug:

Steps To Reproduce:

Expected behavior:

k3s starts up successfully and starts the KMS pod

Actual behavior:

k3s startup stalls before it can start the KMS pod: it attempts to decrypt a secret (/registry/secrets/kube-system/k3s-serving) that is now encrypted by the KMS provider, and the KMS pod cannot come up until k3s finishes starting

Are there any workarounds for this issue? Is it possible to configure k3s to store its bootstrap secrets as a different resource type so that they can be exempted from KMS encryption?
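
For context, KMS encryption of secrets is driven by an EncryptionConfiguration file handed to the kube-apiserver (in k3s this is typically passed through --kube-apiserver-arg=encryption-provider-config=...). The sketch below is illustrative only; the provider name, socket path, and timeout are placeholders rather than the configuration actually in use here. It shows why the apiserver must be able to reach the plugin's Unix socket before it can read any secret that has already been rewritten with the KMS key:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # The first provider is used for new writes; once a secret such as
      # kube-system/k3s-serving has been rewritten through it, only this
      # KMS plugin can decrypt it again.
      - kms:
          apiVersion: v2
          name: example-kms-plugin                # placeholder name
          endpoint: unix:///opt/kms/socket.sock   # placeholder socket path
          timeout: 3s
      # identity lets the apiserver read data that was stored unencrypted,
      # but it cannot decrypt data already encrypted by the KMS provider.
      - identity: {}

The apiserver tries each listed provider when decrypting, so the identity fallback only helps for data that was never encrypted; it does not break the deadlock once the bootstrap secret has been encrypted by the plugin.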

Additional context / logs:

Logs from the systemd service attempting to decrypt the secret protected by KMS:

● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: activating (start) since Wed 2024-05-01 22:09:30 UTC; 6min ago
       Docs: https://k3s.io
   Main PID: 1378 (k3s-server)
      Tasks: 75
     Memory: 682.4M
     CGroup: /system.slice/k3s.service
             ├─1378 /usr/local/bin/k3s server
             └─2167 containerd

May 01 22:15:58 TDC1792640621 k3s[1378]: I0501 22:15:58.418075    1378 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
May 01 22:15:58 TDC1792640621 k3s[1378]: E0501 22:15:58.418123    1378 controller.go:102] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, >
May 01 22:15:58 TDC1792640621 k3s[1378]: , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
May 01 22:15:58 TDC1792640621 k3s[1378]: I0501 22:15:58.419228    1378 controller.go:109] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
May 01 22:15:58 TDC1792640621 k3s[1378]: E0501 22:15:58.569280    1378 transformer.go:163] "failed to decrypt data" err="got unexpected nil transformer"
May 01 22:15:58 TDC1792640621 k3s[1378]: W0501 22:15:58.569326    1378 reflector.go:539] storage/cacher.go:/secrets: failed to list *core.Secret: unable to transform key "/registry/secrets/kube-system/k3s-serving": got unexpected>
May 01 22:15:58 TDC1792640621 k3s[1378]: E0501 22:15:58.569335    1378 cacher.go:475] cacher (secrets): unexpected ListAndWatch error: failed to list *core.Secret: unable to transform key "/registry/secrets/kube-system/k3s-servin>
May 01 22:15:59 TDC1792640621 k3s[1378]: E0501 22:15:59.570628    1378 transformer.go:163] "failed to decrypt data" err="got unexpected nil transformer"
May 01 22:15:59 TDC1792640621 k3s[1378]: W0501 22:15:59.570669    1378 reflector.go:539] storage/cacher.go:/secrets: failed to list *core.Secret: unable to transform key "/registry/secrets/kube-system/k3s-serving": got unexpected>
May 01 22:15:59 TDC1792640621 k3s[1378]: E0501 22:15:59.570679    1378 cacher.go:475] cacher (secrets): unexpected ListAndWatch error: failed to list *core.Secret: unable to transform key "/registry/secrets/kube-system/k3s-servin>
brandond commented 2 weeks ago

The KMS provider itself runs as a pod on the cluster.

I'm not familiar with this deployment pattern for KMS providers - why are you trying to do this? It suffers from the obvious chicken-and-egg problem you're running into here, where the cluster can't start because it needs access to something that won't be available until after it's up.

You're trying to figure out how to lock your keys in the car but still open the door. I don't think there's a good way to make this work.

jirenugo commented 2 weeks ago

The KMS provider itself runs as a pod on the cluster.

This is not an uncommon pattern for KMS deployment. Arguably k3s has a circular dependency on kubernetes secrets. It is unfortunate that this is not part of the conformance tests, at least as far as I can tell.

https://github.com/kubernetes-sigs/aws-encryption-provider
https://github.com/Azure/kubernetes-kms?tab=readme-ov-file
https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/barbican-kms-plugin/using-barbican-kms-plugin.md
https://github.com/Tencent/tke-kms-plugin/blob/90b71a5c7d78a564567040ebe1ce7135afe99ce5/deployment/tke-kms-plugin.yaml#L4

brandond commented 1 week ago

K3s uses secrets for a couple of things internally: the serving certificate for its supervisor/apiserver listener (the kube-system/k3s-serving secret seen in your logs) and node password secrets for agents joining the cluster.

Both of these should soft-fail and retry until secrets can be read. Where exactly does k3s startup stall?

I see that https://github.com/kubernetes-sigs/aws-encryption-provider for example suggests running the KMS as a static pod - are you doing that by placing the pod spec in a file in /var/lib/rancher/k3s/agent/pod-manifests/, or are you trying to deploy it via kubectl apply?
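
For reference, a minimal static-pod sketch for the pod-manifests directory mentioned above; the pod name, image, socket directory, and priority class are placeholders rather than anything taken from this issue, and as the follow-up below notes this still runs into the same chicken-and-egg problem:

# Saved as, for example, /var/lib/rancher/k3s/agent/pod-manifests/kms-plugin.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kms-plugin
  namespace: kube-system
spec:
  hostNetwork: true
  priorityClassName: system-node-critical
  containers:
    - name: kms-plugin
      image: registry.example.com/kms-plugin:latest   # placeholder image
      volumeMounts:
        - name: kms-socket
          mountPath: /opt/kms   # must match the socket path in the encryption config
  volumes:
    - name: kms-socket
      hostPath:
        path: /opt/kms
        type: DirectoryOrCreate

The kubelet started by k3s picks manifests up from that directory without needing kubectl, but the pod still cannot run until k3s itself is up, which is exactly the dependency loop described in this issue.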

jirenugo commented 1 week ago

suggests running the KMS as a static pod

Yes. Static pods have the same issue.

Both of these should soft-fail and retry until secrets can be read. Where exactly does k3s startup stall?

I don't know. I attached the logs from the systemd service in the issue where it's trying to access /registry/secrets/kube-system/k3s-serving. Does that answer your question? Why does it hard fail on this secret? I can get more logs if you share instructions.