rancher / system-upgrade-controller

In your Kubernetes, upgrading your nodes
Apache License 2.0

Yet another case of "x509: certificate signed by unknown authority, requeuing" #278

Open johanneskastl opened 7 months ago

johanneskastl commented 7 months ago

Version v0.13.1

Platform/Architecture openSUSE MicroOS 20231126 (immutable based on openSUSE Tumbleweed)

Describe the bug

time="2023-11-28T06:18:40Z" level=error msg="error syncing 'system-upgrade/k3s-server': handler system-upgrade-controller: Get \"https://update.k3s.io/v1-release/channels/stable\": x509: certificate signed by unknown authority, requeuing"

To Reproduce

Use the following in your plan instead of a version:

channel: https://update.k3s.io/v1-release/channels/stable

Expected behavior

The TLS certificate should be accepted.

Actual behavior

The HTTPS request to the channel URL fails certificate verification, as shown in the log line above.

I checked the mounts in the deployment, and all of them exist on the host:

        volumeMounts:
        - mountPath: /etc/ssl
          name: etc-ssl
          readOnly: true
        - mountPath: /etc/pki
          name: etc-pki
          readOnly: true
        - mountPath: /etc/ca-certificates
          name: etc-ca-certificates
          readOnly: true
        - mountPath: /tmp
          name: tmp

$ ls -ld /etc/pki/ /etc/ssl/ /etc/ca-certificates/
drwxr-xr-x. 1 root root  16 14. Jun 20:05 /etc/ca-certificates//
drwxr-xr-x. 1 root root  10 22. Nov 18:05 /etc/pki//
drwxr-xr-x. 1 root root 198 17. Nov 20:29 /etc/ssl//

Additional context

The bug was reported multiple times in different constellations:

What I failed to find is a clear description of which files the image looks for.

Or a reason why it does not bring its own ca-certificates and mount the host's certificates in addition, in case someone is using an internal CA.

johanneskastl commented 7 months ago

My guess is that the links inside the directories are messing things up:

$ ll /etc/ssl/
total 56K
lrwxrwxrwx. 1 root root  43 14. Jun 20:05 ca-bundle.pem -> ../../var/lib/ca-certificates/ca-bundle.pem
lrwxrwxrwx. 1 root root  33 14. Jun 20:05 certs -> ../../var/lib/ca-certificates/pem/
-rw-r--r--. 1 root root 412 17. Nov 20:28 ct_log_list.cnf
drwxr-xr-x. 1 root root   0 17. Nov 20:39 engdef.d/
drwxr-xr-x. 1 root root   0 17. Nov 20:39 engines.d/
-rw-r--r--. 1 root root 12K 17. Nov 20:39 openssl-1_1.cnf
-rw-r--r--. 1 root root 13K 17. Nov 20:28 openssl.cnf
-rw-r--r--. 1 root root 13K 17. Nov 20:29 openssl-orig.cnf
drwx------. 1 root root   0 17. Nov 20:28 private/
$ ll /etc/pki/
total 0
drwxr-xr-x. 1 root root 50 22. Nov 18:05 trust/
$ ll /etc/ca-certificates/
total 0
drwxr-xr-x. 1 root root 0 14. Jun 20:05 update.d/
$

I just changed the mount for /etc/ssl/ to mount the host's /var/lib/ca-certificates/pem/ directory to /etc/ssl/certs/ inside the controller, and the upgrade started and finished successfully.

brandond commented 7 months ago

Yeah, sounds like the symlinks outside the mounted path are breaking things.

The idea is that the host CA bundle is more likely to be up-to-date than the image, or (as you said) the update channel may not be trusted by public CA bundles.

As you noted, use of distros with non-standard filesystem layouts will require adjustments to the deployment manifest.

johanneskastl commented 7 months ago

As you noted, use of distros with non-standard filesystem layouts will require adjustments to the deployment manifest.

Sorry, but SLES15 has that, and this is ancient. So I am not sure how "non-standard" that is. ;-)

RHEL8 has a link from /etc/ssl/certs/ to /etc/pki/tls/certs/, but as that gets mounted separately it might work.

brandond commented 7 months ago

Yeah, the problem here is that content under /etc/ssl/... is linked to /var/lib/ca-certificates/.... /var/lib/ca-certificates is not a standard path and therefore isn't mounted by the default deployment manifest. The easiest fix would probably be to add /var/lib/ca-certificates as a mount.

You can see the paths expected by golang at https://go.dev/src/crypto/x509/root_linux.go

johanneskastl commented 7 months ago

Yes, that is exactly what I did. I added a kustomization (as there is unfortunately no helm chart for system-upgrade-controller) to patch the deployment and mount /var/lib/ca-certificates/pem/ to /etc/ssl/certs/

dweomer commented 4 months ago

The default manifest is quite simple but largely inclusive. As such, it makes for a pretty decent, broadly applicable example. I used to hate/fear Helm unreasonably at the time I started this project (now I hate Helm for good reasons, I assure you) and so I never developed a chart!

nate-duke commented 2 months ago

Any guidance on workarounds for this? I've tried making the files in the system path match what I expect is inside the container based on the error messages, but without a shell in the container it's down to guesswork.

dweomer commented 2 months ago

Any guidance on workarounds for this? I've tried making the files in the system path match what I expect is inside the container based on the error messages, but without a shell in the container it's down to guesswork.

SUC leverages the default TLS implementation that comes with the golang runtime, therefore it searches for trust store as indicated by:

If the host path mounts aren't working this is typically caused by:

So, if curl works on the host but not in the container, you've probably got a symlink problem.

There are a number of ways to fix this because the SUC manifests as provided are demonstrative and not authoritative: you have control over its runtime.