k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Unable to get pod logs due to certificate IP mismatch with fetch URL #6205

Closed: cortices closed this issue 2 years ago

cortices commented 2 years ago

Environmental Info: K3s Version:
k3s version v1.24.4+k3s-7d66e419-dirty (7d66e419)
go version go1.18.5

Node(s) CPU architecture, OS, and Version:
Linux moria 5.15.62+truenas #1 SMP Mon Sep 12 18:25:31 UTC 2022 x86_64 GNU/Linux

Cluster Configuration: 1 server, administered via TrueNAS k3s Data dir in /mnt/Knapsack/ix-applications/k3s (default location for TrueNAS on chosen App pool).

Describe the bug: I'm unable to view logs for pods using k3s kubectl logs because a certificate error is returned and I can't work out why.

Steps To Reproduce: Run k3s kubectl logs <pod> for any pod on the node.

Expected behavior: Logs returned in terminal.

Actual behavior: This error is produced: Error from server: Get "https://192.168.0.6:10250/containerLogs/ix-synapse/synapse-ix-chart-6844f68c96-gbd9j/ix-chart": x509: certificate is valid for 127.0.0.1, 0.0.0.0, not 192.168.0.6
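The mismatch in the error above can be reproduced outside the cluster with a throwaway certificate carrying the same SAN list (a self-contained sketch; the cert generated here is a stand-in, not the node's real kubelet cert, and the IPs are taken from the error message):

```shell
# Create a throwaway self-signed cert with the SANs from the error
# (requires OpenSSL 1.1.1+ for -addext), then verify it against each IP.
tmpdir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout "$tmpdir/key.pem" -out "$tmpdir/cert.pem" -subj "/CN=test" \
  -addext "subjectAltName=IP:127.0.0.1,IP:0.0.0.0"

# 192.168.0.6 is not in the SAN list, so this fails like the kubectl error
openssl verify -CAfile "$tmpdir/cert.pem" -verify_ip 192.168.0.6 \
  "$tmpdir/cert.pem" || echo "IP mismatch, as expected"

# 127.0.0.1 is listed, so this verifies OK
openssl verify -CAfile "$tmpdir/cert.pem" -verify_ip 127.0.0.1 "$tmpdir/cert.pem"
```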

Additional context / logs: I have tried rotating my certs (as described in #6204), and restarting the k3s systemd service and the whole machine. The same issue arises. I have not specifically configured k3s to use the local network IP 192.168.0.6 to refer to the cluster, and it's not in the /etc/rancher/k3s/config.yaml file, so I'm not sure where this IP address is being selected from.

Config file /etc/rancher/k3s/config.yaml

I haven't modified this file from the default TrueNAS installation, but it may have issues?

cluster-cidr: 172.16.0.0/16
cluster-dns: 172.17.0.10
data-dir: /mnt/Knapsack/ix-applications/k3s
disable: []
kube-apiserver-arg:
- service-node-port-range=9000-65535
- enable-admission-plugins=NodeRestriction,NamespaceLifecycle,ServiceAccount
- audit-log-path=/var/log/k3s_server_audit.log
- audit-log-maxage=30
- audit-log-maxbackup=10
- audit-log-maxsize=100
- service-account-lookup=true
- feature-gates=MixedProtocolLBService=true
kube-controller-manager-arg:
- node-cidr-mask-size=16
- terminated-pod-gc-threshold=5
kubelet-arg:
- max-pods=250
node-ip: 0.0.0.0
protect-kernel-defaults: true
service-cidr: 172.17.0.0/16
brandond commented 2 years ago

This sounds similar to https://github.com/k3s-io/k3s/issues/6102#issuecomment-1240224958 - do you have a proxy server in your environment that K3s is going through?

cortices commented 2 years ago

Not that I am aware of. Services work normally, as I'd expect, on the local network. The "INTERNAL-IP" value from k3s kubectl get node -o wide is 192.168.0.6, which is the host's local IPv4 address.

brandond commented 2 years ago

Check openssl x509 -noout -text -in /var/lib/rancher/k3s/agent/serving-kubelet.crt on the node that it's reporting the error for. If the IP SANs on that cert don't match the entries that the log is complaining about (and I expect they won't, as we don't add 0.0.0.0 as an IP SAN) then the error is due to K3s connecting to something else - most likely a proxy. Check /etc/systemd/system/k3s.service.env and /etc/systemd/system/k3s.service on the server for any proxy-related environment variables.
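The two checks above can be run as a short script (paths are from brandond's comment; the cert inspection is guarded so it is a no-op on a machine without K3s, and the grep pattern for proxy variables is mine):

```shell
# 1) Inspect the kubelet serving cert's SANs on the affected node
CERT=/var/lib/rancher/k3s/agent/serving-kubelet.crt
[ -f "$CERT" ] && openssl x509 -noout -ext subjectAltName -in "$CERT"

# 2) Look for proxy-related environment variables in the k3s unit files
grep -ihE '(http|https|no)_proxy' \
  /etc/systemd/system/k3s.service.env \
  /etc/systemd/system/k3s.service 2>/dev/null \
  || echo "no proxy settings found"
```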

cortices commented 2 years ago

The SAN list sure has 0.0.0.0. Wonky. Reckon this is looking like TrueNAS's fault.

X509v3 Subject Alternative Name: 
                DNS:ix-truenas, DNS:localhost, IP Address:127.0.0.1, IP Address:0.0.0.0

I checked the systemd launcher and all it does is disable a number of features afaict.

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
        '--flannel-backend=none' \
        '--disable=traefik,metrics-server,local-storage' \
        '--disable-kube-proxy' \
        '--disable-network-policy' \
        '--disable-cloud-controller' \
        '--node-name=ix-truenas' \
        '--docker'
brandond commented 2 years ago

Hmm, that's not right. 0.0.0.0 doesn't make sense as a SAN. I see that you've also got flannel and the built-in cloud-controller disabled, so something else must be setting the node IPs incorrectly - that's normally the job of the embedded cloud-controller. I suspect that something's been changed to use 0.0.0.0 as the kubelet bind address, instead of the node's primary IP, and that's also replacing the address that the cert is valid for.
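For reference, on a stock K3s install the generic fix would be pointing node-ip at the host's routable address instead of the bind-all wildcard; a hypothetical config.yaml fragment using the address reported in this thread (here the file is TrueNAS-managed, so in practice the change has to go through the TrueNAS UI):

```yaml
# /etc/rancher/k3s/config.yaml (fragment, hypothetical fix)
# Use the node's actual routable address, not 0.0.0.0, so the
# kubelet serving cert gets the correct IP SAN.
node-ip: 192.168.0.6
```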

cortices commented 2 years ago

Closing this as it's definitely a TrueNAS bug. Setting a Static IP for the host and then selecting it as the cluster IP solved this.

PrivatePuffin commented 1 year ago

@cortices For future reference: Please try to report issues to your vendor/downstream (in this case iX-Systems/TrueNAS) first.