@LarsBingBong Thanks for opening the issue. From the logs there seems to be a misconfigured kubelet certificate on the API node; according to the code, the kubelet on node 192.168.23.70 (the API node) should have its cert configured with the right IP SANs: https://github.com/k3s-io/k3s/blob/ce5b9347c928336cff13873d2ddeaaeb68d42322/pkg/server/router.go#L214-L232
Can you get the following information so we can investigate what's causing the issue:
I am not sure if the CCM is relevant here, since it only affects the --node-external-ip flag; the private IP should be set automatically.
It sounds like the 192.168.23.70 and 192.168.23.71 addresses are both present on the server, is that correct?
In addition to what @galal-hussein asked for, can you also provide the output of kubectl get node -o wide, a list of the IPs present on all of the nodes' interfaces, and any load-balancer/VIP address configured in your environment and what it's load-balancing to?
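For reference, a hedged sketch of commands that would gather that information on each node (paths and interface names are assumptions and may differ in your environment):
kubectl get node -o wide                      # node roles, versions, internal/external IPs
ip -4 -o addr show                            # every IPv4 on every interface of this host
ip route show default                         # which interface carries the default route
grep -A5 virtual_ipaddress /etc/keepalived/keepalived.conf 2>/dev/null   # VIP definition, if keepalived is in use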
I tested again with your great input. Much appreciated @brandond and @galal-hussein - and here's what I found when deploying a cluster with --disable-cloud-controller set again.
Indeed *.23.70 and *.23.71 are on the same node. *.23.70 is the API IP; it's a keepalived VIP - the API IP is handled via keepalived. That VIP can float between the master/control-plane nodes, so *.23.71, *.23.72 & *.23.73. And yes, the two IPv4s are assigned to the same NIC on the control-plane nodes.
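To illustrate, on whichever control-plane node currently holds the VIP, both addresses show up on the same interface. A quick hedged check (the interface name ens3 is only an assumption - substitute your own):
ip -4 -o addr show                                          # list every IPv4 per interface on this host
ip -4 addr show dev ens3 | grep -E '192\.168\.23\.(70|71)'  # on the VIP holder, both .70 and .71 appear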
kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
test-test-master-0 Ready control-plane,etcd,master 7m18s v1.23.8+k3s2 192.168.23.70 <none> Ubuntu 20.04.4 LTS 5.13.0-37-generic containerd://1.5.13-k3s1
test-test-master-1 Ready control-plane,etcd,master 5m v1.23.8+k3s2 192.168.23.72 <none> Ubuntu 20.04.4 LTS 5.13.0-37-generic containerd://1.5.13-k3s1
test-test-master-2 Ready control-plane,etcd,master 5m16s v1.23.8+k3s2 192.168.23.73 <none> Ubuntu 20.04.4 LTS 5.13.0-37-generic containerd://1.5.13-k3s1
test-test-worker-0 Ready worker 3m34s v1.23.8+k3s2 192.168.23.77 <none> Ubuntu 20.04.4 LTS 5.13.0-37-generic containerd://1.5.13-k3s1
test-test-worker-1 Ready worker 3m35s v1.23.8+k3s2 192.168.23.78 <none> Ubuntu 20.04.4 LTS 5.13.0-37-generic containerd://1.5.13-k3s1
test-test-worker-2 Ready worker 3m34s v1.23.8+k3s2 192.168.23.81 <none> Ubuntu 20.04.4 LTS 5.13.0-37-generic containerd://1.5.13-k3s1
Output of openssl x509 -in /k3s-data/server/tls/serving-kube-apiserver.crt -noout -text - I hope this is what you asked for/had in mind, @galal-hussein:
root@test-test-master-0:~# openssl x509 -in /k3s-data/server/tls/serving-kube-apiserver.crt -noout -text
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 7993918466325878850 (0x6ef01a7dd4ca1842)
Signature Algorithm: ecdsa-with-SHA256
Issuer: CN = k3s-server-ca@1658740937
Validity
Not Before: Jul 25 09:22:17 2022 GMT
Not After : Jul 25 09:22:17 2023 GMT
Subject: CN = kube-apiserver
Subject Public Key Info:
Public Key Algorithm: id-ecPublicKey
Public-Key: (256 bit)
pub:
04:be:0e:5a:8a:47:ae:be:28:64:1a:47:4d:c1:cd:
71:bd:dc:a1:c5:d6:03:19:42:36:2f:23:c4:37:25:
79:34:f6:6f:78:12:c3:c6:4e:c9:5f:f3:fc:16:7e:
c1:5a:da:20:fd:b6:e1:bf:68:0c:b3:dc:3c:bd:34:
51:7b:9f:ce:ba
ASN1 OID: prime256v1
NIST CURVE: P-256
X509v3 extensions:
X509v3 Key Usage: critical
Digital Signature, Key Encipherment
X509v3 Extended Key Usage:
TLS Web Server Authentication
X509v3 Authority Key Identifier:
keyid:99:48:6A:2C:6B:9F:D8:45:CA:13:76:A4:A4:6B:00:CE:13:45:D5:DA
X509v3 Subject Alternative Name:
DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:localhost, DNS:test-test-master-0, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1, IP Address:192.168.23.71, IP Address:10.43.0.1
Signature Algorithm: ecdsa-with-SHA256
30:46:02:21:00:f1:9d:85:4b:90:01:16:2a:05:92:85:24:1b:
8a:84:68:ba:42:38:f9:5a:43:21:c5:45:7d:5b:f5:2b:10:5a:
6f:02:21:00:a3:c6:71:e5:c5:7e:a5:ca:e2:28:88:12:b4:45:
99:60:71:84:17:95:95:87:4c:08:b9:31:60:fe:6b:1f:82:cd
root@test-test-master-0:~#
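Note that the certificate above is the kube-apiserver serving cert. The kubelet serving cert that the earlier router.go link refers to lives under the agent directory; a hedged way to inspect only its SANs (the path is taken from the kubelet arguments shown later in this thread, and -ext requires OpenSSL 1.1.1+):
openssl x509 -in /k3s-data/agent/serving-kubelet.crt -noout -ext subjectAltName
# On older OpenSSL, dump the full text and grep the SAN block instead:
openssl x509 -in /k3s-data/agent/serving-kubelet.crt -noout -text | grep -A1 'Subject Alternative Name'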
I now see that test-test-master-0 has its INTERNAL-IP set to 192.168.23.70. That of course does not work well with the cert.
I just checked kubectl get node -o wide on a K3s cluster running v1.22.5+k3s1, and there the *-master-0 node does not announce the API IP as its INTERNAL-IP.
And on the control-plane node that currently owns the keepalived floating VIP, the INTERNAL-IP is the node's own IP and not the API IP.
Is it because we now need to use --node-ip, perhaps combined with --advertise-address, to get this to work on K3s v1.23.8+k3s1 and newer, now that we're trying to disable the CCM - as we don't really need it in our current situation?
@galal-hussein how can I check, on a cluster using the CCM, how it's affecting the --node-external-ip flag? We're not configuring/setting this flag on any cluster.
Thank you 👍🏿
Had another run at deploying a K3s v1.23.8+k3s2 cluster with --disable-cloud-controller active. This time, however, I used --node-ip on each control-plane node in order to specify the IPv4 that the agent should announce.
That didn't make any change relative to the flags we normally use when deploying K3s. For good measure, the flags were now:
- "--node-ip=192.168.23.71"
- "--node-taint CriticalAddonsOnly=true:NoExecute"
- "--data-dir=/k3s-data"
- "--disable=coredns"
- "--disable-cloud-controller"
- "--disable-kube-proxy"
- "--disable=local-storage"
- "--disable-network-policy"
- "--disable=servicelb"
- "--disable=traefik"
- "--kube-apiserver-arg=audit-log-path=/var/lib/rancher/audit/audit.log"
- "--kube-apiserver-arg=audit-policy-file=/var/lib/rancher/audit/audit-policy.yaml"
- "--kube-apiserver-arg=audit-webhook-config-file=/var/lib/rancher/audit/webhook-config.yaml"
- "--kube-apiserver-arg=audit-log-maxage=30"
- "--kube-apiserver-arg=audit-log-maxsize=20"
- "--kube-apiserver-arg=audit-log-maxbackup=6"
# To be set to [true] so that the Cilium CNI & Falco can do their magic, and so that KubeVirt works on the clusters that need it.
- "--kube-apiserver-arg=allow-privileged=true"
Thanks
@LarsBingBong Thanks for providing the information. As for your question:
@galal-hussein how can I check, on a cluster using the CCM, how it's affecting the --node-external-ip flag? We're not configuring/setting this flag on any cluster.
The internal cloud provider we set only affects the node's addresses when --node-external-ip is set. To explain this: each node has an addresses field in its status, and that field contains entries of three types:
InternalIP
Hostname
ExternalIP
The kubelet on each node fills these in automatically from the node's IPs and hostname. The external address only ever gets assigned by a cloud provider, and in this case I don't think it's relevant to what you are trying to do.
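A quick way to see those entries directly - a plain kubectl sketch:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses}{"\n"}{end}'   # prints each node's recorded addresses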
Seeing the problem now, I think it's simply that the worker is trying to communicate with an IP that is not listed in the SANs. I would like to test something - can you configure the node with the following:
- "--node-ip=192.168.23.70"
- "--node-ip=192.168.23.71"
This way the kubelet's serving cert should be configured with both.
Hi @galal-hussein,
Okay sure. Interesting. Thank you for further elaborating.
In regards to using --node-ip with both IPv4s specified: it seems logical to me that --node-ip=192.168.23.70, AKA the API IP, needs to be configured on all control-plane nodes - as that IP is a floating VIP handled by keepalived and, in other words, drifts between them. Right?
Thank you
N.B. yes I'll surely try it out.
@LarsBingBong yes, for sure - I meant that with this configuration, --node-ip=x.x.x.70 is to be configured on all CP nodes.
I tried that ...
The first master comes up (the one with --cluster-init). Then the next two do not.
And here's the result from one of the failing masters:
Jul 26 20:18:47 test-test-master-1 systemd[1]: k3s.service: Failed with result 'exit-code'.
Jul 26 20:18:47 test-test-master-1 systemd[1]: Failed to start Lightweight Kubernetes.
Jul 26 20:18:52 test-test-master-1 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 394.
Jul 26 20:18:52 test-test-master-1 systemd[1]: Stopped Lightweight Kubernetes.
Jul 26 20:18:52 test-test-master-1 systemd[1]: Starting Lightweight Kubernetes...
Jul 26 20:18:52 test-test-master-1 sh[103324]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Jul 26 20:18:52 test-test-master-1 sh[103325]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Starting k3s v1.24.3+k3s1 (990ba0e8)"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Managed etcd cluster not yet initialized"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Reconciling bootstrap data between datastore and disk"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Running kube-apiserver --advertise-address=192.168.23.70 --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=https://kubernetes.default.svc.cluster.local,k3s --audit-log-maxage=30 --audit-log-maxbackup=6 --audit-log-maxsize=20 --audit-log-path=/var/lib/rancher/audit/audit.log --audit-policy-file=/var/lib/rancher/audit/audit-policy.yaml --audit-webhook-config-file=/var/lib/rancher/audit/webhook-config.yaml --authorization-mode=Node,RBAC --bind-address=127.0.0.1 --cert-dir=/k3s-data/server/tls/temporary-certs --client-ca-file=/k3s-data/server/tls/client-ca.crt --egress-selector-config-file=/k3s-data/server/etc/egress-selector-config.yaml --enable-admission-plugins=NodeRestriction --enable-aggregator-routing=true --etcd-cafile=/k3s-data/server/tls/etcd/server-ca.crt --etcd-certfile=/k3s-data/server/tls/etcd/client.crt --etcd-keyfile=/k3s-data/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --feature-gates=JobTrackingWithFinalizers=true --kubelet-certificate-authority=/k3s-data/server/tls/server-ca.crt --kubelet-client-certificate=/k3s-data/server/tls/client-kube-apiserver.crt --kubelet-client-key=/k3s-data/server/tls/client-kube-apiserver.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --profiling=false --proxy-client-cert-file=/k3s-data/server/tls/client-auth-proxy.crt --proxy-client-key-file=/k3s-data/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/k3s-data/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/k3s-data/server/tls/service.key --service-account-signing-key-file=/k3s-data/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/k3s-data/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/k3s-data/server/tls/serving-kube-apiserver.key"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Tunnel server egress proxy mode: agent"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Running kube-scheduler --authentication-kubeconfig=/k3s-data/server/cred/scheduler.kubeconfig --authorization-kubeconfig=/k3s-data/server/cred/scheduler.kubeconfig --bind-address=127.0.0.1 --kubeconfig=/k3s-data/server/cred/scheduler.kubeconfig --profiling=false --secure-port=10259"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Running kube-controller-manager --allocate-node-cidrs=true --authentication-kubeconfig=/k3s-data/server/cred/controller.kubeconfig --authorization-kubeconfig=/k3s-data/server/cred/controller.kubeconfig --bind-address=127.0.0.1 --cluster-cidr=10.42.0.0/16 --cluster-signing-kube-apiserver-client-cert-file=/k3s-data/server/tls/client-ca.crt --cluster-signing-kube-apiserver-client-key-file=/k3s-data/server/tls/client-ca.key --cluster-signing-kubelet-client-cert-file=/k3s-data/server/tls/client-ca.crt --cluster-signing-kubelet-client-key-file=/k3s-data/server/tls/client-ca.key --cluster-signing-kubelet-serving-cert-file=/k3s-data/server/tls/server-ca.crt --cluster-signing-kubelet-serving-key-file=/k3s-data/server/tls/server-ca.key --cluster-signing-legacy-unknown-cert-file=/k3s-data/server/tls/server-ca.crt --cluster-signing-legacy-unknown-key-file=/k3s-data/server/tls/server-ca.key --feature-gates=JobTrackingWithFinalizers=true --kubeconfig=/k3s-data/server/cred/controller.kubeconfig --profiling=false --root-ca-file=/k3s-data/server/tls/server-ca.crt --secure-port=10257 --service-account-private-key-file=/k3s-data/server/tls/service.key --use-service-account-credentials=true"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Node token is available at /k3s-data/server/token"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="To join node to cluster: k3s agent -s https://192.168.23.72:6443 -t ${NODE_TOKEN}"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Wrote kubeconfig /etc/rancher/k3s/k3s.yaml"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Run: k3s kubectl"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="certificate CN=test-test-master-1 signed by CN=k3s-server-ca@1658849056: notBefore=2022-07-26 15:24:16 +0000 UTC notAfter=2023-07-26 18:18:52 +0000 UTC"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="certificate CN=system:node:test-test-master-1,O=system:nodes signed by CN=k3s-client-ca@1658849056: notBefore=2022-07-26 15:24:16 +0000 UTC notAfter=2023-07-26 18:18:52 +0000 UTC"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Module overlay was already loaded"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Module nf_conntrack was already loaded"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Module br_netfilter was already loaded"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Module iptable_nat was already loaded"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Logging containerd to /k3s-data/agent/containerd/containerd.log"
Jul 26 20:18:52 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:52+02:00" level=info msg="Running containerd -c /k3s-data/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /k3s-data/agent/containerd"
Jul 26 20:18:53 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:53+02:00" level=info msg="Containerd is now running"
Jul 26 20:18:53 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:53+02:00" level=info msg="Running kubelet --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/k3s-data/agent/client-ca.crt --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=test-test-master-1 --kubeconfig=/k3s-data/agent/kubelet.kubeconfig --node-ip=192.168.23.70 --node-labels= --pod-infra-container-image=rancher/mirrored-pause:3.6 --pod-manifest-path=/k3s-data/agent/pod-manifests --read-only-port=0 --register-with-taints=CriticalAddonsOnly=true:NoExecute --resolv-conf=/run/systemd/resolve/resolv.conf --serialize-image-pulls=false --tls-cert-file=/k3s-data/agent/serving-kubelet.crt --tls-private-key-file=/k3s-data/agent/serving-kubelet.key"
Jul 26 20:18:53 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:53+02:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Jul 26 20:18:53 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:53+02:00" level=info msg="Handling backend connection request [test-test-master-1]"
Jul 26 20:18:53 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:53+02:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
Jul 26 20:18:57 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:57+02:00" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
Jul 26 20:18:58 test-test-master-1 k3s[103328]: time="2022-07-26T20:18:58+02:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
Jul 26 20:19:02 test-test-master-1 k3s[103328]: {"level":"warn","ts":"2022-07-26T20:19:02.335+0200","logger":"etcd-client","caller":"v3@v3.5.3-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cb8540/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Jul 26 20:19:02 test-test-master-1 k3s[103328]: time="2022-07-26T20:19:02+02:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Jul 26 20:19:02 test-test-master-1 k3s[103328]: time="2022-07-26T20:19:02+02:00" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
Jul 26 20:19:03 test-test-master-1 k3s[103328]: time="2022-07-26T20:19:03+02:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
Jul 26 20:19:07 test-test-master-1 k3s[103328]: time="2022-07-26T20:19:07+02:00" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
Jul 26 20:19:08 test-test-master-1 k3s[103328]: time="2022-07-26T20:19:08+02:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
Jul 26 20:19:12 test-test-master-1 k3s[103328]: time="2022-07-26T20:19:12+02:00" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
Jul 26 20:19:13 test-test-master-1 k3s[103328]: {"level":"warn","ts":"2022-07-26T20:19:13.390+0200","logger":"etcd-client","caller":"v3@v3.5.3-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00066ea80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Jul 26 20:19:13 test-test-master-1 k3s[103328]: time="2022-07-26T20:19:13+02:00" level=error msg="Failed to get member list from etcd cluster. Will assume this member is already added"
Jul 26 20:19:13 test-test-master-1 k3s[103328]: time="2022-07-26T20:19:13+02:00" level=info msg="Starting etcd to join cluster with members [test-test-master-0-b1b10f3e=https://192.168.23.70:2380 test-test-master-1-26b71735=https://192.168.23.70:2380]"
Jul 26 20:19:13 test-test-master-1 k3s[103328]: {"level":"info","ts":"2022-07-26T20:19:13.392+0200","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://127.0.0.1:2380","https://192.168.23.70:2380"]}
Jul 26 20:19:13 test-test-master-1 k3s[103328]: {"level":"info","ts":"2022-07-26T20:19:13.392+0200","caller":"embed/etcd.go:479","msg":"starting with peer TLS","tls-info":"cert = /k3s-data/server/tls/etcd/peer-server-client.crt, key = /k3s-data/server/tls/etcd/peer-server-client.key, client-cert=, client-key=, trusted-ca = /k3s-data/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
Jul 26 20:19:13 test-test-master-1 k3s[103328]: {"level":"info","ts":"2022-07-26T20:19:13.392+0200","caller":"embed/etcd.go:368","msg":"closing etcd server","name":"test-test-master-1-26b71735","data-dir":"/k3s-data/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://192.168.23.70:2379"]}
Jul 26 20:19:13 test-test-master-1 k3s[103328]: {"level":"info","ts":"2022-07-26T20:19:13.392+0200","caller":"embed/etcd.go:370","msg":"closed etcd server","name":"test-test-master-1-26b71735","data-dir":"/k3s-data/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://192.168.23.70:2379"]}
Jul 26 20:19:13 test-test-master-1 k3s[103328]: time="2022-07-26T20:19:13+02:00" level=fatal msg="ETCD join failed: listen tcp 192.168.23.70:2380: bind: cannot assign requested address"
Jul 26 20:19:13 test-test-master-1 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Jul 26 20:19:13 test-test-master-1 systemd[1]: k3s.service: Failed with result 'exit-code'.
Jul 26 20:19:13 test-test-master-1 systemd[1]: Failed to start Lightweight Kubernetes.
That is the K3s server retry startup loop, taken from journalctl -u k3s.service -f. Also note the full list of flags passed to the K3s server binary.
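For what it's worth, the fatal "bind: cannot assign requested address" above is what happens when etcd tries to listen on an address this host does not currently own - presumably because the keepalived VIP was held by another master at that moment. A hedged check on the failing node:
# If grep finds nothing, the address is not on this host and etcd cannot bind it:
ip -4 -o addr show | grep -F '192.168.23.70' || echo "192.168.23.70 is not present on this node"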
N.B. yes, this time I had a go with K3s v1.24.3+k3s1. The --flannel-backend=none flag was not previously documented in this issue. I hope the above can give us something.
Thanks
In regards to using --node-ip with both IPv4's specified. It seems logical to me that the --node-ip=192.168.23.70 AKA the API IP needs to be configured on all control-plane nodes - as that IP is floating/vip and handled by keepalived - in other words is drifting between them. Right?
for this configuration --node-ip=x.x.x.70 is to be configured on all cp nodes
I don't think this will work. Kubernetes does not support multiple nodes having the same internal or external IP address. If you're going to use a floating VIP with keepalived, Kubernetes needs to be essentially unaware of it. Don't use it for the internal or external IP, and make sure that it's not picked up by any address auto-selection. The only reference to it should be in the --tls-san option.
A better way to do this might be to simply configure a DNS alias that points at the active control-plane nodes, and use that as the fixed registration endpoint. Using keepalived just to support the fixed registration endpoint is probably overkill, as the nodes load-balance between servers using a client load-balancer once they're joined to the cluster. The registration endpoint (the --server address) is essentially unused after the initial join workflow is done.
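A minimal sketch of that layout - the DNS name k3s-api.example.internal is purely hypothetical, and only standard k3s flags are shown:
# Servers: reference the alias (and/or the VIP) only as an extra TLS SAN on the serving certs.
k3s server --cluster-init --tls-san k3s-api.example.internal --data-dir /k3s-data
# Agents and additional servers: use the alias purely as the fixed registration endpoint.
k3s agent --server https://k3s-api.example.internal:6443 --token "${NODE_TOKEN}"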
Hi @brandond,
Thank you for further elaborating.
How much of this is totally specific to K3s?
Before bumping into the issue described here we did not specify the API IP to Kubernetes in any way. It's "just" registered as an extra IPv4 on the same NIC as the one being assigned the node's external IPv4.
What mechanisms are in play in regards to the address auto-selection you mention? I tried --tls-san=192.168.23.70 when configuring the control-plane nodes and the error still occurred - I mention that in the initial post here.
We use a tool called kcli to deploy K3s and the underlying VMs, and he's talking about using kube-vip instead of keepalived. Do you see that as the way to go?
What's really interesting - I think - is that when we don't disable the CCM, things work. What's the lowdown on why that's the case?
Thank you very much.
How much of this is totally specific to K3s?
Not much of it. The kubelet has logic to detect the node's primary internal IP based on which interface the default route is associated with. External IPs pretty much always need to be set by an external integration.
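For reference, a rough way to see what that auto-detection would pick on a given node (the target IP below is arbitrary):
ip route show default    # interface associated with the default route
ip route get 1.1.1.1     # the "src" field is roughly what the kubelet auto-selects as the InternalIP when --node-ip is unset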
when we don't disable the CCM, things work.
The CCM is what's responsible for setting the node addresses based on the configured node-ip and node-external-ip values. If you disable it, the internal IP will be set, but the external IP will not. New nodes added to the cluster will also remain tainted as uninitialized due to lack of a cloud provider.
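A hedged way to spot that state - nodes stuck with the cloud-provider "uninitialized" taint show up like this:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}' | grep node.cloudprovider.kubernetes.io/uninitialized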
Under no circumstances should you ever have multiple nodes in the cluster with the same internal or external address. That is not an expected configuration.
Hi @brandond,
Again, thank you. Hmm, reading https://kubernetes.io/docs/concepts/architecture/cloud-controller/ leads me to think that I can't disable the CCM that comes out-of-the-box with K3s. Basically, what I'm trying to accomplish is having the CCM disabled, as I was of the belief that it isn't needed now that we're not using a cloud provider. If that belief is mistaken, so be it :-) ...
When things work we:
N.B.: Yes, it makes sense that having several nodes with the same IPv4 fails. I tried it because it was suggested.
I was of the belief that it isn't needed now that we're not using a cloud provider.
Kubernetes really, really expects to have a cloud provider active to handle configuring the nodes properly. There are many things the kubelet can't do for itself. In bare-metal environments, or other situations where you don't want to integrate with a "real" cloud provider such as AWS, GKE, or Azure, you need something like K3s' embedded stub cloud provider. Really, the only time you would ever want to disable it is when you're deploying a real cloud-provider chart instead.
What a n00b I've been here. But it's super great to learn this, even though it's been the hard way ;-). I had a dive into the cloudControllerManager func - https://github.com/k3s-io/k3s/blob/master/pkg/daemons/control/server.go#L299 - and yeah ... I can see that something different is going on at the low-level networking side. Good to know, and thank you for the patience! Much appreciated.
Clearly this issue can be closed as resolved.
Disabling the CCM was basically the root cause in this issue, as it had several negative side effects.
Thank you @brandond and @galal-hussein
Environmental Info: K3s Version:
Node(s) CPU architecture, OS, and Version:
Linux test-test-master-0 5.13.0-37-generic #42~20.04.1-Ubuntu SMP Tue Mar 15 15:44:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
Describe the bug: When deploying K3s v1.23.8+k3s2 or higher from the v1.23 channel we see certificate-related errors in the Cilium daemonset Pods as well as in the metrics-server that comes with K3s. This has the effect that e.g. cilium status ... considers the cluster not to be fully working, and that's a problem on the network-backend side. 192.168.23.70 is the API IP.
Steps To Reproduce:
flags used
CoreDNS is v1.9.3 and comes up successfully. Yes, we generate the NodeHosts section of the kube-dns/k3s-CoreDNS config right after having deployed the cluster, as the internal CoreDNS workload is disabled in the master args.
Expected behavior: Same behavior as on v1.23.6+k3s1 and below - in other words, that we don't see the wrong certificate served to workloads.
Actual behavior: It seems that that is indeed the case - workloads are served the wrong certificate, i.e. the one on port 10250 and not 6443. The cert on 6443 includes the API IP in its SANs and the one on 10250 does not, which can be seen by browsing e.g. 192.168.23.70:10250 and 192.168.23.70:6443 and then viewing the certificate details in the browser.
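A non-browser way to compare the two certificates - a hedged sketch using the IPs and ports from this report (-ext requires OpenSSL 1.1.1+):
# Print the SANs of the certificate served on each port (apiserver on 6443, kubelet on 10250).
for port in 6443 10250; do
  echo "== 192.168.23.70:${port} =="
  openssl s_client -connect 192.168.23.70:${port} </dev/null 2>/dev/null | openssl x509 -noout -ext subjectAltName
done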
Additional context / logs:
- "--tls-san=192.168.23.70"
set. That made no difference.And lo and behold. Now there's no error. Am I missing configuration needed in order to make it possible to disable the CCM? Isn't enough to specify
--disable-cloud-controller
to the masters? Or am I actually totally wrong in trying to disable the CCM if I'm not to replace it by some external CCM? I was of the impression that it isn't needed if we're not running on the cloud and therefore do not need specific integration to the cloud providers service plane. Am I wrong?Thank you very much.