grepler opened this issue 3 years ago
bumping this! just went through this experience, and followed this more or less. some of the things I ran into:

- the kubeconfig-in-cluster ConfigMap (with KubePrism enabled by default) will have its API server set to 127.0.0.1:7445, and you might be tempted to change the kubelet port to such.... doesn't work. I pointed it at the real control plane endpoint (172.16.1.100:6443) and this fixed a bunch of operators that were failing to talk to the control plane
- the CNI plugin binaries need to be present in /opt/cni/bin
- with serverTLSBootstrap: true you'll be running into some weird issues because the kubelet certificate doesn't have 127.0.0.1 / localhost as a SAN -- instead, generate your own CSR & upload it to the k8s API server etc. I used cfssl for this -- here's the CSR JSON I used:
{
  "hosts": [
    "stardust.net.hat.fo",
    "localhost",
    "127.0.0.1"
  ],
  "CN": "system:node:stardust.net.hat.fo",
  "names": [{"O": "system:nodes"}],
  "key": { "algo": "ecdsa", "size": 256 }
}
- if the 10.96.0.0/16 subnet isn't reachable (i.e. services aren't reachable), check the kube-proxy pod to see if it's erroring somewhere (a quick check is sketched below)

you could 100% automate the certificate provisioning to make it less painful, but meh.
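For reference, the manual path looks roughly like this (a sketch only; the object and file names are placeholders I picked, and csr.json is the JSON above):

# generate the key + CSR with cfssl, submit it to the cluster, approve it, fetch the signed cert
cfssl genkey csr.json | cfssljson -bare kubelet-server
cat <<EOF | kubectl create -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: kubelet-stardust
spec:
  request: $(base64 -w0 < kubelet-server.csr)
  signerName: kubernetes.io/kubelet-serving
  usages: ["digital signature", "key encipherment", "server auth"]
EOF
kubectl certificate approve kubelet-stardust
kubectl get csr kubelet-stardust -o jsonpath='{.status.certificate}' | base64 -d > kubelet-server.crt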
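And for the kube-proxy check mentioned in the list above (assuming the standard k8s-app=kube-proxy label):

# list kube-proxy pods and tail their logs for errors
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50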
I was trying to follow the above from @grepler & @hatf0. Huge props and I personally really appreciate the effort/work put in for filing this issue.
I'm frankly too stupid and/or impatient for the steps listed, though. On top of that, I didn't feel great about editing the systemd drop-ins, creating my own CSR, approving it in the cluster, etc. Luckily I found this gist by @kvaps: https://gist.github.com/kvaps/b9b6a8cc07b889a1f60bffc1ceba514d
I was able to tweak it for myself on a Debian/WSL2 machine. I did it this way because it let me avoid adjusting any of the out-of-the-box systemd drop-ins that were installed alongside the kubelet, and all the certs/keys within the copied files effectively worked out of the box after running the script.
Before running the script, I stopped the kubelet service (e.g. systemctl stop kubelet).
#!/bin/bash -e
# NOTE $VIP and $TARGET
# $VIP is the IP of a control node with talos installed
# $TARGET is the IP/hostname of the machine that you want to install these files to
# I personally set these via direnv, e.g.:
#
# # in .envrc
# source_up
# export VIP=<control plane IP>
# export TARGET=<target machine IP/hostname>
talosctl -n "$VIP" cat /etc/kubernetes/kubeconfig-kubelet > ./kubelet.conf
talosctl -n "$VIP" cat /etc/kubernetes/bootstrap-kubeconfig > ./bootstrap-kubelet.conf
talosctl -n "$VIP" cat /etc/kubernetes/pki/ca.crt > ./ca.crt
sed -i "/server:/ s|:.*|: https://${VIP}:6443|g" \
./kubelet.conf \
./bootstrap-kubelet.conf
clusterDomain=$(talosctl -n "$VIP" get kubeletconfig -o jsonpath="{.spec.clusterDomain}")
clusterDNS=$(talosctl -n "$VIP" get kubeletconfig -o jsonpath="{.spec.clusterDNS}")
# naive container runtime socket detection -- please adjust for your use case
socketPath="/var/run/containerd/containerd.sock"
if ! ssh "root@$TARGET" "ls $socketPath" &> /dev/null; then
  socketPath="/var/run/crio/crio.sock"
fi
echo "Using socket path: $socketPath"
cat > var-lib-kubelet-config.yaml <<EOT
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
clusterDomain: "$clusterDomain"
clusterDNS: $clusterDNS
runtimeRequestTimeout: "0s"
cgroupDriver: systemd # adjust if your node doesn't use the systemd cgroup driver
containerRuntimeEndpoint: unix://$socketPath
EOT
scp bootstrap-kubelet.conf root@$TARGET:/etc/kubernetes/bootstrap-kubelet.conf
scp kubelet.conf root@$TARGET:/etc/kubernetes/kubelet.conf
ssh root@$TARGET "mkdir -p /etc/kubernetes/pki"
scp ca.crt root@$TARGET:/etc/kubernetes/pki/ca.crt
scp var-lib-kubelet-config.yaml root@$TARGET:/var/lib/kubelet/config.yaml
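For completeness, a hypothetical invocation (the script name and addresses here are placeholders), in case you don't use direnv:

# example invocation without direnv
VIP=172.16.1.100 TARGET=node4.example.lan bash ./copy-talos-kubelet-config.sh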
After running the script and confirming that the files in the scp commands actually got copied over, validate that the copied paths match the ones referenced by your startup config/scripts/drop-ins. For example, on a system running systemd, I would check /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf.
If everything matches up, systemctl restart kubelet.
Yes, there are a bunch of SSH/SCP commands to the root user of the target node. I'm running this script on my main machine, and I added my public key to /root/.ssh/authorized_keys on $TARGET. Please forgive the sin, but this is a homelab that I never intend to expose to the world, and I frankly just wanted to get this running and move on. Hopefully this saves someone else time. And if anyone can confirm that this works on a Windows node (or something like it), please let us know! I might end up doing that, as I have some Windows machines that I would like to add.
@andrewrynhard I reached out to you on Reddit; this is more or less what I really needed. Apologies if talosctl already has something to do this for us.
Special thanks to @chr0n1x for providing the script! I tested it in Talos 1.7 with k8s 1.30.
But I would like to make a few points:
After adding a non-Talos node to the Talos cluster, automated Kubernetes upgrades will stop working, because talosctl will attempt to connect to it on port 50000 (the Talos API port).
Solution: follow the Manual Kubernetes Upgrade Guide.
$ talosctl -n 10.40.0.200 upgrade-k8s --to 1.30.1 --dry-run --pre-pull-images=false
automatically detected the lowest Kubernetes version 1.30.1
discovered controlplane nodes ["10.40.0.200" "10.40.0.201" "10.40.0.202"]
discovered worker nodes ["10.40.0.203"]
updating "kube-apiserver" to version "1.30.1"
> "10.40.0.200": starting update
> "10.40.0.201": starting update
> "10.40.0.202": starting update
updating "kube-controller-manager" to version "1.30.1"
> "10.40.0.200": starting update
> "10.40.0.201": starting update
> "10.40.0.202": starting update
updating "kube-scheduler" to version "1.30.1"
> "10.40.0.200": starting update
> "10.40.0.201": starting update
> "10.40.0.202": starting update
updating kube-proxy to version "1.30.1"
> "10.40.0.200": starting update
> skipped in dry-run
> "10.40.0.201": starting update
> skipped in dry-run
> "10.40.0.202": starting update
> skipped in dry-run
updating kubelet to version "1.30.1"
> "10.40.0.200": starting update
> "10.40.0.201": starting update
> "10.40.0.202": starting update
> "10.40.0.203": starting update
failed upgrading kubelet: error updating node "10.40.0.203": error watching service: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.40.0.203:50000: connect: connection refused"
It's also important to disable KubePrism, since it will not run on a non-Talos node; this prevents the default CNI (flannel) from starting there, because the Kubernetes API server address points to 127.0.0.1:7445.
E0809 12:48:40.398492 1 main.go:227] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-nrqph': Get "https://127.0.0.1:7445/api/v1/namespaces/kube-system/pods/kube-flannel-nrqph": dial tcp 127.0.0.1:7445: connect: connection refused
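For reference, one way to do that is a small machine config patch applied to each Talos node (a sketch; the file name is mine, and I believe the relevant key is machine.features.kubePrism.enabled):

# write a patch that disables KubePrism, then apply it (repeat for each Talos node)
cat > kubeprism-off.yaml <<'EOF'
machine:
  features:
    kubePrism:
      enabled: false
EOF
talosctl -n 10.40.0.200 patch machineconfig --patch @kubeprism-off.yaml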
After applying the machine config, you need to update the Kubernetes manifests.
$ talosctl -n 10.40.0.200 get manifests -o yaml | yq eval-all '.spec | .[] | splitDoc' - > manifests.yaml
$ kubectl diff -f manifests.yaml
diff -u -N /tmp/LIVE-1090927806/v1.ConfigMap.kube-system.kubeconfig-in-cluster /tmp/MERGED-2252738197/v1.ConfigMap.kube-system.kubeconfig-in-cluster
--- /tmp/LIVE-1090927806/v1.ConfigMap.kube-system.kubeconfig-in-cluster 2024-08-09 17:43:02.905627200 +0500
+++ /tmp/MERGED-2252738197/v1.ConfigMap.kube-system.kubeconfig-in-cluster 2024-08-09 17:43:02.906627204 +0500
@@ -5,7 +5,7 @@
 clusters:
 - name: local
   cluster:
-    server: https://127.0.0.1:7445
+    server: https://1.2.3.4:6443
     certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
 users:
 - name: service-account
$ kubectl apply --server-side -f manifests.yaml
Unfortunately, the daemonset/kube-flannel was not updated, and flannel still pointed to 127.0.0.1:7445. This might be a bug, or I might have done something wrong. In any case, I manually modified the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables, and flannel started working.
$ kubectl -n kube-system edit ds kube-flannel
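If you'd rather not edit the DaemonSet interactively, the same change can be made with kubectl set env (assuming flannel reads the API endpoint from these two variables, as in the error above; the address is a placeholder):

# point flannel at the real API server instead of the KubePrism address
kubectl -n kube-system set env daemonset/kube-flannel \
  KUBERNETES_SERVICE_HOST=1.2.3.4 KUBERNETES_SERVICE_PORT=6443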
After disabling KubePrism and correcting the Kubernetes API server IP and port, the non-Talos node started working successfully.
$ kubectl get no -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
test1 Ready control-plane 24d v1.30.1 10.40.0.200 <none> Talos (v1.7.4) 6.6.32-talos containerd://1.7.16
test2 Ready control-plane 24d v1.30.1 10.40.0.201 <none> Talos (v1.7.4) 6.6.32-talos containerd://1.7.16
test3 Ready control-plane 24d v1.30.1 10.40.0.202 <none> Talos (v1.7.4) 6.6.32-talos containerd://1.7.16
test4 Ready <none> 119m v1.30.3 10.40.0.203 <none> Rocky Linux 9.4 (Blue Onyx) 5.14.0-284.25.1.el9_2.x86_64 containerd://1.7.19
Update: you can easily deploy HAProxy on non-Talos nodes to stand in for KubePrism, e.g.
$ cat /etc/haproxy/conf.d/kubeprism.conf
frontend kubeprism
    mode tcp
    bind localhost:7445
    default_backend k8s_api

backend k8s_api
    mode tcp
    server lb 1.2.3.4:6443
I was following the steps here to add some k0s nodes to my Talos cluster, but then I ran into the issue where k0s by default wants to read the bootstrap config from a specially-named ConfigMap, worker-config-default-1.31 in my case.
In Talos 1.8.1, the API server by default starts with --enable-admission-plugins=NodeRestriction and --authorization-mode=Node,RBAC. The effect is that the NodeRestriction admission controller now prevents the kubelet from reading ConfigMaps.
A simple solution is to create a ClusterRole that allows reading ConfigMaps (or, even better, a RoleBinding limited to this specific ConfigMap), and a ClusterRoleBinding that binds the non-Talos system:node:<node_name> users to this role; a sketch follows below.
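A minimal sketch of that RBAC with kubectl create (the names are placeholders; scope it down to the specific worker-config ConfigMap with a Role/RoleBinding in kube-system if you can):

# broad read access to ConfigMaps for the non-Talos node user -- narrow this if possible
kubectl create clusterrole non-talos-node-cm-reader \
  --verb=get,list,watch --resource=configmaps
kubectl create clusterrolebinding non-talos-node-cm-reader \
  --clusterrole=non-talos-node-cm-reader \
  --user='system:node:<node_name>'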
After that, I create a ConfigMap in Talos, in the kube-system namespace, that contains the KubeletConfiguration, among some other things like the API server address, and start k0s as a worker.
It happily joins the cluster and shows up in kubectl get node, but is entirely invisible to Talos as far as I can tell, i.e. the machine counter on the talosctl dashboard does not increment.
Thanks for your write-up, it was very helpful. I created an Ansible collection to connect my non-Talos nodes (Raspberry Pi CM3+ nodes) to my Turing Pi 2 - RK1 (community supported) nodes.
But I'll soon abandon my use of Talos, unfortunately, as I need hardware acceleration support (using the Rockchip rknn and rkmpp) and I don't have the spare time to figure out how to add kernel modules and get them signed. I couldn't even get the DRBD module extension working.
However, I'd like to share my work, in case someone may find it useful: https://gitlab.com/agravgaard/ansible-collection-k8s/-/tree/Talos?ref_type=tags The project will diverge from here (which is why the link is to a tag) as I intend to use Kubespray instead.
Best of luck :)
Adding a Non-Talos Node to a Cluster
These are my personal walkthrough steps for adding a new non-Talos node to a Talos cluster. YMMV.
Description
I love the Talos setup process, but unfortunately I needed a node that could run an alternate runtime (sysbox), so I added a new Ubuntu 20.04 LTS node to the cluster created by Talos. Since the cluster was not created with kubeadm, I could not use the standard kubeadm join command. As such, these were the steps I took to set up a new node on the cluster:
Installation Steps
Install Requirements
First, install kubeadm, kubectl, and kubelet following the Kubernetes documentation for your distribution.
Install CRI-O
I wanted to use CRI-O, and getting the right version of CRI-O is pretty complicated at the moment; I've found that all the guides are slightly outdated.
However, the packages are regularly maintained by folks over at openSUSE, so go to the link below to get the appropriate version for your Kubernetes cluster. Note that you can edit the version in the URL to go directly to the appropriate package.
https://build.opensuse.org/package/show/devel:kubic:libcontainers:stable:cri-o:1.21/cri-o
Click the 'Download Package' button and then 'Add repository and install manually' to get an accurate, up-to-date list of commands to add the correct repository for the OS and OS version you are using.
You can then run the following to check that the correct package is being referenced.
sudo apt search cri-o
Take a snapshot of the machine and then install the runtime.
Get Bootstrap Token
Run the following:
kubeadm token create --print-join-command
Copy the token string. This will be used to update the bootstrap file on the new node in a later step. Note that these tokens are only valid for 24 hours.
Run Admin Join Command
Observe that the kubeadm join command fails:
kubeadm join 10.1.7.1:6443 --token cyNNNd.5ig...zxmq4 --discovery-token-ca-cert-hash sha256:ae6..ea65
This will likely fail; Talos OS support indicates that you should manually configure the kubelet. Proceed to the next steps.
We will be following this document: https://medium.com/@toddrosner/kubernetes-tls-bootstrapping-cf203776abc7
Retrieve the Talos Kubernetes Configuration
You can copy out the running Kubernetes configuration directory using:
talosctl -n 10.1.7.5 copy /etc/kubernetes - > talos_kubernetes.tar.gz
This will save an archive of the directory from the .5 worker to the current working directory. We want three files/folders from this archive:
Place the Files on the New Node
Copy these three files from the archive to the new node under /etc/kubernetes: bootstrap-kubeconfig, kubelet.yaml, and pki/ca.crt (one way to do this is sketched below).
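One way to get them there, assuming root SSH access to the new node (the paths below are placeholders; check the archive layout first):

tar tf talos_kubernetes.tar.gz          # inspect the archive layout
tar xf talos_kubernetes.tar.gz
ssh root@<new-node> 'mkdir -p /etc/kubernetes/pki'
# adjust the source paths to match whatever `tar tf` showed
scp kubernetes/bootstrap-kubeconfig kubernetes/kubelet.yaml root@<new-node>:/etc/kubernetes/
scp kubernetes/pki/ca.crt root@<new-node>:/etc/kubernetes/pki/ca.crt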
Instruct the kubelet to bootstrap TLS
Add the following line to kubelet.yaml:
serverTLSBootstrap: true
This is in line with: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/#renew-certificates-with-the-kubernetes-certificates-api
Update the Bootstrap Token
Update the bootstrap-kubeconfig file on the new node with the token that was just generated in the steps above.
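For example, assuming the copied bootstrap-kubeconfig stores the credential in a token: field (check yours first), something like:

# on a machine with admin credentials against the cluster: mint a fresh bootstrap token
NEW_TOKEN=$(kubeadm token create)
# on the new node: swap the token into the copied bootstrap kubeconfig
sed -i "s|token: .*|token: ${NEW_TOKEN}|" /etc/kubernetes/bootstrap-kubeconfig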
Update the Service Configuration
Since the kubelet service runs under systemd, we need to adjust the settings used by the service. This is what worked for me (note: I am using the sysbox runtime).
Installing kubeadm and kubectl adds a drop-in configuration file which we can adjust so that it aligns with the standard Talos file locations.
Edit the --bootstrap-kubeconfig, --kubeconfig, and --config flag paths to the settings below. Also add the flags needed to enable CRI-O to KUBELET_KUBECONFIG_ARGS. Once complete, it should look like this:
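(The exact contents aren't reproduced here; as a rough sketch, assuming the stock Debian/Ubuntu kubeadm drop-in and CRI-O's default socket path, the adjusted drop-in could look roughly like the following -- verify against your own systemctl cat kubelet output.)

# sketch only: an override drop-in under /etc/systemd takes precedence over the packaged one
cat > /etc/systemd/system/kubelet.service.d/10-kubeadm.conf <<'EOF'
[Service]
# Talos-style paths for the copied files; the kubelet writes its kubeconfig to kubelet.conf after bootstrapping
# (--container-runtime=remote was still needed on kubelets of that era; newer kubelets dropped the flag)
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubeconfig --kubeconfig=/etc/kubernetes/kubelet.conf --container-runtime=remote --container-runtime-endpoint=unix:///var/run/crio/crio.sock"
Environment="KUBELET_CONFIG_ARGS=--config=/etc/kubernetes/kubelet.yaml"
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
EOF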
Reload Services
Reload the systemd services to apply the changes:
systemctl daemon-reload
Then confirm that the changes were applied:
Get Service Definition from Systemd
You can confirm the service configuration settings with the following:
systemctl cat kubelet
Inside the cat results, look for the file referenced by the --bootstrap-kubeconfig flag.
Check Systemd Drop-Ins
Finally, confirm that the drop-in systemd extensions are being applied using:
systemd-delta --type=extended
Setup Sysbox Runtime
Now that we have the new node joined to the cluster, we can follow through with the installation of the Sysbox OCI runtime environment, to facilitate container virtual machines.
Following these instructions: https://github.com/nestybox/sysbox/blob/master/docs/user-guide/install-k8s.md
System Requirements for storage, etc.
Make sure your system has all the required dependencies for your applications. In my case, I needed to install the nfs client.
Install NFS Common Client Software
Ensure that the node is able to run the NFS client:
apt install nfs-common