reclaim-the-stack / talos-manager

Rails application to help bootstrap Talos Linux on Hetzner servers
MIT License

Intended Talos Linux and k8s upgrade process #7

Closed treylade closed 3 months ago

treylade commented 3 months ago

Hello,

I have another question. I am still running Talos Linux v1.6.2 and Kubernetes v1.28.6, installed via Talos Manager on Hetzner Cloud. I am trying to figure out the most efficient way to upgrade the Talos and Kubernetes versions on the nodes without completely purging the cluster.

I tried to upgrade the Talos nodes manually with, for example:

talosctl upgrade --wait --debug --nodes control-plane-1 --image ghcr.io/siderolabs/installer:v1.6.7

I get the following error in the upgrade process:

Error: failed to probe bootloader: initrd: expected 1 match, got 0: set gfxmode=auto set gfxpayload=text linux /A/vmlinuz talos.config=https://abc-talos-manager-xyz.herokuapp.com/config?uuid=${uui

The output of /proc/cmdline is:

BOOT_IMAGE=/A/vmlinuz talos.config=https://abc-talos-manager-xyz.herokuapp.com/config?uuid= talos.platform=metal console=ttyS0 console=tty0 init_on_alloc=1 slab_nomerge pti=on consoleblank=0 nvme_core.io_timeout=4294967295 printk.devkmsg=on ima_template=ima-ng ima_appraise=fix ima_hash=sha512
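
In the output above, the uuid= value at the end of the talos.config URL is empty. A minimal sketch for checking this on other nodes, assuming a POSIX shell; the helper names talos_config_arg and has_empty_uuid are made up for illustration and are not part of talosctl or Talos Manager:

```shell
#!/bin/sh
# Hypothetical helpers, not part of talosctl or Talos Manager.

# Print the value of the talos.config= argument found in a kernel
# command line string passed as $1.
talos_config_arg() {
  printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^talos\.config=//p'
}

# Exit 0 when the talos.config URL ends in an empty "uuid=" value.
has_empty_uuid() {
  case "$(talos_config_arg "$1")" in
    *uuid=) return 0 ;;
    *) return 1 ;;
  esac
}
```

On a machine where /proc/cmdline is readable, something like has_empty_uuid "$(cat /proc/cmdline)" && echo "uuid is empty" would report the condition.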

Troubleshooting:

Question:

dbackeus commented 3 months ago

Hi,

The intention is that talosctl upgrade should work after bootstrapping with Talos Manager (BTW, don't forget to include the --preserve flag if you're using local persistent volumes).

We had exactly this bug in our own deployment about a year ago, back on Talos versions 1.2 and 1.3. We found that the uuid part of the talos.config URL had been cut off, which prevented the upgrade from going through. We haven't done any OS-level upgrades in a long time, so I'm not 100% sure whether the issue is fixed, but IIRC we didn't have any problems with our most recent upgrades.

In any case, here is how we worked around it back then:

  1. Run a privileged pod on the node you want to fix. E.g.:
kubectl run -n kube-system -i --rm --tty ubuntu --overrides='
{
  "apiVersion": "v1",
  "spec": {
    "nodeSelector": { "kubernetes.io/hostname": "<node-name>" },
    "tolerations": [{ "effect": "NoSchedule", "operator": "Exists" }],
    "containers": [
      {
        "name": "ubuntu",
        "image": "ubuntu:22.04",
        "args": ["bash"],
        "stdin": true,
        "stdinOnce": true,
        "tty": true,
        "securityContext": { "privileged": true }
      }
    ]
  }
}
'  --image=ubuntu:22.04 --restart=Never -- bash

(replace <node-name> with the name of your node)

  2. Mount the boot volume and fix the grub config:
mount /dev/nvme0n1p3 /mnt
vi /mnt/grub/grub.cfg
# make your changes and exit
umount /mnt
exit

If the config URL looks bad you can try fixing it (it should end with ?uuid=${uuid}). Or remove the talos.config statement completely if it still causes issues.
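
Before editing, it can help to keep a copy of the file and look at the exact talos.config line. A rough sketch, assuming the boot volume is already mounted at /mnt as in step 2; inspect_grub_cfg is a hypothetical helper name, and the stock ubuntu:22.04 image may not ship an editor such as vi (an assumption worth checking), so only cp and grep are used here:

```shell
#!/bin/sh
# Hypothetical helper: back up a grub.cfg next to the original and print
# every line (with its line number) that carries a talos.config= argument.
inspect_grub_cfg() {
  cfg=$1
  cp -p "$cfg" "$cfg.bak" || return 1
  grep -n 'talos\.config=' "$cfg"
}
```

Inside the privileged pod this would be called as inspect_grub_cfg /mnt/grub/grub.cfg, and the .bak copy can be moved back if an edit goes wrong.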

We'll probably look into upgrading our own Talos deployment in 1-2 months' time. I'll be sure to verify whether this is still a problem then and see what we can do to fix it. It might be a bug in Talos itself.

gunnars04 commented 3 months ago

There's always SaaS Talos Omni that takes care of auto updates for you: https://www.siderolabs.com/platform/saas-for-kubernetes/

But I do find the 6x node limit in the $10 hobby tier a bit useless (with 3x control planes). The next tier is $250/mo for 10x nodes, which is too pricey: https://www.siderolabs.com/pricing/

gunnars04 commented 3 months ago

Here's a Talos update video if it helps: https://www.youtube.com/watch?v=7fySw9TPqUU @dbackeus If/when you update, could you please update the docs? :)

treylade commented 3 months ago

@dbackeus Thank you for your quick response. I tried the described upgrade process on a fresh cluster created via Talos Manager using Talos Linux v1.6.7 and Kubernetes 1.29.3 (latest compatible versions).

Just to let you know: The same error persists when running talosctl upgrade.

treylade commented 3 months ago

> There's always SaaS Talos Omni that takes care of auto updates for you: https://www.siderolabs.com/platform/saas-for-kubernetes/
>
> But I do find the 6x node limit a bit useless (with 3x control planes). Next is $250/mo for 10x nodes which is too pricy for the hobby tier. https://www.siderolabs.com/pricing/

Thank you for the hint. The pricing doesn't work for my project in the current state.

dbackeus commented 3 months ago

Fixed by: https://github.com/reclaim-the-stack/talos-manager/commit/6b25db1b103a2f5ec5f41fce4c7059f22de6deb3

This should probably be reported as a bug in Talos since their documentation makes use of these query parameters.

But I'm limited on time and have confirmed that this fixes the issue so I'll just leave it at that for now.

@treylade if you want to unbreak your existing cluster, you can use the privileged container trick I mentioned above and just remove the ?uuid=${uuid} part from the URL. Though I think you'll want to run mount /dev/sda3 /mnt rather than mounting the nvme drive I referenced earlier, since you're using Hetzner Cloud rather than Hetzner dedicated servers.
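
For anyone who prefers not to edit the file by hand, the removal can be sketched with sed. This is an untested-on-real-hardware sketch, assuming the boot volume is mounted at /mnt and the query string starts with ?uuid= and runs up to the next space; strip_uuid_query is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: remove the "?uuid=..." query string from the
# talos.config= URL in a grub.cfg, keeping a .bak copy of the original.
strip_uuid_query() {
  cfg=$1
  cp -p "$cfg" "$cfg.bak" || return 1
  # In a basic regex "?" is a literal character. Everything from "?uuid="
  # up to the next space is dropped; the rest of the line is untouched.
  sed 's|\(talos\.config=[^ ?]*\)?uuid=[^ ]*|\1|' "$cfg.bak" > "$cfg"
}
```

In the privileged pod this would be run as strip_uuid_query /mnt/grub/grub.cfg after the mount and before umount /mnt. Comparing the result against the .bak copy before rebooting or upgrading is a sensible precaution.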