siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Upgraded Talos VM boots into previous version grub entry after reboot #9088

Closed mwheckmann closed 1 month ago

mwheckmann commented 1 month ago

Bug Report

NOTE: we experienced this on Azure, but it could happen on other clouds or even on metal servers.

The Talos GRUB default entry is not changed after an upgrade.

Description

A Talos Azure VM upgraded from 1.6.7 to 1.7.5 boots into the previous version's GRUB entry (1.6.7) after a reboot initiated by a cluster destroy. This causes the errors below in the logs, and the VM is stuck:

Logs

mlx5_core 91ec:00:02.0 eth1: Link up SUBSYSTEM=pci DEVICE=+pci:91ec:00:02.0
hv_netvsc 002248b1-7fdb-0022-48b1-7fdb002248b1 eth0: Data path switched to VF: eth1 SUBSYSTEM=vmbus DEVICE=+vmbus:002248b1-7fdb-0022-48b1-7fdb002248b1
hv_netvsc 002248b1-7fdb-0022-48b1-7fdb002248b1 eth0: Data path switched from VF: eth1 SUBSYSTEM=vmbus DEVICE=+vmbus:002248b1-7fdb-0022-48b1-7fdb002248b1
[talos] controller failed {"component": "controller-runtime", "controller": "network.LinkSpecController", "error": "1 error occurred:\n\t* error enslaving/unslaving link \"eth1\" under \"\": netlink receive: operation not supported\n\n"}
[talos] controller failed {"component": "controller-runtime", "controller": "config.AcquireController", "error": "failed to load config from STATE: unknown keys found during decoding:\nmachine:\n features:\n hostDNS:\n enabled: true # Enable host DNS caching resolver.\n"}

Also see: [image attached to the original issue; not reproduced here]

Environment

smira commented 1 month ago

This is definitely not what is happening here; rather, it is a pretty complicated set of interactions.

  1. Talos always validates the machine config when it gets applied, so there's no chance to get invalid configuration this way.
  2. Talos validates the machine configuration on upgrade/downgrade.

So a scenario like "default hasn't changed" isn't even possible (an invalid config would never make its way into 1.6.7).
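For readers unfamiliar with the "unknown keys found during decoding" error in the log, here is a minimal sketch of strict decoding. It is not Talos code: the `unknown_keys` helper and the two toy "schemas" are hypothetical, and only the `machine.features.hostDNS.enabled` path is taken from the log above (`rbac` is just a placeholder for a key both versions accept). The idea is that a config written for a newer version fails when an older version that does not know the field tries to load it.

```python
from typing import Any

# Hypothetical, heavily simplified "schemas": the keys each version understands.
# Only hostDNS comes from the log in this issue; "rbac" is a placeholder key.
SCHEMA_OLD = {"machine": {"features": {"rbac": None}}}
SCHEMA_NEW = {"machine": {"features": {"rbac": None, "hostDNS": {"enabled": None}}}}


def unknown_keys(config: dict[str, Any], schema: dict[str, Any], prefix: str = "") -> list[str]:
    """Return dotted paths of keys present in config but absent from schema."""
    found: list[str] = []
    for key, value in config.items():
        path = f"{prefix}{key}"
        if key not in schema:
            # The whole subtree is unknown; report it once at its root.
            found.append(path)
        elif isinstance(value, dict) and isinstance(schema[key], dict):
            found.extend(unknown_keys(value, schema[key], prefix=path + "."))
    return found


# A config that enables the newer feature, as in the STATE partition from the log.
config = {"machine": {"features": {"rbac": True, "hostDNS": {"enabled": True}}}}

print(unknown_keys(config, SCHEMA_NEW))  # []
print(unknown_keys(config, SCHEMA_OLD))  # ['machine.features.hostDNS']
```

Under these assumptions the newer schema accepts the config, while the older one reports `machine.features.hostDNS` as unknown, which mirrors the shape of the `config.AcquireController` error.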

The only way this can happen is the following:

Long story short: we probably need to prevent the automatic revert with Omni, as it only works well with more manual operations.

smira commented 1 month ago

Moved to https://github.com/siderolabs/omni/issues/509