siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

[v1.8.0-alpha.1] Failed to load config from STATE: decode error #9000

Open samip5 opened 1 month ago

samip5 commented 1 month ago

Bug Report

Description

I tried running v1.8.0-alpha.1 on my development cluster, and something seems to be wrong with the config stored in STATE. Not sure if this is known, but reporting it anyway.

Logs

10/07/2024 16:28:41 [talos] mapped encrypted partition /dev/sda5 -> /dev/mapper/sda5-encrypted
10/07/2024 16:28:41 XFS (dm-0): Mounting V5 Filesystem dc82c2e9-79b9-4822-98da-3888cc1a5046
10/07/2024 16:28:41 XFS (dm-0): Starting recovery (logdev: internal)
10/07/2024 16:28:41 XFS (dm-0): Ending recovery (logdev: internal)
10/07/2024 16:28:41 [talos] task mountStatePartition (1/1): done, 19.55052758s
10/07/2024 16:28:41 [talos] phase mountSystem (9/11): done, 19.550712396s
10/07/2024 16:28:41 [talos] phase config (10/11): 1 tasks(s)
10/07/2024 16:28:41 [talos] task loadConfig (1/1): starting
10/07/2024 16:28:41 [talos] node identity established {"component": "controller-runtime", "controller": "cluster.NodeIdentityController", "node_id": "PqAgq3qcpXR14igOKIjuY0VG59VzWPmS4TPTFMBekbS"}
10/07/2024 16:28:44 [talos] controller failed {"component": "controller-runtime", "controller": "config.AcquireController", "error": "failed to load config from STATE: decode error: yaml: line 570: could not find expected ':'"}
10/07/2024 16:28:45 [talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [fdae:41e4:649b:9303::1]:8091: i/o timeout\""}
10/07/2024 16:28:47 [talos] configuring siderolink connection {"component": "controller-runtime", "controller": "siderolink.ManagerController", "peer_endpoint": "[2a01:<snip>:c012:559::1]:50180", "next_peer_endpoint": ""}
10/07/2024 16:28:47 [talos] siderolink connection configured {"component": "controller-runtime", "controller": "siderolink.ManagerController", "endpoint": "https://omni.<snip>.dev:8090/?grpc_tunnel=false&jointoken=ZZNiuKCtNR6SAsorVlL7uTcfjDo7S5wJbSVR17AjiKE", "node_uuid": "00d03114-0000-0000-0000-e45f011b3298", "node_address": "fdae:41e4:649b:9303:6dee:e1d2:4f99:a65c/64"}
10/07/2024 16:28:47 [talos] created new link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "siderolink", "kind": "wireguard"}
10/07/2024 16:28:47 [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "siderolink", "peers": 1}
10/07/2024 16:28:47 [talos] assigned address {"component": "controller-runtime", "controller": "network.AddressSpecController", "address": "fdae:41e4:649b:9303:6dee:e1d2:4f99:a65c/64", "link": "siderolink"}
10/07/2024 16:28:47 [talos] changed MTU for the link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "siderolink", "mtu": 1280}
10/07/2024 16:28:47 [talos] controller failed {"component": "controller-runtime", "controller": "config.AcquireController", "error": "failed to load config from STATE: decode error: yaml: line 570: could not find expected ':'"}
10/07/2024 16:28:52 [talos] controller failed {"component": "controller-runtime", "controller": "config.AcquireController", "error": "failed to load config from STATE: decode error: yaml: line 570: could not find expected ':'"}
10/07/2024 16:28:53 [talos] adjusting time (slew) by -187.137µs via 10.0.110.1, state TIME_OK, status STA_PLL | STA_NANO {"component": "controller-runtime", "controller": "time.SyncController"}
10/07/2024 16:29:00 [talos] controller failed {"component": "controller-runtime", "controller": "config.AcquireController", "error": "failed to load config from STATE: decode error: yaml: line 570: could not find expected ':'"}
10/07/2024 16:29:06
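
The repeated failures in the log above can be pulled out with a short script. This is a minimal sketch, assuming the line layout seen in this report (date, time, `[talos]`, the text `controller failed`, then a JSON object); the function names `controller_failures` and `summarize` are hypothetical and not part of Talos.

```python
import json
import re
from collections import Counter

# Assumed layout, taken from the log lines in this report:
# <date> <time> [talos] controller failed {<JSON payload>}
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[talos\] controller failed (?P<payload>\{.*\})\s*$"
)


def controller_failures(lines):
    """Yield (time, controller, error) for each 'controller failed' line."""
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a controller failure line
        try:
            payload = json.loads(match.group("payload"))
        except json.JSONDecodeError:
            continue  # truncated or malformed payload
        yield match.group("time"), payload.get("controller"), payload.get("error")


def summarize(lines):
    """Count how often each (controller, error) pair occurs."""
    return Counter((ctrl, err) for _, ctrl, err in controller_failures(lines))


# Three lines copied from the log above.
SAMPLE = r'''
10/07/2024 16:28:41 [talos] task loadConfig (1/1): starting
10/07/2024 16:28:44 [talos] controller failed {"component": "controller-runtime", "controller": "config.AcquireController", "error": "failed to load config from STATE: decode error: yaml: line 570: could not find expected ':'"}
10/07/2024 16:28:47 [talos] controller failed {"component": "controller-runtime", "controller": "config.AcquireController", "error": "failed to load config from STATE: decode error: yaml: line 570: could not find expected ':'"}
'''

if __name__ == "__main__":
    for (controller, error), count in summarize(SAMPLE.splitlines()).items():
        print(f"{count}x {controller}: {error}")
```

Grouping by controller and error makes it easier to see that `config.AcquireController` keeps retrying with the same YAML decode error, while the other failure in the log is a separate network timeout.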

Environment

smira commented 1 month ago

This looks like a config file that can't be parsed. I'm not sure how this machine got into this state, as Talos doesn't accept invalid configuration.

If you have a way to reproduce, happy to look into it.

samip5 commented 1 month ago

> If you have a way to reproduce, happy to look into it.

Not sure, but I installed it using Omni v0.39.0 with pre-releases enabled, and created the cluster as normal through Omni. It ended up in that state, but as I cannot easily check the contents of the state partition, I don't know how one can debug such a thing.

I have since reverted to a stable release instead.
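
One way to narrow down a decode error like the one in the logs is to look at the text around the reported line. This is a minimal sketch, assuming a copy of the machine config is available outside the node (the thread does not say where such a copy would come from); `reported_line` and `context_around` are hypothetical helper names. YAML parsers often report the position where parsing gave up, so the actual mistake may sit a few lines before the reported one.

```python
import re


def reported_line(error):
    """Extract the 1-based line number from a 'yaml: line N:' error message."""
    match = re.search(r"yaml: line (\d+):", error)
    return int(match.group(1)) if match else None


def context_around(text, line_no, radius=2):
    """Return the lines near a 1-based line number, each prefixed with its number."""
    lines = text.splitlines()
    start = max(line_no - radius, 1)
    end = min(line_no + radius, len(lines))
    out = []
    for n in range(start, end + 1):
        marker = ">>" if n == line_no else "  "  # mark the reported line
        out.append(f"{marker} {n:4d}: {lines[n - 1]}")
    return out


if __name__ == "__main__":
    error = (
        "failed to load config from STATE: decode error: "
        "yaml: line 570: could not find expected ':'"
    )
    line_no = reported_line(error)
    # Placeholder document; substitute the real config text here.
    config_text = "\n".join(f"key{i}: value{i}" for i in range(1, 601))
    for row in context_around(config_text, line_no):
        print(row)
```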

samip5 commented 1 month ago

I may have a hunch as to the cause. The upgrade process skips validation if the machine config is missing, and I had forgotten to add talos.board=rpi_generic to the kernel args when booting off the network.

smira commented 1 month ago

> I may have a hunch as to the cause. The upgrade process skips validation if the machine config is missing, and I had forgotten to add talos.board=rpi_generic to the kernel args when booting off the network.

hmm... If you are booting things your own way, it might be something like that. talos.board is not used in Talos 1.7+.

samip5 commented 1 month ago

> talos.board is not used in Talos 1.7+.

It is when it tries to upgrade an RPi and fails, because it otherwise makes assumptions that are not correct. :) The generic installer cannot be used for such a case, or at least it errors out when talos.board is set to rpi_generic.

samip5 commented 1 month ago

> If you are booting things your own way, it might be something like that

I'm using matchbox, which, last I saw, was a supported way of booting Talos. :)