siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Talos tries (and fails) to apply invalid file patches #9550

Open PrivatePuffin opened 1 month ago

PrivatePuffin commented 1 month ago

Bug Report

Description

When we add files to a machine config like this:

  files:
    - path: "/etc/cri/conf.d/20-customization.part"
      permissions: 0
      content: |
        [plugins."io.containerd.grpc.v1.cri"]
          enable_unprivileged_ports = true
          enable_unprivileged_icmp = true
        [plugins."io.containerd.grpc.v1.cri".containerd]
          discard_unpacked_layers = false
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          discard_unpacked_layers = false

    - path: "/etc/nfsmount.conf"
      permissions: 420
      content: |
        [ NFSMount_Global_Options ]
        nfsvers=4.2
        hard=True
        noatime=True
        nodiratime=True
        rsize=131072
        wsize=131072
        nconnect=8

Notice the missing "operation" key.

On apply, Talos fails to write the files because the operation key is missing, which is logical, but it then throws the Talos system into a boot loop.
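For comparison, here is a minimal sketch of the same two entries with the key added. It assumes the missing key is the `op` field of `machine.files`, and that `create` and `overwrite` are the right operations for these two paths. Both choices are assumptions and should be checked against the configuration reference for the Talos version in use.

```yaml
machine:
  files:
    - path: "/etc/cri/conf.d/20-customization.part"
      op: create        # assumed: this file does not exist yet, so it is created
      permissions: 0
      content: |
        [plugins."io.containerd.grpc.v1.cri"]
          enable_unprivileged_ports = true
          enable_unprivileged_icmp = true
        [plugins."io.containerd.grpc.v1.cri".containerd]
          discard_unpacked_layers = false
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          discard_unpacked_layers = false

    - path: "/etc/nfsmount.conf"
      op: overwrite     # assumed: replaces a file that already exists
      permissions: 420  # decimal 420 is octal 0644
      content: |
        [ NFSMount_Global_Options ]
        nfsvers=4.2
        hard=True
        noatime=True
        nodiratime=True
        rsize=131072
        wsize=131072
        nconnect=8
```

The only difference from the failing config is the `op` line on each entry; everything else is copied unchanged.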

Expected Behavior

Any one of the following would have limited the scope of this issue:

A. Validate the file patches so that they are at least valid patches.
B. If a patch fails, don't keep rebooting and reapplying a file patch that is already known to be broken.
C. Revert the apply if a file patch is broken.

However, none of these happens, so we end up with a broken system instead.

Logs

[Screenshot: Screenshot_2024-10-22_at_19 24 16]

Environment

smira commented 1 month ago

The bug here is that the machine config validation is probably incomplete.

The workaround is to apply the previous machine config. Since apid is still running, this will fix the issue.

PrivatePuffin commented 1 month ago

> The bug here is that the machine config validation is probably incomplete.
>
> The workaround is to apply previous machine config, as apid is running, which will fix this issue

I'm aware of how to reverse the issue, no worries... These are test runs/test machines/test users :) I'm reporting this mostly to get it fixed more globally.