siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

xfs corruption, but no xfs_repair #8292

Open smira opened 7 months ago

smira commented 7 months ago

Bug Report

The XFS partition got corrupted, but Talos didn't run xfs_repair.

Description

Logs

09/02/2024 11:41:47 [talos] task mountEphemeralPartition (1/1): starting
09/02/2024 11:41:47 XFS (sda6): Mounting V5 Filesystem
09/02/2024 11:41:47 XFS (sda6): totally zeroed log
09/02/2024 11:41:48 XFS (sda6): Corruption warning: Metadata has LSN (90:781352) ahead of current LSN (1:0). Please unmount and run xfs_repair (>= v4.3) to resolve.
09/02/2024 11:41:48 XFS (sda6): log mount/recovery failed: error -22
09/02/2024 11:41:48 XFS (sda6): log mount failed
09/02/2024 11:41:48 [talos] task mountEphemeralPartition (1/1): failed: error mounting: 1 error(s) occurred:
09/02/2024 11:41:49  invalid argument
09/02/2024 11:41:50 [talos] phase ephemeral (8/17): failed
09/02/2024 11:41:50 [talos] boot sequence: failed

Environment

smira commented 7 months ago

Filesystem Corruption Detection

  1. Errors EUCLEAN, EINVAL from mount syscall (more errors?).
  2. If META key 'needs_repair' is set
    • set this key early on boot, and remove it once machine enters running & ready
    • user can set this key manually and reboot

Scenario: Talos boots up and mount() finishes successfully, but the filesystem is corrupted, so containerd fails to start. The META key needs_repair is therefore never removed, and on the next reboot Talos runs xfs_repair.
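
To make the detection rules and the scenario concrete, here is a minimal sketch in Python. It is not Talos code: the function names, the dict standing in for the META store, and the boot() simulation are all hypothetical. It only models the two triggers listed above (a mount error of EUCLEAN or EINVAL, or a needs_repair key left over from a boot that never reached ready).

```python
import errno

# Mount errors treated as signs of corruption, taken from the list above.
# Linux defines EUCLEAN as 117; that value is a fallback for platforms
# where Python does not expose the constant.
CORRUPTION_ERRNOS = {getattr(errno, "EUCLEAN", 117), errno.EINVAL}


def should_repair(mount_errno, meta):
    """Return True if the filesystem should be repaired.

    mount_errno: positive errno from a failed mount, or None on success.
    meta: dict standing in for the META key store (hypothetical).
    """
    if mount_errno is not None and mount_errno in CORRUPTION_ERRNOS:
        return True
    return meta.get("needs_repair", False)


def boot(meta, mount_errno, reached_ready):
    """Simulate one boot and return True if a repair would be run."""
    # Check the flag left by the previous boot before touching it.
    repair = should_repair(mount_errno, meta)
    # Set the key early on boot.
    meta["needs_repair"] = True
    if reached_ready:
        # Remove the key once the machine is running and ready.
        del meta["needs_repair"]
    return repair
```

With this model, a first boot where mount succeeds but the node never becomes ready leaves the key in place, so the following boot reports that a repair is needed even though mount returned no error.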

Filesystem Repair

  1. Try mounting the filesystem (temporarily) (to replay the XFS log) [ignore errors].
  2. Run xfs_repair.
  3. If it fails, go to step 1, but next time add -L.
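
The three steps above could be sketched roughly as follows. This is an illustration in Python, not what Talos does internally: repair_xfs and run are hypothetical names, the command runner is injectable so the flow can be exercised without touching a real disk, and I am assuming that "add -L" refers to the xfs_repair invocation on the second pass.

```python
import subprocess


def run(cmd):
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(cmd, check=False).returncode == 0


def repair_xfs(device, mountpoint, run=run):
    """Return True if xfs_repair succeeded on either attempt."""
    # First pass: plain xfs_repair. Second pass: xfs_repair -L (step 3).
    for extra in ([], ["-L"]):
        # Step 1: mount temporarily so the kernel can replay the XFS log.
        # A failed mount is ignored on purpose.
        if run(["mount", "-t", "xfs", device, mountpoint]):
            run(["umount", mountpoint])
        # Step 2: run xfs_repair on the unmounted device.
        if run(["xfs_repair", *extra, device]):
            return True
    return False
```

Note that -L zeroes the log, which can discard metadata updates that were never replayed, which is why it is only tried after a plain xfs_repair has failed.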

smira commented 7 months ago

Two PRs:

  1. Adds EINVAL handling alongside EUCLEAN (backport to 1.6).
  2. Adds the needs_repair flag (1.7 only).

frezbo commented 4 months ago

Add EIO (-5) also

frezbo commented 4 months ago

> Add EIO (-5) also

Handled in #8733

goproslowyo commented 3 months ago

I have unfortunately just run into this while running talosctl upgrade on a node from 1.6.1 to 1.6.7 :(

I tried booting into a live CD and running xfs_repair, but that didn't seem to work. I then tried adding the -L flag, and that didn't seem to help either.

I am not sure how to proceed... I guess I wipe the node and start over?

smira commented 3 months ago

First of all, it's better to submit the logs; otherwise we're guessing in the dark about what kind of issue this is.

But yes, on broken hardware xfs might be corrupted beyond repair, so wiping the filesystem is the only way out.

If e.g. only /var is corrupted, and this is a worker, or HA controlplane, a single partition can be wiped while preserving the rest:

talosctl reset -n NODE --system-labels-to-wipe=EPHEMERAL --reboot

goproslowyo commented 3 months ago

Yeah, normally I would provide logs, but the machine was in a boot loop where xfs_repair failed and the node then rebooted. I couldn't grab them, though I probably could have if I had been quicker.

I manually reset the node to maintenance mode via the GRUB menu and was able to rejoin the node to the cluster without issue. I am now updated to 1.7.4 so we'll see how it goes :)

Thanks @smira!