Open smira opened 9 months ago
EUCLEAN, EINVAL from the mount syscall (more errors?).
Scenario: Talos boots up, mount() finishes successfully, but the filesystem is corrupted, so containerd fails to start, so the META key needs_repair is not removed, and on next reboot Talos will run xfs_repair.
xfs_repair.
-L.
Two PRs:
EINVAL to EUCLEAN (backport to 1.6)
needs_repair flag - 1.7 only.
Add EIO (-5) also
Handled in #8733
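The ideas in the description can be sketched roughly as follows. This is only one possible reading of the issue text, written in Python for illustration, and is not Talos's actual implementation. The functions boot and repair_with_fallback, the meta dictionary, and the mount, repair, start_services and run callables are all hypothetical stand-ins; retrying xfs_repair with -L after a plain run fails is likewise an assumption drawn from the fragments above. Note that -L zeroes the XFS log and can lose recent metadata changes.

```python
import errno

# errno.EUCLEAN exists only on Linux; 117 is its Linux value
# ("Structure needs cleaning").
EUCLEAN = getattr(errno, "EUCLEAN", 117)

# Assumption: these mount errors are treated as possible corruption.
CORRUPTION_ERRNOS = frozenset({EUCLEAN, errno.EINVAL, errno.EIO})


def boot(meta, mount, repair, start_services):
    """Simulate one boot cycle; every argument is a hypothetical stand-in."""
    # A flag left over from the previous boot means services never came up,
    # even though mount() may have succeeded back then.
    if meta.get("needs_repair"):
        repair()
    meta["needs_repair"] = True
    try:
        mount()
    except OSError as exc:
        if exc.errno not in CORRUPTION_ERRNOS:
            raise
        # The mount error hints at corruption: repair, then retry once.
        repair()
        mount()
    start_services()          # raises if, e.g., containerd cannot start
    del meta["needs_repair"]  # reached only on a healthy boot


def repair_with_fallback(run, device):
    """Try plain xfs_repair first, then retry with -L (zeroes the log)."""
    if run(["xfs_repair", device]) == 0:
        return True
    return run(["xfs_repair", "-L", device]) == 0
```

Here run is expected to execute a command and return its exit code, for example a thin wrapper around subprocess.call. Injecting it keeps the sketch free of side effects.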
I have unfortunately just run into this while doing a talosctl upgrade of a node from 1.6.1 to 1.6.7 :(.
I tried booting into a live CD and running xfs_repair, but that didn't seem to work. I also tried adding the -L flag afterwards, and that didn't seem to help either.
I am not sure how to proceed... I guess I wipe the node and start over?
First of all, it's better to submit the logs; otherwise we are shooting in the dark as to what kind of issue this is.
But yes, on broken hardware xfs might be corrupted beyond repair, so wiping the filesystem is the only way out.
If e.g. only /var is corrupted, and this is a worker or an HA controlplane node, a single partition can be wiped while preserving the rest:
talosctl reset -n NODE --system-labels-to-wipe=EPHEMERAL --reboot
Yeah, normally I would provide logs, but the machine was in a boot loop where xfs_repair failed and then the node would reboot. I couldn't grab them, but probably could have if I had been quicker.
I manually reset the node to maintenance mode via the GRUB menu and was able to rejoin the node to the cluster without issue. I am now updated to 1.7.4 so we'll see how it goes :)
Thanks @smira!
Bug Report
XFS partition got corrupted, but Talos didn't run xfs-repair.