siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Worker node is stuck in grub rescue mode after talosctl reboot #9407

Closed: lwbt closed this issue 1 month ago

lwbt commented 1 month ago

Bug Report

Description

I ran talosctl reboot to reboot a worker node or the entire cluster. Worker nodes 1 and 2 were unable to reboot successfully, while the controller node rebooted just fine. Worker node 3 had a different issue, which I intended to solve with the reboot, but it was actually not connected. I used worker 3 to reproduce the issue; see details below.
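Since connectivity was part of the problem, a minimal per-node reachability check looks like this (a sketch; only 192.168.8.23 appears in this report, so substitute the other workers' IPs as needed, and an unreachable node simply returns an error):

# Sketch: query the Talos API on a node; an unreachable node fails with an error.
$ talosctl version --nodes 192.168.8.23 --endpoints 192.168.8.11 --talosconfig=./talosconfig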

Logs

Environment


$  talosctl reboot --nodes 192.168.8.23 --endpoints 192.168.8.11 --talosconfig=./talosconfig
◰ watching nodes: [192.168.8.23]
    * 192.168.8.23: 1 error(s) occurred:
    sequence error: sequence failed: error running phase 9 in reboot sequence: task 1/1: failed, error mounting partitions: error mounting /dev/nvme0n1p3: 1 error(s) occurred:
    error repairing: xfs_repair: exit status 1: 5
cleared inode 155
UUID mismatch on inode 156
cleared inode 156
UUID mismatch on inode 157
cleared inode 157
UUID mismatch on inode 158
cleared inode 158
UUID mismatch on inode 159
cleared inode 159
imap claims inode 160 is present, but inode cluster is sparse, correcting imap
imap claims inode 161 is present, but inode cluster is sparse, correcting imap
imap claims inode 162 is present, but inode cluster is sparse, correcting imap
imap claims inode 163 is present, but inode cluster is sparse, correcting imap
imap claims inode 164 is present, but inode cluster is sparse, correcting imap
imap claims inode 165 is present, but inode cluster is sparse, correcting imap
imap claims inode 166 is present, but inode cluster is sparse, correcting imap
imap claims inode 167 is present, but inode cluster is sparse, correcting imap
imap claims inode 168 is present, but inode cluster is sparse, correcting imap
imap claims inode 169 is present, but inode cluster is sparse, correcting imap
imap claims inode 170 is present, but inode cluster is sparse, correcting imap
imap claims inode 171 is present, but inode cluster is sparse, correcting imap
imap claims inode 172 is present, but inode cluster is sparse, correcting imap
imap claims inode 173 is present, but inode cluster is sparse, correcting imap
imap claims inode 174 is present, but inode cluster is sparse, correcting imap
imap claims inode 175 is present, but inode cluster is sparse, correcting imap
imap claims inode 176 is present, but inode cluster is sparse, correcting imap
imap claims inode 177 is present, but inode cluster is sparse, correcting imap
imap claims inode 178 is present, but inode cluster is sparse, correcting imap
imap claims inode 179 is present, but inode cluster is sparse, correcting imap
imap claims inode 180 is present, but inode cluster is sparse, correcting imap
imap claims inode 181 is present, but inode cluster is sparse, correcting imap
imap claims inode 182 is present, but inode cluster is sparse, correcting imap
imap claims inode 183 is present, but inode cluster is sparse, correcting imap
imap claims inode 184 is present, but inode cluster is sparse, correcting imap
imap claims inode 185 is present, but inode cluster is sparse, correcting imap
imap claims inode 186 is present, but inode cluster is sparse, correcting imap
imap claims inode 187 is present, but inode cluster is sparse, correcting imap
imap claims inode 188 is present, but inode cluster is sparse, correcting imap
imap claims inode 189 is present, but inode cluster is sparse, correcting imap
imap claims inode 190 is present, but inode cluster is sparse, correcting imap
imap claims inode 191 is present, but inode cluster is sparse, correcting imap
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
root inode lost
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 2
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
reinitializing realtime bitmap inode
reinitializing realtime summary inode
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
SB summary counter sanity check failed
Metadata corruption detected at 0xaaaac17deba0, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
SB summary counter sanity check failed
Metadata corruption detected at 0xaaaac17deba0, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
xfs_repair: Releasing dirty buffer to free list!
xfs_repair: Refusing to write a corrupt buffer to the data device!
xfs_repair: Lost a write to the data device!

fatal error -- File system metadata writeout failed, err=117.  Re-run xfs_repair.

[Video attachment: mpv_2024-10-01-010813_video0_00:09:18]
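(For anyone else landing in the grub rescue prompt: the Talos API is not reachable at that point, so the only recovery I can think of is re-flashing the boot medium from a fresh image. A hypothetical sketch follows; the image name and target device are placeholders, and for these boards the image would normally be built via the Image Factory with the sbc-raspberrypi overlay.)

# Hypothetical recovery sketch: write a freshly downloaded Talos image back to the boot medium.
# metal-arm64.raw.xz and /dev/sdX are placeholders for the actual image file and device.
$ xz -d metal-arm64.raw.xz
$ sudo dd if=metal-arm64.raw of=/dev/sdX bs=4M conv=fsync status=progress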

smira commented 1 month ago

This looks like disk corruption to me, or some other hardware issue.
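One hedged way to check for that (a sketch, assuming the node is still reachable over the Talos API) is to scan the kernel log for block-layer or filesystem errors:

# Sketch: grep the node's kernel log for I/O, NVMe, or XFS errors that would point at the hardware.
$ talosctl dmesg --nodes 192.168.8.23 --endpoints 192.168.8.11 --talosconfig=./talosconfig | grep -iE 'i/o error|nvme|xfs'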

lwbt commented 1 month ago

It obviously looks like it, but the filesystem where grub and normal.mod are stored is not XFS if I recall correctly (FAT/EXT4?). This is reproducible on Talos Linux only: I previously ran Ubuntu on these compute blades and never encountered such an issue.

Also note that the U-Boot logo shows up in the wrong colors. I tested v1.7 and v1.8 on both the compute blade and a regular Raspberry Pi 4 B; it happens on both devices only with v1.8, while v1.7 was fine. I was going to open an issue for that.

https://github.com/siderolabs/sbc-raspberrypi/issues/22

smira commented 1 month ago

The grub filesystem is xfs, and it is not even mounted during normal operations, so the corruption should have happened at the moment it was written.
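To confirm the partition layout and filesystem types on the node, something like the following should work (a sketch; discoveredvolumes is the block-device resource I recall from Talos 1.8, so treat the exact resource name as an assumption):

# Sketch: list the volumes Talos discovered on the node, including partition labels
# (EFI, BOOT, META, STATE, EPHEMERAL) and their filesystem types.
$ talosctl get discoveredvolumes --nodes 192.168.8.23 --endpoints 192.168.8.11 --talosconfig=./talosconfig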

The SBC support is a community-driven effort, see https://github.com/siderolabs/sbc-raspberrypi/

lwbt commented 1 month ago

I just tested with a regular Raspberry Pi 4 B and the issue was not reproducible there.

lwbt commented 1 month ago

I just upgraded from 1.8.0 to 1.8.1 and the issue is not reproducible any more for me.

Note: I forgot to update the client binary first, which resulted in an upgrade from 1.8.0 to 1.8.0. So maybe an upgrade, even to the same version number, helps when this occurs; a hedged sketch of the full sequence is below.
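For completeness, the sequence I mean, sketched (check versions first, then run the upgrade; the plain installer tag below is only an illustration, since for these SBCs the installer image would normally come from the Image Factory with the right overlay):

# Check which versions the talosctl client and the node are currently running.
$ talosctl version --nodes 192.168.8.23 --endpoints 192.168.8.11 --talosconfig=./talosconfig

# After updating the talosctl binary itself, upgrade the node to the intended release.
$ talosctl upgrade --nodes 192.168.8.23 --endpoints 192.168.8.11 --talosconfig=./talosconfig --image ghcr.io/siderolabs/installer:v1.8.1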

I'm closing this issue now.