siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Bare metal: XFS breaks on every power loss and gets stuck at the boot stage #9217

Closed Syntax3rror404 closed 2 months ago

Syntax3rror404 commented 2 months ago

Bug Report

After every power loss or forced reboot, XFS ends up corrupted and Talos hangs forever at the boot stage.

I tried this with several NVMe drives to make sure it does not depend on a dying drive, but forcing a reboot results in a total crash every time.

It happens every time, and only on the data disk, which is completely blank with no data on it.

How can I deal with this? This scenario can also occur in production.

I also tried ext4 as the filesystem, but it looks like Talos expects and allows only XFS for user disks. I don't know if this is expected: for example, if the device is formatted as ext4, Talos still searches for XFS superblocks. But that is another issue, or more of a feature request, I guess.

Edit: Graceful reboots via talosctl can also trigger this issue, but a hard reboot hits it with about a 99% chance. I tested again with different NVMe drives on the newest firmware and with 100% healthy SMART values, and I also tried a brand-new NVMe drive, so an issue with the drives themselves is extremely unlikely.

Description

I use Longhorn as the storage solution.

I have two NVMe drives per node:

  1. For the Talos installation
  2. For Longhorn, mounted at /var/lib/longhorn

I use them with the following machineconfig:

install:
    disk: /dev/nvme0n1
    image: factory.talos.dev/installer-secureboot/82d8dd500d44101247ba049512925aaebd7838ff7e52167c7f0fd496f7d0c06a:v1.7.6
    wipe: true
kubelet:
    image: ghcr.io/siderolabs/kubelet:v1.30.3
    defaultRuntimeSeccompProfileEnabled: true
    disableManifestsDirectory: true
    extraMounts:
        - destination: /var/lib/longhorn
          type: bind
          source: /var/lib/longhorn
          options:
              - bind
              - rshared
              - rw
disks:
    - device: /dev/nvme1n1
      partitions:
          - mountpoint: /var/lib/longhorn

Hardware:

CPU: AMD Ryzen 7 5700U
MEM: 64GB DDR4-3200 (CT2K32G4SFD832A)
NIC: 2.5Gbit/s (r8169)
DISK1 (system): 500GB NVMe, Samsung 980 PRO
DISK2 (data): 2TB NVMe, Western Digital Red SN700

Logs

user: warning: [2024-08-23T03:22:02.433991977Z]: [talos] task mountUserDisks (1/1): mountUserDisks failed, rebooting in 35 minutes.
user: warning: [2024-08-23T03:22:02.434027977Z]: error mounting "/dev/nvme1n1p1": error mounting: 1 error(s) occurred:
user: warning: [2024-08-23T03:22:02.434047977Z]: error repairing: xfs_repair: exit status 1: Phase 1 - find and verify superblock...
user: warning: [2024-08-23T03:22:02.434067977Z]: bad primary superblock - bad magic number !!!
user: warning: [2024-08-23T03:22:02.434094977Z]: attempting to find secondary superblock...
user: warning: [2024-08-23T03:22:02.434110977Z]: ..................................................
user: warning: [2024-08-23T03:22:02.434205977Z]: Exiting now.

Environment

smira commented 2 months ago

Try looking into the boot logs. My guess is that your NVMe drives get re-numbered on each boot, so you'd better use a disk selector.
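
For reference, a minimal sketch of such a selector in machine.install, following the fragment style of the config above (field names per the v1.7 config reference; the size and model values are placeholders, not taken from this report):

install:
    diskSelector:
        size: '<= 600GB'   # placeholder: match the smaller system drive
        model: 'Samsung*'  # placeholder: match by drive model glob
    image: factory.talos.dev/installer-secureboot/82d8dd500d44101247ba049512925aaebd7838ff7e52167c7f0fd496f7d0c06a:v1.7.6
    wipe: true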

Syntax3rror404 commented 2 months ago

> Try looking into the boot logs. My guess is that your NVMe drives get re-numbered on each boot, so you'd better use a disk selector.

Hi @smira

As far as I know from the documentation, disk selectors are only allowed and supported in the machine.install field and not in the machine.disks field, so there is no way to use a selector there. See: https://www.talos.dev/v1.7/reference/configuration/v1alpha1/config/#Config.machine.disks.

The disks get the same numbering on every boot, so that is not the source of this issue.

smira commented 2 months ago

You can use udevd-style symlinks for user disks. Other than that, I don't think it's even possible.
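
In machineconfig terms that would mean pointing the user disk at a stable symlink instead of the kernel name, along these lines (the by-id name below is a made-up placeholder; the real one has to be read from /dev/disk/by-id on the node):

disks:
    - device: /dev/disk/by-id/nvme-WD_Red_SN700_2000GB_XXXXXXXXXXXX   # placeholder ID, not from this report
      partitions:
          - mountpoint: /var/lib/longhorn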

Syntax3rror404 commented 2 months ago

I've also tried /dev/disk/by-id/ paths from the output of talosctl -n 1.2.3.4 list /dev/disk/by-id. By the way, this command can only be executed when the system is in running mode, so you can't use the --insecure flag to ask the system for the ID to put into the machineconfig. You have to install Talos first, then run this command, and then patch the system again.
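
A sketch of that workflow, using the example node IP from above and a hypothetical patch file named user-disk.yaml:

# only works once the node is fully up; --insecure is not accepted here
talosctl -n 1.2.3.4 list /dev/disk/by-id
# apply the machine.disks change containing the by-id path (user-disk.yaml is hypothetical)
talosctl -n 1.2.3.4 patch machineconfig --patch @user-disk.yaml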

An ext4 filesystem is also recognized as XFS. Is this a bug? I want to test whether this problem still exists when I change the filesystem on the data NVMe from XFS to ext4.

Interesting: the output of talosctl disks does not show the same UUID as /dev/disk/by-uuid. That's strange?!

In the end, I get the same error message with XFS if the system is rebooted unexpectedly. The only thing that helped was to wipe the disk, reformat the drive, and boot into Talos again; then it works. But as you can imagine, this makes it unusable.
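
For completeness, wiping only the user disk can be done with a reset along these lines (flag names as of recent talosctl versions; double-check against your version before running, since reset is destructive):

talosctl -n 1.2.3.4 reset --wipe-mode user-disks --user-disks-to-wipe /dev/nvme1n1 --reboot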

If this happens in production, it will be an extremely bad day.

I hope this issue gets addressed; Talos is by far the best OS out there for running Kubernetes. It is the OS everybody dreams of.

Syntax3rror404 commented 2 months ago

Nevermind... found the issue.