Closed Syntax3rror404 closed 2 months ago
Try looking into the boot logs. My guess is that your NVMe drives get renumbered on each boot, so you would be better off using a disk selector.
Hi @smira
As far as I know from the documentation, disk selectors are only allowed and supported in the machine.install[] field, not in the machine.disks[] field, so there is no way to use a selector there. See: https://www.talos.dev/v1.7/reference/configuration/v1alpha1/config/#Config.machine.disks.
The disks get the same numbers on every boot, so this is not the source of the issue.
You can use udevd-style symlinks for user disks. Other than that, I don't think it's even possible.
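A minimal sketch of what that could look like in the machine config, assuming the data disk is referenced through a /dev/disk/by-id symlink. The device name and the mountpoint below are hypothetical placeholders, not values from this report; the real symlink name has to be taken from the node itself.

```yaml
machine:
  disks:
    # Hypothetical by-id symlink; replace with the name listed on your node.
    - device: /dev/disk/by-id/nvme-EXAMPLE-DATA-DISK
      partitions:
        # Hypothetical mountpoint for the data disk.
        # No size is given, so the partition should take the whole disk.
        - mountpoint: /var/mnt/longhorn
```

The idea is that the by-id path stays stable even if the kernel enumerates the NVMe devices in a different order on the next boot.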
I've also tried /dev/disk/by-id/, using the output of talosctl -n 1.2.3.4 list /dev/disk/by-id.
By the way, this command can only be executed when the system is in running mode, so you can't use the --insecure flag to ask the system for the ID to put in the machineconfig. You have to install Talos first, run this command, and then patch the system again.
An ext4 filesystem is also recognized as XFS. Is this a bug? I wanted to test whether this problem still exists when I change the filesystem on the data NVMe from XFS to ext4.
Interestingly, the output of talosctl disks
does not show the same UUID as /dev/disk/by-uuid. That's strange.
But in the end, I get the same error message with XFS if the system is rebooted unexpectedly. The only thing that helped was to wipe the disk, reformat the drive, and then boot into Talos again; then it works. But as you know, this makes it unusable.
If this happens in production, it will be an extremely bad day.
I hope this issue gets addressed. Talos is by far the best OS out there for running Kubernetes. It is the OS everybody dreams of.
Nevermind .... Found the issue
Bug Report
On every power loss or forced reboot, XFS dies and Talos hangs forever in the boot stage.
I tried it with several NVMe drives to make sure this does not depend on a dead drive, but every forced reboot results in a total crash.
Every time it only affects the data disk, which is absolutely blank with no data on it.
How can I deal with this? In production there is also a chance of this scenario.
I also tried it with ext4 as the filesystem, but it looks like Talos expects and allows only XFS on user disks. I don't know if this is expected; for example, if the device is ext4-formatted, Talos still searches for the XFS superblocks. But this is another issue, or more of a feature request, I guess.
Edit: Graceful reboots with talosctl can also cause this issue, but a hard reboot has a 99% chance of hitting it. I tested it again with different NVMe drives with the newest firmware and 100% healthy SMART values. I also tried a brand-new NVMe drive. So an issue with the drives is extremely unlikely.
Description
I use Longhorn as the storage solution.
I have two NVMe drives per node.
I use it with the following machineconfig:
Hardware:
CPU: Ryzen 5700U
MEM: 64GB 3200MHz CT2K32G4SFD832A
NIC: 2.5 gig/s r8169
DISK1: NVME 500GB (SYSTEM) Samsung 980 PRO
DISK2: NVME 2TB (DATA) Western Digital Red SN700
Logs
Environment