nix-community / disko

Declarative disk partitioning and formatting using nix [maintainer=@Lassulus]
MIT License

`mdadm` array not symlinked correctly after creation #705

Open szethh opened 1 month ago

szethh commented 1 month ago

I'm trying to set up a raid1 array using mdadm. The array is created correctly, but the symlink at /dev/md/arrayname that should point to it is not created.

These are the last log lines before the process fails:

mkfs.fat 4.2 (2021-01-31)
+ device=/dev/disk/by-partlabel/disk-_dev_sdb-nixos
+ name=nixos
+ type=mdraid
+ echo /dev/disk/by-partlabel/disk-_dev_sdb-nixos
+ level=1
+ metadata=default
+ name=nixos
+ type=mdadm
+ test -e /dev/md/nixos
+ readarray -t disk_devices
++ cat /tmp/tmp.YXB4lIUGsi/raid_nixos
+ echo y
++ wc -l /tmp/tmp.YXB4lIUGsi/raid_nixos
++ cut -f 1 -d ' '
+ mdadm --create /dev/md/nixos --level=1 --raid-devices=2 --metadata=default --force --homehost=any /dev/disk/by-partlabel/disk-_dev_sda-nixos /dev/disk/by-partlabel/disk-_dev_sdb-nixos
mdadm: array /dev/md/nixos started.
mdadm: timeout waiting for /dev/md/nixos
+ partprobe /dev/md/nixos
Error: Could not stat device /dev/md/nixos - No such file or directory.
+ rm -rf /tmp/tmp.YXB4lIUGsi
Connection to <ip> closed.

If I ssh into the host and run this same command (mdadm --create /dev/md/nixos --level=1 --raid-devices=2 --metadata=default --force --homehost=any /dev/disk/by-partlabel/disk-_dev_sda-nixos /dev/disk/by-partlabel/disk-_dev_sdb-nixos) manually, the array is created at /dev/md127, with a symlink from /dev/md/nixos pointing to it.

So running the command manually creates the symlink, but disko's script does not. In either case the array itself is created fine (below is the output of lsblk after either method):

root@rescue ~ # lsblk
NAME      MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
loop0       7:0    0  3.1G  1 loop  
sda         8:0    0  3.6T  0 disk  
|-sda1      8:1    0  512M  0 part  
`-sda2      8:2    0  3.6T  0 part  
  `-md127   9:127  0  3.6T  0 raid1 
sdb         8:16   0  3.6T  0 disk  
|-sdb1      8:17   0  512M  0 part  
`-sdb2      8:18   0  3.6T  0 part  
  `-md127   9:127  0  3.6T  0 raid1 
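To see which md node the kernel actually assigned to the array (md127 in the output above), one option is to read /proc/mdstat. The following is only a minimal sketch; the function name list_md_devices is hypothetical and not part of disko or mdadm.

```shell
# list_md_devices FILE
# Prints the name of every md array listed in a /proc/mdstat-style FILE,
# one per line. Array lines look like: "md127 : active raid1 sdb2[1] sda2[0]".
list_md_devices() {
  local mdstat_file=$1
  # Array lines have the device name in field 1 and a lone ":" in field 2.
  awk '$2 == ":" && $1 ~ /^md/ { print $1 }' "$mdstat_file"
}

# Example call (not executed here):
#   list_md_devices /proc/mdstat
```

If this prints a device while /dev/md/nixos is missing, the array exists and only the named symlink is absent, which matches the behaviour described in this report.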

Here's my disko config, although compared with other people's configs it seems to be fine.

{ lib, ... }: {
  disk = lib.genAttrs [ "/dev/sda" "/dev/sdb" ] (disk: {
    type = "disk";
    device = disk;
    content = {
      type = "gpt";
      partitions = {
        ESP = {
          size = "512M";
          type = "EF00";
          content = {
            type = "filesystem";
            format = "vfat";
            mountpoint = "/boot";
          };
          priority = 1;
        };
        nixos = {
          size = "100%";
          content = {
            type = "mdraid";
            name = "nixos";
          };
          priority = 2;
        };
      };
    };
  });
  mdadm = {
    nixos = {
      type = "mdadm";
      level = 1;
      content = {
        type = "filesystem";
        format = "ext4";
        mountpoint = "/";
      };
    };
  };
}

The disks are 2 SATA HDDs in a bare-metal Hetzner server, if that matters. I am using disko via nixos-anywhere.

Lassulus commented 1 month ago

hmm, the device usually is created through udev, if you have set boot.swraid.enable = true those rules should be present on the system. you could check if you have /etc/udev/rules.d/63-md-raid-arrays.rules present
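A minimal sketch of the check suggested above. Only /etc/udev/rules.d is named in this thread; the other directories in the example call are an assumption based on the standard udev search paths, and the function name find_md_rules is hypothetical.

```shell
# find_md_rules RULES_NAME DIR...
# Prints the first existing DIR/RULES_NAME and returns 0.
# Returns 1 if the file is not present in any of the given directories.
find_md_rules() {
  local rules_name=$1
  shift
  local dir
  for dir in "$@"; do
    if [ -e "$dir/$rules_name" ]; then
      printf '%s\n' "$dir/$rules_name"
      return 0
    fi
  done
  return 1
}

# Example call (not executed here). Directories other than /etc/udev/rules.d
# are assumed locations where a distribution package might install the rules:
#   find_md_rules 63-md-raid-arrays.rules \
#     /etc/udev/rules.d /run/udev/rules.d /usr/lib/udev/rules.d /lib/udev/rules.d
```

An empty /etc/udev/rules.d alone does not prove the rules are missing, since packaged rules may live in one of the other directories.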

szethh commented 1 month ago

nothing in /etc/udev/rules.d

but my problem is that udevadm is not even called, the error is on the line before (here).

the partprobe command is what kills my process, and udevadm runs in the line after

Lassulus commented 1 month ago

ah the udevadm communicates with the existing udev daemon, so it just waits for the rules to apply.

it's weird, is the rules directory completely empty? did you boot a live nixos image or does this use kexec with nixos-anywhere?

szethh commented 1 month ago

I'm using a modified version of nixos-anywhere. Hetzner's bare-metal servers do not seem to support kexec for some reason (i'm not a linux person so the reason escapes me).

Here's an issue about the whole deal nix-community/nixos-anywhere#346. To work around it, another user created this script (johanot/nixos-anywhere@2c804da) that skips installing kexec as part of nixos-anywhere.

So disko is running in Hetzner's rescue system.

edit: to answer your question, yes the rules directory is completely empty

Lassulus commented 1 month ago

ah, so my guess would be, that the hetzner rescue system does not use udev then, and for that reason the mdadm call fails. the device gets created though, but not at the expected location. Not sure how to fix that. Best way forward would be to fix kexec I guess

szethh commented 1 month ago

but the thing is that if i run that command myself (outside of the disko context, just copying and pasting that command), in that same rescue system, it works fine and both the array and the symlink are created as intended. that's what puzzles me and i don't really have enough linux/mdadm/disko knowledge to piece together why :/

but yeah agreed that fixing kexec would be the best. from what i've heard @johanot say in the other thread this seems to be an issue that popped up recently, but worked before.

Lassulus commented 1 month ago

hmm, weird, maybe it's a weird race condition in that case? but not sure
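One way to probe the race-condition theory is to poll for the device path for a few seconds instead of failing right away. This is only a diagnostic sketch, not something disko does; the function name wait_for_path is hypothetical. Note that the reporter says further down that sleeping before calling mdadm did not help, so a wait like this may well time out too.

```shell
# wait_for_path PATH ATTEMPTS
# Checks for PATH up to ATTEMPTS times, one second apart.
# Returns 0 as soon as PATH exists, 1 if it never appears.
wait_for_path() {
  local path=$1
  local attempts=$2
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ -e "$path" ]; then
      return 0
    fi
    i=$((i + 1))
    # Do not sleep after the final check.
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
  done
  return 1
}

# Example call (not executed here):
#   wait_for_path /dev/md/nixos 10 || echo "still no /dev/md/nixos"
```

If the path shows up within the wait, a race is likely. If it never shows up, nothing is creating the symlink at all.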

szethh commented 1 month ago

ended up forking the project https://github.com/szethh/disko/commit/770ff74e9bd45f725d603f64b9023db31e06bdcf.

i saw others run into this issue too (10 years ago...), but their fix of sleeping before calling mdadm did not work for me. the fix is simple: make the symlink myself.

the rest of the program went smoothly, and i was able to boot into a working nixos system :D
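The linked commit is not reproduced here. As a rough illustration of "make the symlink myself", the sketch below creates the named link only when it is missing. It assumes the array came up as /dev/md127, as in the lsblk output earlier in the thread; the function name ensure_md_symlink is hypothetical and the fork may do this differently.

```shell
# ensure_md_symlink LINK TARGET
# Creates LINK -> TARGET (and LINK's parent directory) unless LINK already
# exists, including as a dangling symlink. Safe to call more than once.
ensure_md_symlink() {
  local link=$1
  local target=$2
  if [ -e "$link" ] || [ -L "$link" ]; then
    return 0
  fi
  mkdir -p "$(dirname "$link")"
  ln -s "$target" "$link"
}

# Example call (not executed here; needs root on the rescue system, and
# /dev/md127 is an assumption taken from the lsblk output in this thread):
#   ensure_md_symlink /dev/md/nixos /dev/md127
```

Because the function does nothing when the link already exists, it should not interfere on systems where udev creates /dev/md/nixos itself.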