openzfs / openzfs-docs

OpenZFS Documentation
https://openzfs.github.io/openzfs-docs/

Broken NixOS setup as a result of now removed mirrored boot setup docs #531

Open FrostKiwi opened 4 days ago

FrostKiwi commented 4 days ago

I followed this repo's Root on ZFS guide for NixOS, back on NixOS 22.11, to create a setup in which 2 SSDs are mirrored for a redundant boot drive. At the time this caused very weird issues ( https://github.com/NixOS/nixpkgs/issues/214871 ), which were resolved by the updates in https://github.com/openzfs/openzfs-docs/commit/1211e98faf1f37af1de5eb8f3ce0a1c87f71a0e6 by @gmelikov, as reported in https://github.com/openzfs/openzfs-docs/pull/383. The setup then ran fine for me for a very long time with the following config, as per the Root on ZFS docs:

{ config, pkgs, ... }:

{
  networking.hostId = "XXXX";
  boot = {
    supportedFilesystems = [ "zfs" ];
    kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages;
    loader = {
      efi = {
        efiSysMountPoint = "/boot/efis/nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966XXXX-part1";
        canTouchEfiVariables = true;
      };
      generationsDir.copyKernels = true;
      grub = {
        efiInstallAsRemovable = false;
        enable = true;
        copyKernels = true;
        efiSupport = true;
        zfsSupport = true;
        extraInstallCommands = ''
ESP_MIRROR=$(${pkgs.coreutils}/bin/mktemp -d)
${pkgs.coreutils}/bin/cp -r ${config.boot.loader.efi.efiSysMountPoint}/EFI $ESP_MIRROR
for i in /boot/efis/*; do
 ${pkgs.coreutils}/bin/cp -r $ESP_MIRROR/EFI $i
done
${pkgs.coreutils}/bin/rm -rf $ESP_MIRROR
'';
        devices = [
          "/dev/disk/by-id/nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966XXXX"
          "/dev/disk/by-id/nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966XXXX"
        ];
      };
    };
  };
  users.users.root.initialHashedPassword = "XXXX";
}

During the update to NixOS 24.05, this setup broke the update process. All packages were rebuilt and services restarted, but the final step (updating the GRUB menu) failed. I would guess this leaves the system in an unbootable state, though I haven't tried to reboot yet.

$ nixos-rebuild switch --upgrade
unpacking channels...
building Nix...
building the system configuration...
updating GRUB 2 menu...
/nix/store/mr63za5vkxj0yip6wj3j9lya2frdm3zc-coreutils-9.5/bin/cp: cannot stat '/boot/efis/nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966732E-part1/BOOT': Too many levels of symbolic links
/nix/store/mr63za5vkxj0yip6wj3j9lya2frdm3zc-coreutils-9.5/bin/cp: cannot stat '/boot/efis/nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966732E-part1/NixOS-boot-efis-nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966731K-part1': Too many levels of symbolic links
warning: error(s) occurred while switching to the new configuration

These instructions were deleted in commits cc6d72c02db6f36136be3f4b7ae273b8271333a7 and 1211e98faf1f37af1de5eb8f3ce0a1c87f71a0e6, with the commit message:

Previously we used a bind mount from /boot/efis/*-part1 to /boot/efi to facilitate bootloader configuration. Recent reports indicate that this bind mount prevents the system from booting. This pull request removes the bind mount.

Now the Root on ZFS docs just say "Format and mount ESP. Only one of them is used as /boot, you need to set up mirroring afterwards", with no new documentation on how to set up that mirroring. The documentation also says:

If you have a bug report or feature request related to this HOWTO, please file a new issue and mention @ne9z.

But that user has been deleted, so I assume this was the handle of Maurice Zhou <ja@apvc.uk>.

What would be appropriate steps to migrate this setup? People on the NixOS Discord recommended that I look into boot.loader.grub.mirroredBoots, which seems to support the mirroring that was previously implemented by the bash snippet in extraInstallCommands.
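For reference, here is a rough sketch of what a mirroredBoots-based config might look like. It is untested and not taken from the OpenZFS docs. It assumes an EFI-only setup where each ESP stays mounted under /boot/efis/ and GRUB and the kernels are installed onto every ESP. The names DISK1 and DISK2 are hypothetical placeholders for the real by-id names, and I am assuming the top-level devices list, efiSysMountPoint and extraInstallCommands from the config above would be dropped.

```nix
{ config, pkgs, ... }:

{
  boot.loader = {
    efi.canTouchEfiVariables = true;
    grub = {
      enable = true;
      efiSupport = true;
      zfsSupport = true;
      copyKernels = true;
      # One entry per ESP; GRUB is installed to each of them in turn.
      # DISK1 / DISK2 are hypothetical placeholders, not real device names.
      mirroredBoots = [
        {
          path = "/boot/efis/nvme-DISK1-part1";
          efiSysMountPoint = "/boot/efis/nvme-DISK1-part1";
          devices = [ "nodev" ]; # assumption: EFI only, no BIOS boot sector
        }
        {
          path = "/boot/efis/nvme-DISK2-part1";
          efiSysMountPoint = "/boot/efis/nvme-DISK2-part1";
          devices = [ "nodev" ];
        }
      ];
    };
  };
}
```

Whether path should point at each ESP or at a shared /boot depends on where the kernels are supposed to live, so that part in particular is a guess.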

What are good next steps to take, to make the system viable again? How should I migrate away from the now deleted extraInstallCommands script? I have a rough plan in my head, but since this concerns a live system, I would love some input.

gmelikov commented 2 days ago

Unfortunately, I don't use NixOS, so I can't help with this, and we don't have an active NixOS doc contributor. If more problems come up with this guide, we'll have to deprecate it.

FWIW, as a workaround, you could start by using only one boot disk.