nix-community / disko

Declarative disk partitioning and formatting using nix [maintainer=@Lassulus]

ZFS features support #298

Open nbraud opened 1 year ago

nbraud commented 1 year ago

As suggested in https://github.com/nix-community/disko/issues/267#issuecomment-1636789807, here's an issue to summarize ZFS features that disko does not currently support:

Mic92 commented 1 year ago

Always good to include some examples for @Lassulus

Examples for Log device:

This creates a pool with two mirror vdevs (each over two disks) plus a mirrored log device over another two disks:

$ zpool create datap mirror c0t5000C500335F95E3d0 c0t5000C500335F907Fd0 \
   mirror c0t5000C500335BD117d0 c0t5000C500335DC60Fd0 \
   log mirror c0t5000C500335E106Bd0 c0t5000C500335FC3E7d0

Example for cache device:

$ zpool create system1 mirror c2t0d0 c2t1d0 c2t3d0 cache c2t5d0 c2t8d0

Example for multiple vdevs (here, multiple mirrors, each forming its own vdev):

$ zpool create system1 mirror c1d0 c2d0 mirror c3d0 c4d0
victorklos commented 1 year ago

Some more examples @Lassulus:

Mirrored spinning rust (HDDs) with a mirrored special device:

$ zpool create mypool mirror a b special mirror e f

raidz with a mirrored special device:

$ zpool create mypool raidz a b c d special mirror e f
victorklos commented 1 year ago

I have given this issue some thought, as the lack of expressiveness is blocking my application of disko (and nixos-anywhere).

The current approach is to define disks and partitions as type zfs belonging to some pool. Then, on pool creation, all disks/partitions that were marked as belonging to that pool are simply copied onto the command line. In other words, there is no concept of a vdev, so none of the examples above can be created (even more restrictive: since there is only one mode per zpool, only very simple mirror a b or raidz a b c topologies can be defined).
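
For reference, here is roughly what that flat interface can express today (a sketch based on my understanding; device paths are placeholders): every disk whose content is marked with the pool name gets appended to a single zpool create invocation, prefixed by one pool-wide mode.

# sketch of the current, flat interface; device paths are placeholders
disk = {
  hdda = {
    type = "disk";
    device = "/dev/disk/by-id/ata-XXX";
    content = { type = "zfs"; pool = "tank"; };
  };
  hddb = {
    type = "disk";
    device = "/dev/disk/by-id/ata-YYY";
    content = { type = "zfs"; pool = "tank"; };
  };
};
zpool.tank = {
  type = "zpool";
  mode = "mirror"; # the single mode shared by the whole pool
};
# roughly: zpool create tank mirror /dev/disk/by-id/ata-XXX /dev/disk/by-id/ata-YYY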

Now, one could introduce the vdev concept in disko, but that would take quite some programming and would probably turn out rigid. If we assume that anyone fiddling around with ZFS has at least some understanding of the zpool concepts, we could instead leave the flexibility to the user.

So I propose a zfs_tag field in disk and a topology field in zpool, which would allow the following:

disk = {
  hdda = {
    type = "disk";
    device = "/dev/disk/by-id/ata-XXX";
    content = {
      type = "zfs";
      pool = "tank"; # could be left out
      zfs_tag = "a";
    };
  };
  hddb = {
    type = "disk";
    device = "/dev/disk/by-id/ata-YYY";
    content = {
      type = "zfs";
      pool = "tank"; # could be left out
      zfs_tag = "b";
    };
  };
};
zpool = {
  tank = {
    type = "zpool";
    options = {
      ashift = "12";
      autotrim = "on";
    };
    topology = "mirror ${a} ${b}";
  };
};

Of course zfs_tag could be replaced by zfs_vdev_member or whatever tickles your fancy.

Mic92 commented 1 year ago

I think topology then needs to be a function to be able to reference the device names:

disk = {
  hdda = {
    type = "disk";
    device = "/dev/disk/by-id/ata-XXX";
    content = {
      type = "zfs";
      pool = "tank"; # could be left out
      zfs_tag = "a";
    };
  };
  hddb = {
    type = "disk";
    device = "/dev/disk/by-id/ata-YYY";
    content = {
      type = "zfs";
      pool = "tank"; # could be left out
      zfs_tag = "b";
    };
  };
};
zpool = {
  tank = {
    type = "zpool";
    options = {
      ashift = "12";
      autotrim = "on";
    };
    topology = devices: "mirror ${devices.a} ${devices.b}";
  };
};
victorklos commented 1 year ago

Ah nice, I am very new to nix!

Would that also work for tagged partitions? (Not in my example.)

Lassulus commented 1 year ago

Hmm, I want to avoid having users define functions in disko; we rely on the ability to serialize the config to/from JSON, and we also couldn't type-validate it anymore. Maybe we introduce a new top-level type? Can one partition be part of multiple vdevs? Maybe something like this:

disk = {
  hdda = {
    type = "disk";
    device = "/dev/disk/by-id/ata-XXX";
    content = {
      type = "zfs";
      vdevs = [ "aaa" ];
    };
  };
  hddb = {
    type = "disk";
    device = "/dev/disk/by-id/ata-YYY";
    content = {
      type = "zfs";
      vdevs = [ "aaa" ];
    };
  };
};
zfs_vdev = {
   aaa = {
     type = "mirror";
     zpool = "tank";
   };
};
zpool = {
  tank = {
    type = "zpool";
    options = {
      ashift = "12";
      autotrim = "on";
    };
  };
};
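
(For context on the serialization point, a quick illustration in plain Nix: attribute sets of simple values round-trip through JSON, functions don't.)

# plain data serializes fine:
builtins.toJSON { topology = "mirror a b"; }
# => "{\"topology\":\"mirror a b\"}"

# a function-valued option cannot be serialized:
builtins.toJSON { topology = devices: "mirror ${devices.a} ${devices.b}"; }
# error: cannot convert a function to JSON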
victorklos commented 1 year ago

can one partition be part of multiple vdevs?

No.

A pool consists of one or more virtual devices or vdevs. If you lose a vdev, you lose the pool. So, it makes sense to have redundancy at the vdev level. This can be done in many ways: (2, 3, ..., n-way) mirror, raidz, draid etc. Also, you can add additional stuff to a pool that is not a vdev. Examples are cache, ZIL, special devices etc. Some you can afford to lose, others make you lose data or even the pool in case of failure. So for these extras there is also the possibility for data redundancy with mirror etc. (Note that there is the possibility to organise redundancy at the dataset level, but that is not important for this discussion.)

You can create vdevs using files, partitions and/or disks. Files are supported for debugging purposes, and disko not supporting that is absolutely fine. In principle, ZFS should be given disks but using a partition is not a problem as long as you don't use the rest of the disk actively in the same or another pool. For example, having a disk with an EFI partition and a second partition which you add to a vdev is perfectly ok.

Some examples of what should be supported are at the top of this thread. But basically:

zpool create =
  some mode of multiple (1+) partitions/disks for vdev a +
  some mode of multiple (1+) partitions/disks for vdev b +     # repeat many times
  some mode of multiple (1+) partitions/disks for purpose x +
  some mode of multiple (1+) partitions/disks for purpose y  # repeat for cache, ZIL, special etc.

(If I weren't too rusty I'd give BNF a shot but alas ;)

The combinations are vast, which is why in my original proposal I tried giving the topology as a string. It doesn't have to be a function of course, and a string can at least allow some level of parsing, checking and generation of warnings or errors, and it is also serialisable.
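
As a sketch of how such a tag-based string could be expanded without user-defined functions (hypothetical placeholder syntax and helper, nothing disko provides today; assumes nixpkgs lib is in scope):

# Hypothetical: tags appear as @tag@ placeholders in the topology string and are
# substituted with the resolved device paths at evaluation time.
let
  inherit (lib) attrNames attrValues replaceStrings;
  taggedDevices = {
    a = "/dev/disk/by-id/ata-XXX";
    b = "/dev/disk/by-id/ata-YYY";
  };
  topology = "mirror @a@ @b@";
  expand = s: replaceStrings
    (map (t: "@${t}@") (attrNames taggedDevices))
    (attrValues taggedDevices)
    s;
in
  "zpool create tank ${expand topology}"
# => "zpool create tank mirror /dev/disk/by-id/ata-XXX /dev/disk/by-id/ata-YYY"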

victorklos commented 1 year ago

Time for another iteration. How about this:

disk = {
  ssd = {
    type = "disk";
    device = "/dev/disk/by-id/ata-AAA";
    content = {
      type = "table";
      format = "gpt";
      partitions = [
      {
        name = "ESP";
        end = "512MB";
        bootable = true;
        content = {
          type = "filesystem";
          format = "vfat";
          mountpoint = "/boot";
        };
      }
      {
        start = "512MB";
        end = "100%";
        content = {
          type = "zfs";
          zfs_member_tag = "cache";
        };
      }
      ];
    };
  };
  hdda = {
    type = "disk";
    device = "/dev/disk/by-id/ata-XXX";
    content = {
      type = "zfs";
      zfs_member_tag = "vdev";
    };
  };
  hddb = {
    type = "disk";
    device = "/dev/disk/by-id/ata-YYY";
    content = {
      type = "zfs";
      zfs_member_tag = "vdev";
    };
  };
};
zpool = {
  tank = {
    type = "zpool";
    options = {
      ashift = "12";
      autotrim = "on";
    };
    topology = disko.lib.zfs.mkVdevMirror "vdev" + disko.lib.zfs.mkCache "cache";
  };
};

Nix is not one of my languages (yet) so excuse the possible nonsense, but I hope this makes enough sense. Possible checks that could be done at runtime include having at least one vdev, not assigning the same member to multiple ~mirrors~ pools etc.

It is a question of style to choose between mkVdev "mirror" "vdev" and mkVdevMirror "vdev" variants, plus there is some other stuff to be sorted out but please regard this as an iteration, not a final proposal.

BTW I renamed zfs_tag to zfs_member_tag because ZFS calls partitions and disks that are in a pool members.

Lassulus commented 1 year ago

How about we change the field from zfs_member_tag to just name? And in topology we could do something like this:

topology = {
  somename = {
    type = "mirror";
    members = [ "vdev_XXX" "vdev_YYY" ];
  };
  somename2 = {
    type = "cache";
    members = [ "vdev_cache" ];
  };
};

Not sure if we can do everything with that, or how clunky it is. I'm still a bit confused about what the difference between mode and purpose is, and which modes and purposes are available.

Lassulus commented 1 year ago

Or maybe something with tags would be quite nice:

topology = {
  "sometagname" = { type = "mirror"; };
  "cache" = { type = "cache"; };
  "mirror" = { type = "mirror"; };
};

What's nicer with tags than names is that we don't have to make sure they are unique. So I would prefer those for the implementation?

victorklos commented 1 year ago

So a mode (a word I used because it is already in the implementation) indicates - on the command line - the way a bunch of members provide some form of redundancy. The most important ones are "mirror" and "raidz" (at least if you ask me).

Those members can become a vdev, but don't have to. A pool must have at least one vdev to store data, can have more vdevs, and can also be enhanced by using (bunches of) members to perform other specific roles, which is what I called purposes above. Examples of these are cache (think of an SSD to make recurring reads faster), special (to store metadata separately for speed, and to provide flexibility like choosing what data goes on SSD (e.g. VMs, containers) and what data goes on spinning rust (e.g. media)) and the ZIL (ZFS Intent Log, for fast synchronous writes). The designer of the pool chooses the combination depending on the types of traffic the pool is expected to meet.

Some annotated examples:

# create a pool named tank with a single vdev comprising a single partition/disk
zpool create tank a

# create a pool named tank with a single vdev with two disks in mirror
zpool create tank mirror a b

# create a pool named tank with two raidz vdevs (striped together at the pool level), plus a single SSD cache device
zpool create tank raidz a b c raidz d e f cache g

# create a pool named tank with a single vdev of two HDDs in mirror, with a special
# (metadata) virtual device on SSDs also in mirror (my favourite in the homelab)
zpool create tank mirror a b special mirror c d

Many more details are in ZFS pool concepts.

victorklos commented 1 year ago

How about we change the field from zfs_member_tag to just name?

Yeah, this looks very natural to me. Just with a tiny fix regarding the vdevs:

topology = {
  vdev = {
    type = "mirror";
    members = [ "XXX" "YYY" ];
  };
  cache = {
    type = "cache";
    members = [ "cache" ];
  };
};

Would work for me!

victorklos commented 1 year ago

(For the tags versus names comment: I am not proficient enough in nix to have an opinion on that)

victorklos commented 1 year ago

Sorry for spamming, we are close but not there yet! I think we still need a mode element.

How about:

topology = {
  vdev = {
    type = "vdev"; # results in empty string on command line, see examples
    mode = "raidz";
    members = [ "XXX" "YYY" "ZZZ" ];
  };
  special = {
    type = "special";
    mode = "mirror";
    members = [ "ssda" "ssdb" ];
  };
  cache = {
    type = "cache";
    mode = null; # can be left out
    members = [ "cache" ];
  };
};

With all the member references resolving to the names of the partitions/disks as you suggested.

Lassulus commented 1 year ago

What happens if a partition is in both the mirror member list and the cache member list? Is this something we want to catch, or is this OK?

victorklos commented 1 year ago

That would cause zpool create to fail:

# truncate --size=10G {a,b}
# zpool create errpool mirror /root/a /root/b cache /root/b
cannot create 'errpool': one or more vdevs refer to the same device, or one of
the devices is part of an active md or lvm device
victorklos commented 1 year ago

In order to get the point across that a single vdev must exist for data, I stated above that bunches of disks can become vdevs. This is not exactly how it works, see zpool concepts.

So a vdev is just a virtual device. It can be a single file, a partition, a disk, or a combination of those in mirror, raidz or draid (a specific kind of raidz). Some vdevs only hold specific data: (intent) log, dedup, special and cache. This latter category are still vdevs (built from a single file, a partition, a disk, or a combination of those in mirror, raidz or draid), but the distinction is that they don't count as regular vdevs (of which a pool must have at least one, to store data).

In other words, where the ZFS documentation doesn't make the distinction between mode and type I still feel we should, because the zpool create command line demands it.

That being said, the type = "vdev" above should probably be changed into type = "plain" (or regular or data or normal or ...) because as outlined above all entries in the topology are in fact vdevs.

Like this:

topology = {
  data = {
    type = "plain"; # results in empty string on command line, see examples
    mode = "raidz";
    members = [ "XXX" "YYY" "ZZZ" ];
  };
  special = {
    type = "special";
    mode = "mirror";
    members = [ "ssda" "ssdb" ];
  };
  cache = {
    type = "cache";
    mode = null; # can be left out
    members = [ "cache" ];
  };
};
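
For reference, a rough sketch (not disko code) of how that topology attrset could be rendered into the zpool create arguments, assuming nixpkgs lib is in scope and member names have already been resolved to device paths (the paths below are placeholders):

let
  devices = {
    XXX = "/dev/sdx"; YYY = "/dev/sdy"; ZZZ = "/dev/sdz";
    ssda = "/dev/sda"; ssdb = "/dev/sdb"; cache = "/dev/sdc";
  };
  renderVdev = v: lib.concatStringsSep " " (lib.filter (s: s != "") [
    (lib.optionalString (v.type != "plain") v.type)          # "special", "cache", ...
    (lib.optionalString ((v.mode or null) != null) v.mode)   # "mirror", "raidz", ...
    (lib.concatMapStringsSep " " (m: devices.${m}) v.members)
  ]);
  # keep the plain (data) vdevs first, matching the examples above
  ordered = [ topology.data topology.special topology.cache ];
in
  "zpool create tank " + lib.concatMapStringsSep " " renderVdev ordered
# => "zpool create tank raidz /dev/sdx /dev/sdy /dev/sdz special mirror /dev/sda /dev/sdb cache /dev/sdc"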

I would be willing to take a shot at implementing this if someone could give me some pointers on how to approach it and point me at existing code that looks similar (at least enough to serve as inspiration ;). I am new to nix but not to programming.

PS or even vdevs = {} instead of topology?

Lassulus commented 1 year ago

vdev instead of topology sounds like a good idea. I think we should stay very close to ZFS terms so people are not confused by new words/concepts we introduce. About implementing: I guess this will be quite intricate, and I'm not sure exactly how we want to do it. The best first step would be to create some example configs and the corresponding command lines they should generate.

qm3ster commented 1 year ago

If we use ZFS terms, topology would actually be the name for the mode field. (And if we don't, I much prefer role to type 😛) I suggest type "plain" be encoded by default/null/omission. Of note is that the "cache" type supports only a single member. In general, I suggest that a null mode enforce a single member, i.e. having multiple members demands that a mode be specified. I think this would look nicest if we pass a single device string instead of a list in those cases, making it more apparent.

victorklos commented 1 year ago

Good points on the cache, single member would definitely be more clear. And I like role too.

Personally I think topology would describe the layout of the whole pool, not just a single vdev. But the thing with topology is (and I proposed it myself): the word is not in man zpool create nor in man zpoolprops nor in man zpoolconcepts. So while it may sound like an appropriate word and normal ZFS parlance, ZFS itself goes a long way to describe zpools using just the word vdev and its plural. I myself would be fine with it, but I also do get the point Lassulus is making on staying close to ZFS terms.

So this is where we are now:

vdevs = {
  data = {
    mode = "raidz";
    members = [ "XXX" "YYY" "ZZZ" ];
  };
  special = {
    role = "special";
    mode = "mirror";
    members = [ "ssda" "ssdb" ];
  };
  cache = {
    role = "cache";
    member = "cache";
  };
};

Note that now member exists next to members but it makes everything so much clearer that the implementation effort is probably worth it. Speaking of which, I will try to create tests somewhere next week, this weekend is taken ;)

ConnorBaker commented 1 year ago

Just chiming in as a happy user of disko and nixos-anywhere -- I'm absolutely thrilled that this is being considered! I'm in the process of building a home server/desktop with 16 HDDs and need the ability to describe storage pool configurations beyond what we currently can.

@victorklos as someone newer to ZFS, I've found your comments extremely informative, especially given that they're provided with (potential) disko configurations.

victorklos commented 1 year ago

Thanks for your kind words @ConnorBaker. A nice example of the power of open source collaboration I'd say :+1:

TheRealGramdalf commented 6 months ago

Not quite sure if I should just open another issue, but this seems semi-related:

I'm getting an error when trying to run disko with the following config:

{
  disko.devices = {
    disk."zdisk" = {
      device = "/dev/sda";
      type = "disk";
      content = {
        type = "gpt";
        partitions = {
          # This allows a GPT partition table to be used with legacy BIOS systems. See https://www.gnu.org/software/grub/manual/grub/html_node/BIOS-installation.html
          "mbr" = {
            label = "mbr";
            size = "1M";
            type = "EF02";
          };
          "zroot" = {
            label = "zroot";
            size = "100%";
            content = {
              type = "zfs";
              pool = "zroot";
          };};
          "ZSYS" = {
            label = "ZSYS";
            end = "-512M"; # Negative end offset places this partition at the end of the disk
            type = "EF00"; # ^^ Mostly desirable with resizeable partitions like BTRFS/EXT4, used here for compatibility
            content = {
              type = "filesystem";
              format = "vfat";
              mountpoint = "/boot";
          };};
        };
      };
    };
    zpool."zroot" = {
      type = "zpool";
      options.ashift = "12";
      rootFsOptions = {
        # These are inherited to all child datasets as the default value
        mountpoint = "none";
        compression = "zstd";
        xattr = "sa";
        acltype = "posix";
      };

      datasets = {
        "ephemeral" = {
          type = "zfs_fs";
          options = {
            canmount = "noauto";
            mountpoint = "legacy";
        };};
        "ephemeral/nix" = {
          type = "zfs_fs";
          mountpoint = "/nix";
          options = {
            atime = "off";
            canmount = "noauto";
        };};
        "safe" = {
          type = "zfs_fs";
          options = {
            canmount = "noauto";
            mountpoint = "legacy";
        };};
        "safe/persist" = {
          type = "zfs_fs";
          mountpoint = "/persist";
          options = {
            canmount = "noauto";
        };};
        "safe/home" = {
          type = "zfs_fs";
          mountpoint = "/home";
          options = {
            canmount = "noauto";
        };};
        "system-state" = {
          type = "zfs_fs";
          mountpoint = "/";
          options = {
            canmount = "noauto";
            mountpoint = "legacy";
        };};
      };
      #preMountHook = "zfs snapshot -r zroot@blank";
      # Needs fixing, runs multiple times
    };
  };
}

When mounting, it attempts to run mount -t zfs zroot/safe/home /mnt/home -o X-mount.mkdir -o defaults -o zfsutil, which fails saying

filesystem 'zroot/safe/home' cannot be mounted using 'zfs mount'.
Use 'zfs set mountpoint=/mnt/home' or 'mount -t zfs zroot/safe/home /mnt/home'.
See zfs(8) for more information.

I tracked it down to this line, which to my understanding adds -o zfsutil if the disko option datasets.<datasetname>.options.mountpoint = legacy is not set. This is an issue in my case because the option technically is set through natural ZFS inheritance, but since it isn't declared within disko, disko still adds -o zfsutil, which fails because the ZFS property mountpoint is set to legacy rather than /path/to/mountpoint as expected by zfs mount.

For now I'll just add mountpoint = legacy in the disko config, but handling this more gracefully would be ideal. I'm really not sure how it would best be implemented, since you could either add custom glue to represent inheritance within disko (keeping in mind that not all values are inherited, e.g. canmount), or just run checks at runtime instead of checking against the provided disko config.
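
For anyone hitting the same thing, the workaround spelled out against the config above (a sketch: just restating the inherited value so disko's legacy-mountpoint check sees it; the same applies to the other legacy-mounted child datasets):

"safe/home" = {
  type = "zfs_fs";
  mountpoint = "/home";
  options = {
    canmount = "noauto";
    mountpoint = "legacy"; # explicitly restate the value inherited from "safe"
  };
};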

victorklos commented 6 months ago

Not quite sure if I should just open another issue, but this seems semi-related: <SNIP>

I'd say this deserves its own issue.

luishfonseca commented 4 months ago

Hi, I'm interested in this issue and would be willing to work on a subset of it.

One of the most popular RAID configurations is RAID 10, and the ZFS version of it is striping mirrored vdevs (the third of @Mic92's examples). We support striping, mirrors, and all reasonable raidz configs. This leaves striped mirrors as the only missing essential ZFS topology.

Resolving this issue would address that lack of support, but there might be room for some preliminary work: implementing just striped mirrors and paving the way for fully flexible vdevs. One possible interface for this feature could be the following:

disk = {
  x.content.partitions.zfs.content = {
    type = "zfs";
    pool = "zroot";
    mode = "mirror";
    vdev = 0;
  };
  y.content.partitions.zfs.content = {
    type = "zfs";
    pool = "zroot";
    mode = "mirror";
    vdev = 0;
  };
  z.content.partitions.zfs.content = {
    type = "zfs";
    pool = "zroot";
    mode = "mirror";
    vdev = 1;
  };
  w.content.partitions.zfs.content = {
    type = "zfs";
    pool = "zroot";
    mode = "mirror";
    vdev = 1;
  };
};
# runs "zpool create zroot mirror x y mirror z w"

This moves the mode decision to the disk section, which in my mind makes sense, as this is a many-to-one relationship. This also matches ZFS's behavior a bit more closely.

Some notes on my suggestion: all disks in the same vdev must have the same mode (a ZFS constraint), and by default the vdev is always 0. For backwards compatibility, defining the mode in the zpool sets the mode of all disks in the pool. As such, the following is still valid:

# comments show derived attributes
disk = {
  x.content.partitions.zfs.content = {
    type = "zfs";
    pool = "zroot";
    # mode = "mirror";
    # vdev = 0;
  };
  y.content.partitions.zfs.content = {
    type = "zfs";
    pool = "zroot";
    # mode = "mirror";
    # vdev = 0;
  };
};
zpool.zroot.mode = "mirror";
# runs "zpool create zroot mirror x y"

This also potentially simplifies the first example. However, it complicates defining single striped disks (currently just mode=""):

# comments show derived attributes
disk = {
  x.content.partitions.zfs.content = {
    type = "zfs";
    pool = "zroot";
    # mode = "";
    # vdev = 0;
  };
  y.content.partitions.zfs.content = {
    type = "zfs";
    pool = "zroot";
    # mode = "";
    # vdev = 0;
  };
};
zpool.zroot.mode = "";
# runs "zpool create zroot x y"

This is not great, as it implies they are all on the same vdev, when in reality they're multiple single-drive vdevs. But it would work, due to the empty string.

All these suggestions are backwards compatible, and future work to expand this to log/cache/special devices would simply amount to adding a role setting mutually exclusive to vdev in the disk configs. What do you all think? :)
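
For illustration, that role variant might look something like this (hypothetical option names, not something disko currently has):

# hypothetical `role` field, mutually exclusive with `vdev`
c.content.partitions.zfs.content = {
  type = "zfs";
  pool = "zroot";
  role = "cache";
};
# would append "cache c" to the generated "zpool create zroot ..." line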

dmadisetti commented 1 month ago

To recap, here's the proposed API from last year:

zpool."${zpool}" = {
 topology = { # exclusive with mode
    vdev = [{
      mode = "raidz";
      members = [ "sdXXX" "sdYYY" "sdZZZ" ];
    }]; # repeated
    special = {
      mode = "mirror";
      members = [ "sdAAA" "sdBBB" ];
    };
    cache = "sdcache"; # only 1 device allowed
  };
}

The creation command would be something like:

let
  # assuming nixpkgs `lib`, the pool `name`, and the `topology` attrset above are in scope
  inherit (lib) concatStringsSep concatMapStringsSep optionalString;
  fmt = v: "${v.mode} ${concatStringsSep " " v.members}";
in
"zpool create ${name} ${concatMapStringsSep " " fmt topology.vdev}"
+ optionalString (topology ? special) " special ${fmt topology.special}"
+ optionalString (topology ? cache) " cache ${topology.cache}"
# plus the option stuff already on head

Seems pretty straightforward. Am I missing something?

Probably needs an assertion that all devices are accounted for when using topology instead of mode. Also support for spare, log, and draid, but those are beyond my use case (and no one has mentioned them here).

But I'm happy to get a PR in as proposed above, with those assertions and an example.
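
A sketch of what that assertion could look like (hypothetical helper names; assumes the list of all pool member devices and the list of devices referenced by the topology have already been computed):

assertions = [{
  # every device assigned to the pool must appear in the topology exactly once
  assertion = lib.sort (a: b: a < b) allPoolMembers
           == lib.sort (a: b: a < b) devicesReferencedByTopology;
  message = "zpool ${name}: topology does not reference every member device exactly once";
}];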

dmadisetti commented 1 month ago

Or, instead of mode and topology being exclusive, mode == "prescribed" when a topology is given.

Mic92 commented 3 days ago

https://github.com/nix-community/disko/pull/723 was merged

Mic92 commented 3 days ago

Delegations sound like a simple feature to implement.