
Feature: zpool get/set/list being able to address pool vdevs/vdev members #3810

Open GregorKopka opened 8 years ago

GregorKopka commented 8 years ago

It would be beneficial to allow zpool get/set operations to target vdevs and their members.

Get the ashift of the pool and of individual vdevs (https://github.com/zfsonlinux/zfs/issues/2024):

$ zpool get ashift tank/raidz2-1
tank/raidz2-1 ashift 12
$ zpool get ashift tank -r
tank          ashift 9
tank/raidz2-0 ashift 9
tank/raidz2-1 ashift 12

Allow features like

No new allocations on a vdev (for whatever reason), set manually or applied automatically while degraded (e.g. while a replace of a member disk is running), while still updating superblocks to keep the pool consistent:

$ zpool set readonly=on tank/raidz2-1

Do not serve reads from this vdev member as long as the remaining redundancy allows it (a slow drive in a mirror for cheaper redundancy, https://github.com/zfsonlinux/zfs/issues/1742):

$ zpool set writeonly=on tank/mirror-0/slowdrive1

This would also provide the interface to tune features like the Metadata Allocation Class (https://github.com/zfsonlinux/zfs/issues/3779):

$ zpool set allocate=metadata tank/mirror-0

Combined, this would allow for interestingly performing setups, for example a pool with:

$ zpool list -o name,size,ashift,allocate,writeonly,readonly tank -r

NAME                            SIZE  ASHIFT    ALLOCATE  WRITEONLY  READONLY
tank                           <big>      12         all          -        off
tank/mirror-0                   500G       9    metadata          -        off
tank/mirror-0/SSD1              500G       9           -        off          -
tank/mirror-0/HDD1              500G       9           -         on          -
tank/mirror-0/HDD2              500G       9           -         on          -
tank/mirror-0/HDD3              500G       9           -         on          -
tank/mirror-1                   250G       9         ddt          -        off
tank/mirror-1/SSD2              250G       9           -        off          -
tank/mirror-1/SSD3              250G       9           -        off          -
tank/mirror-1/HDD5              250G       9           -         on          -
tank/mirror-1/HDD6              250G       9           -         on          -
tank/raidz3-0                    16T      12        data          -        off
   [...]
tank/raidz3-1                    16T      12        data          -         on
tank/raidz3-1/HDD3-1              4T      12        data          -          -
tank/raidz3-1/HDD3-2              4T      12        data          -          -
tank/raidz3-1/replacing-0         4T      12        data          -          -
tank/raidz3-1/replacing-0/HDD3-3  4T      12        data          -         on
tank/raidz3-1/replacing-0/new     4T      12        data          -          -
tank/raidz3-1/HDD3-4              4T      12        data          -          -
tank/raidz3-1/HDD3-5              4T      12        data          -          -
tank/raidz3-1/HDD3-6              4T      12        data          -          -
tank/raidz3-1/HDD3-7              4T      12        data          -          -
tank/raidz3-2                    16T      12        data          -        off
[...]
GregorKopka commented 6 years ago

To bump this a little after the OpenZFS Summit talk about vdev properties (YouTube):

@allanjude Thanks for working on this. I think the user interface (as presented) needs a different approach.

As the vdevs form a hierarchical tree structure, wouldn't it make more sense to extend the existing interface of zpool get/set/list by allowing their [pool] argument to be poolname[/vdev-name[/...]] and adding -r and -d options to recurse over children?

This would change the example in your talk from

zpool get size@ada0p3,alloc@ada0p3,free@ada0p3,ashift@ada0p3,state@ada0p3 zroot
NAME   PROPERTY       VALUE   SOURCE
zroot  size@ada0p3    45.5G   -
zroot  alloc@ada0p3   19.5G   -
zroot  free@ada0p3    26.0G   -
zroot  ashift@ada0p3  12      -
zroot  state@ada0p3   ONLINE  -

to

zpool get size,alloc,free,ashift,state zroot/ada0p3
NAME          PROPERTY  VALUE   SOURCE
zroot/ada0p3  size      45.5G   -
zroot/ada0p3  alloc     19.5G   -
zroot/ada0p3  free      26.0G   -
zroot/ada0p3  ashift    12      -
zroot/ada0p3  state     ONLINE  -

This would be easier to read and far less to type (especially when querying multiple vdevs).

Plus, this kind of information would also be accessible through zpool list, given -r|-d and -o arguments.
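
For illustration, a minimal sketch of how such a listing could look, assuming the path-based addressing and the -r recursion proposed here (neither exists in zpool today; the columns and values are only meant to mirror the zroot example above):

$ zpool list -r -o name,size,ashift,state zroot
NAME          SIZE   ASHIFT  STATE
zroot         45.5G      12  ONLINE
zroot/ada0p3  45.5G      12  ONLINE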

allanjude commented 6 years ago

We have been discussing this in the #vdev-properties channel on the OpenZFS Slack, and we came up with:

zpool vdev-set <property>=<value> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...
zpool vdev-get <property|all> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...

Although with the new way of specifying the vdev, we could probably go back to just using get/set instead of the special vdev version.
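
To make that concrete, a hypothetical invocation in that proposed syntax, using the vdev names from the opening example (the output layout is guessed; nothing here is implemented):

$ zpool vdev-get ashift tank:raidz2-0,raidz2-1
NAME           PROPERTY  VALUE  SOURCE
tank/raidz2-0  ashift    9      -
tank/raidz2-1  ashift    12     -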

We have been through about 5 iterations so far this week ;)

allanjude commented 6 years ago

About your other point, 'write only' on members of a mirror: the current plan is to implement that with something like this:

zpool set readbias=ssd1[,ssd2] zroot:mirror-0

That is to say, implement it as a property of the top-level vdev, rather than as a property of the leaf, so that it would persist through a zpool replace etc.
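
A sketch of that rationale, using the proposed (hypothetical) syntax with placeholder device names; because readbias is stored on the top-level vdev mirror-0, swapping out a member leaves it untouched (the get output below is likewise only a guess):

$ zpool set readbias=ssd1 zroot:mirror-0
$ zpool replace zroot hdd1 hdd1-new
$ zpool get readbias zroot:mirror-0
NAME            PROPERTY  VALUE  SOURCE
zroot:mirror-0  readbias  ssd1   local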

GregorKopka commented 6 years ago

> zpool vdev-set <property>=<value> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...
> zpool vdev-get <property|all> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...
> Although with the new way of specifying the vdev, we could probably go back to just using get/set instead of the special vdev version.

Apart from the question of what is wrong with / as a path separator, given that it is used basically everywhere (filesystems, URIs, ZFS datasets, whatnot):

What is the reasoning behind aiming for a flat addressing scheme for a hierarchical data structure?

> zpool set readbias=ssd1[,ssd2] zroot:mirror-0
> That is to say, implement it as a property of the top-level vdev, rather than as a property of the leaf, so that it would persist through a zpool replace etc.

Why the need to survive a replace?

And why put that into the mirror vdev instead of into the member where it actually belongs ("don't read from this unless needed"), so that the code picking the mirror disk to read from (which already works with the leaves to look at queue depths and whatnot) can easily access it and quickly test for it? The rotational and other flags relevant to the disk are also kept in the mirror vdev leaves, not in their parent node...

GregorKopka commented 6 years ago

@allanjude Further thought on readbias: if it is only a tendency, as the word implies, it is useless compared to writeonly, as the latter would guarantee that no reads at all (unless needed for scrub or resilver) are queued to the tagged device.

Using a 7,200 RPM HDD with ~100 IOPS vs. an SSD with 250k read IOPS to make the point: queue a single data read to the HDD and latency increases by several orders of magnitude; should it be a metadata demand read, then throughput drops by a similar factor as well (as all further reads that depend on that metadata cannot start).
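
A rough back-of-envelope illustration of those numbers (assuming ~10 ms per random HDD read at ~100 IOPS and ~4 µs per SSD read at 250k IOPS): a single read that lands on the HDD instead of the SSD takes roughly 10 ms instead of 4 µs, about 2,500 times longer, so even one demand read routed to the slow leaf dominates the latency of everything queued behind it.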

A bias is not what I (and I guess everyone else) want. Should readbias work like writeonly, just with reversed tagging: please give it a less misleading name to avoid confusion.

allanjude commented 6 years ago

> And why put that into the mirror vdev instead of into the member where it actually belongs ("don't read from this unless needed"), so that the code picking the mirror disk to read from (which already works with the leaves to look at queue depths and whatnot) can easily access it and quickly test for it? The rotational and other flags relevant to the disk are also kept in the mirror vdev leaves, not in their parent node...

It was interesting to learn from @grwilson that the queueing happens at the top-level vdev, not the leaf. So that was one of the reasons we decided we didn't need to worry about the leaf vdevs as much.

Currently, leaf vdev ZAPs are empty aside from vdev removal cases.

Based on the discussion, the first implementation will focus on top-level vdevs, and we'll look at leaf vdevs for an expanded version afterwards. The complexity of 'zpool replace', and the promotion/demotion from top-level vdev when you zpool attach/detach and turn single devices into mirrors, or mirrors into single devices, was greater than we expected when we whiteboarded it at the ZFS DevSummit Hackathon.
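
For reference, the promotion/demotion in question can be seen with the existing attach/detach commands (device names here are placeholders):

$ zpool attach tank da0 da1   # single-disk top-level vdev da0 becomes a mirror with children da0 and da1
$ zpool detach tank da1       # detaching the second disk turns the mirror back into a plain single-disk vdev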

GregorKopka commented 4 years ago

@allanjude Any progress on this?

GregorKopka commented 9 months ago

> And why put that into the mirror vdev instead of into the member where it actually belongs ("don't read from this unless needed"), so that the code picking the mirror disk to read from (which already works with the leaves to look at queue depths and whatnot) can easily access it and quickly test for it? The rotational and other flags relevant to the disk are also kept in the mirror vdev leaves, not in their parent node...

> It was interesting to learn from @grwilson that the queueing happens at the top-level vdev, not the leaf. So that was one of the reasons we decided we didn't need to worry about the leaf vdevs as much.

To pick this topic up a little while later: yes, the queuing happens on the top-level vdev, but it uses information from the leaves (rotational, queue length, ...). Thus the information "do not read from this unless needed, because no other source can fulfill the request" should IMHO be stored in the leaf, not in the parent.

> Currently, leaf vdev ZAPs are empty aside from vdev removal cases.

> Based on the discussion, the first implementation will focus on top-level vdevs, and we'll look at leaf vdevs for an expanded version afterwards. The complexity of 'zpool replace', and the promotion/demotion from top-level vdev when you zpool attach/detach and turn single devices into mirrors, or mirrors into single devices, was greater than we expected when we whiteboarded it at the ZFS DevSummit Hackathon.

I think something like writeonly does not need to survive a replacement of the leaf (at least I cannot see a use case where that would make sense or be needed; YMMV, and if it does, please enlighten me) or the conversion of a mirror into a single-drive vdev.