Open GregorKopka opened 8 years ago
To bump this a little after the OpenZFS Summit talk about vdev properties (youtube).
@allanjude Thanks for working on this. I think the user interface (as presented) needs a different approach.
As the vdevs form a hierarchical tree structure, wouldn't it make more sense to extend the existing interface of zpool get/set/list by allowing their [pool] argument to be poolname[/vdev-name[/...]], and adding -r and -d to recurse over children?
This would change the example in your talk from
```
zpool get size@ada0p3,alloc@ada0p3,free@ada0p3,ashift@ada0p3,state@ada0p3 zroot
NAME   PROPERTY       VALUE   SOURCE
zroot  size@ada0p3    45.5G   -
zroot  alloc@ada0p3   19.5G   -
zroot  free@ada0p3    26.0G   -
zroot  ashift@ada0p3  12      -
zroot  state@ada0p3   ONLINE  -
```
to
```
zpool get size,alloc,free,ashift,state zroot/ada0p3
NAME          PROPERTY  VALUE   SOURCE
zroot/ada0p3  size      45.5G   -
zroot/ada0p3  alloc     19.5G   -
zroot/ada0p3  free      26.0G   -
zroot/ada0p3  ashift    12      -
zroot/ada0p3  state     ONLINE  -
```
Which would be easier to read, and far less to type (especially when querying multiple vdevs).
Plus, this kind of information would also be accessible through zpool list, given -r|-d and -o arguments.
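A hypothetical session under this scheme (the -r and -d flags and the pool/vdev path addressing are the proposal here, not an existing interface; vdev names are illustrative) could then look like:

```shell
# Proposed, not existing: recurse over all children of the pool
zpool list -r -o name,size,alloc,free,ashift,state zroot

# Proposed: limit recursion depth to the top-level vdevs
zpool list -d 1 -o name,state zroot

# Proposed: address a leaf directly by its path in the tree
zpool get ashift zroot/mirror-0/ada0p3
```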
We have been discussing this in the #vdev-properties channel on the OpenZFS slack, and we came up with:
```
zpool vdev-set <property>=<value> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...
zpool vdev-get <property|all> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...
```
Although with the new way of specifying the vdev, we could probably go back to just using get/set instead of the special vdev version.
We have been through about 5 iterations so far this week ;)
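Spelled out with concrete pool and vdev names (illustrative only; this syntax is still under discussion, not implemented), the proposed subcommands would read:

```shell
# Proposed syntax, not an existing interface; pool/vdev names are made up
zpool vdev-get all zroot:mirror-0
zpool vdev-get ashift,state zroot:ada0p3,ada1p3 tank:mirror-0
zpool vdev-set readbias=ssd1 zroot:mirror-0
```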
About your other point, for 'write only' on members of a mirror, the current plan is to implement it something like this:

```
zpool set readbias=ssd1[,ssd2] zroot:mirror-0
```

That is to say, implement it as a property of the top-level vdev rather than of the leaf, so that it would persist through a zpool replace etc.
> `zpool vdev-set <property>=<value> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...`
> `zpool vdev-get <property|all> <pool1:vdev1,vdev2,vdev3,...> <pool2:vdev4,vdev5,vdev6> ...`
>
> Although with the new way of specifying the vdev, we could probably go back to just using get/set instead of the special vdev version.
Apart from the question of what is wrong with / as a path separator, as it's universally used basically everywhere (filesystems, URI zfs datasets, whatnot):
What is the reasoning behind aiming for a flat addressing scheme for a hierarchical data structure?
> `zpool set readbias=ssd1[,ssd2] zroot:mirror-0`
>
> That is to say, implement it as a property of the top-level vdev rather than of the leaf, so that it would persist through a zpool replace etc.
Why the need to survive a replace?
And why put that into the mirror vdev instead of the member where it actually belongs (don't read from this, unless needed), so that the code picking the mirror disk to read from (which already works with the leaves to look at queue depths and whatnot) can easily access it and quickly test for it? Rotational and other flags relevant to the disk also live in the mirror vdev's leaves, not their parent node...
@allanjude Further thought on readbias: if it's only a tendency, as the word implies, it's useless compared to writeonly, as the latter would make sure no reads at all (unless needed for scrub or resilver) are queued to the tagged device.
Using a 7.2k RPM HDD with 100 IOPS vs. an SSD with 250k read IOPS to make the point: have one data read queued to the HDD and latency increases by over three orders of magnitude; should it be a metadata demand read, throughput drops by the same factor (as all further reads that depend on that metadata can't start).
Bias isn't what I (and I guess everyone else) want. Should readbias work like writeonly, just with reversed tagging: please give it a less misleading name to avoid confusion.
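A quick back-of-envelope calculation, using only the figures quoted above (a sketch of the argument, not a benchmark of any real device):

```shell
# Per-read service time implied by the quoted IOPS figures:
# a 7.2k RPM HDD at ~100 IOPS vs. an SSD at ~250k read IOPS.
hdd_iops=100
ssd_iops=250000
hdd_us=$((1000000 / hdd_iops))   # microseconds per read on the HDD
ssd_us=$((1000000 / ssd_iops))   # microseconds per read on the SSD
echo "HDD ~${hdd_us}us/read, SSD ~${ssd_us}us/read, ratio ~$((hdd_us / ssd_us))x"
# prints: HDD ~10000us/read, SSD ~4us/read, ratio ~2500x
```

A single read queued to the HDD thus costs roughly 2500 times the latency of serving it from the SSD, which is why a hard writeonly flag, not a bias, matters here.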
> And why put that into the mirror vdev instead of the member where it actually belongs (don't read from this, unless needed), so that the code picking the mirror disk to read from (which already works with the leaves to look at queue depths and whatnot) can easily access it and quickly test for it? Rotational and other flags relevant to the disk also live in the mirror vdev's leaves, not their parent node...
It was interesting to learn from @grwilson that the queueing happens at the top-level vdev, not the leaf. So that was one of the reasons we decided we didn't need to worry about the leaf vdevs as much.
Currently, leaf vdev ZAPs are empty aside from vdev removal cases.
Based on the discussion, the first implementation will focus on top-level vdevs, and we'll look at leaf vdevs for an expanded version afterwards. The complexity of 'zpool replace', and the promotion/demotion from top-level vdev when you zpool attach/detach and turn single devices into mirrors (or mirrors into single devices), was greater than we expected when we whiteboarded it at the ZFS DevSummit Hackathon.
@allanjude Any progress on this?
> And why put that into the mirror vdev instead of the member where it actually belongs (don't read from this, unless needed), so that the code picking the mirror disk to read from (which already works with the leaves to look at queue depths and whatnot) can easily access it and quickly test for it?
>
> It was interesting to learn from @grwilson that the queueing happens at the top-level vdev, not the leaf. So that was one of the reasons we decided we didn't need to worry about the leaf vdevs as much.
To pick this topic up a little while later: yes, the queuing happens on the top-level vdev, but it uses the information on the leaves (rotational, queue length, ...). Thus the information "do not read from this unless needed, because no other sources to fulfill the request exist" should IMHO be stored in the leaf, not the parent.
> Currently, leaf vdev ZAPs are empty aside from vdev removal cases.
>
> Based on the discussion, the first implementation will focus on top-level vdevs, and we'll look at leaf vdevs for an expanded version afterwards. The complexity of 'zpool replace', and the promotion/demotion from top-level vdev when you zpool attach/detach and turn single devices into mirrors (or mirrors into single devices), was greater than we expected when we whiteboarded it at the ZFS DevSummit Hackathon.
I think something like writeonly does not need to survive a replacement of the leaf (at least I can't see a use case where this would make any sense or be needed; YMMV, and if it does: please enlighten me) or turning a mirror into a single-drive vdev.
It would be beneficial to allow zpool get/set operations to target vdevs and their members:

- Get ashift of pool (https://github.com/zfsonlinux/zfs/issues/2024):
- Allow features like:
  - No new allocations on a vdev (for whatever reason), or even automatically when degraded (manually set, or while a replace of a member disk is running), while still updating superblocks to keep the pool consistent:
  - Do not serve reads from a vdev member while redundancy allows it (slow drive in a mirror for more cheap redundancy, https://github.com/zfsonlinux/zfs/issues/1742):
- This would also deliver the interface to tune features like Metadata Allocation Class (https://github.com/zfsonlinux/zfs/issues/3779)

Combined, this would allow for interestingly performing setups, (example) pool with:
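As a sketch of how those features might surface on the command line, using the hierarchical addressing proposed above (the property names allocations and serveread are invented here purely for illustration; nothing in this thread fixes them):

```shell
# Hypothetical property names, for illustration only
zpool get ashift zroot/raidz-0              # per-vdev ashift (#2024)
zpool set allocations=off zroot/mirror-0    # no new allocations, e.g. while degraded
zpool set serveread=off zroot/mirror-0/sda  # don't read from this member unless needed (#1742)
```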