openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Support linear zpool #16754

Open likan999 opened 1 week ago

likan999 commented 1 week ago

Describe the feature you would like to see added to OpenZFS

I'd like to see ZFS support marking group(s) of vdevs as linear. A group of linear vdevs is considered to be on the same physical device, so ZFS should avoid striping when writing to them.

To implement this feature, no change needs to be made to the on-disk format, except for adding some pool-level metadata that remembers these groups. At allocation time, treat all vdevs in a group as a single, linear space; or alternatively, write a sufficiently large amount of data to one vdev before moving to the next. The latter approach sounds similar to setting metaslab_aliquot to a large value, but that module parameter is global, and I'd like this to apply only within the group, or at least on a per-pool basis.
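
For reference, a minimal sketch of the global tunable mentioned above; the value is illustrative only:

```sh
# metaslab_aliquot is a zfs module parameter: roughly, how much data is
# written to one top-level vdev before the allocator rotates to the next.
# It applies to every pool on the system, not per pool or per vdev group.
echo $((1024 * 1024 * 1024)) > /sys/module/zfs/parameters/metaslab_aliquot
```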

How will this feature improve OpenZFS?

I believe this will enable some very useful and interesting use cases.

  1. Support disks of different sizes. For home users this is a very common situation, and currently only unRAID supports it well, but unRAID is not free software and lacks ZFS features such as snapshots. If ZFS could handle this, there would be no need to use unRAID.

To see how this feature allows supporting disks of different sizes, let's assume there are four disks: A is 6TB, B is 10TB, and C and D are 12TB. This is not uncommon in a homelab setup, where we add new drives incrementally and the ones on the market get bigger over time.

We can then divide B into 6TB + 4TB, and C and D into 6TB + 4TB + 2TB each. Now we can form a zpool with three top-level vdevs: 6TB x 4 as raidz, 4TB x 3 as raidz, and 2TB x 2 as mirror (see the sketch after this list). Without this feature it works, but performance suffers: writes are striped across the three vdevs, so a sequential write constantly seeks between them, which is slow and wears out the disks. With this feature, the layout would work well.

  2. Support online shrinking of a raidz pool (not perfect, but still better than the send/receive approach).

Imagine we have four 10TB disks. Instead of setting them up as one raidz vdev (10TB x 3 usable), we can divide each disk into ten 1TB partitions and create ten linear top-level raidz vdevs, each 1TB x 3 usable. By marking them as linear, we can later shrink the pool to 9TB x 3 by removing the last vdev, to 8TB x 3 by removing the last two vdevs, and so on. We can even get something like 7.5TB x 3 by removing the last three vdevs and then adding 0.5TB x 3.
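
To make use case 1 concrete, here is a minimal sketch of how the example layout above could be created today; the pool name and partition names are placeholders, and -f is needed because mixing raidz and mirror top-level vdevs is normally rejected as a mismatched replication level. Use case 2 would be created the same way, with ten raidz vdevs built from ten 1TB partitions per disk.

```sh
# Three top-level vdevs over partitions of A (6TB), B (6TB + 4TB) and
# C/D (6TB + 4TB + 2TB each). Partition names are placeholders.
zpool create -f tank \
    raidz  A1 B1 C1 D1 \
    raidz  B2 C2 D2 \
    mirror C3 D3
```

The proposal is only about how allocations are spread across these vdevs; the creation syntax itself would not change.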

I would say these two use cases address important pain points of ZFS for many people, and if they were supported, many of those people would have chosen ZFS.

Additional context

I'm not familiar with ZFS internals, but if everyone else is busy, I'm happy to implement this and send a PR if someone can give me some guidance on how to start working on this project.

Harry-Chen commented 1 week ago

Seems that you can achieve this in both cases by simply not creating the "to-be-used-later" vdev until running out of space.

likan999 commented 1 week ago

Seems that you can achieve this in both cases by simply not creating the "to-be-used-later" vdev until running out of space.

Well, this puts a rather strong restriction on how the pool is used, and it requires either manual intervention when a disk runs out of space or writing monitoring programs to automate this.

More specifically, your proposed approach assumes files are only added, never modified or deleted. That is true for some people, but it is a strong restriction for most. If I already have more than one vdev and then mutate existing data, the new data will still be striped across different vdevs.

Also IIRC, zfs performs badly (e.g. fragmentation) when the space is close to full. So you can't wait for it to run out of space.

amotin commented 1 week ago

This feels like a hell for management and data recovery. I can think of several tunables (like metaslab_aliquot and spa_num_allocators) that could already achieve something towards the wanted behavior, but one thing neither they nor the proposed design would handle is having a shared I/O scheduler queue for all the vdevs belonging to the same physical disk.

likan999 commented 1 week ago

This feels like a hell for management and data recovery. I can think of several tunables (like metaslab_aliquot and spa_num_allocators) that could already achieve something towards the wanted behavior, but one thing neither they nor the proposed design would handle is having a shared I/O scheduler queue for all the vdevs belonging to the same physical disk.

Good point about the shared I/O scheduler queue. I hadn't thought about it, I appreciate you bringing it up, and it is a nice thing to have. However, it strengthens the point that this needs some support in ZFS rather than just the existing tunables.

Can you elaborate a little more on the management and data recovery concerns, and how this messes up ZFS internals?

For management, did you mean it makes the CLI more complicated? It could be as simple as a property on the vdev, where vdevs with the same value are in the same group (see the sketch below). Ordinary users wouldn't need to care about it, while advanced users could set it to enable the feature. This way there is no need to introduce any new subcommand to the CLI.
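
For illustration, one possible shape for this, building on the existing ability to set properties on vdevs; the property name, the group value, the pool name and the vdev names (raidz1-0, raidz1-1, mirror-2) are all placeholders, not an existing interface.

```sh
# Hypothetical: tag the three top-level vdevs from use case 1 as one linear
# group. Vdevs sharing the same value would be allocated linearly rather
# than striped across.
zpool set org.example:linear_group=group0 tank raidz1-0
zpool set org.example:linear_group=group0 tank raidz1-1
zpool set org.example:linear_group=group0 tank mirror-2
```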

If you mean it is difficult for a user to set up the zpool this way: for use case 1 it is a one-off thing and a much better management experience than what Harry suggested; for use case 2 there is only a little pain at setup time, while shrinking later is a simple zpool remove. That is better than send/receive, where you need to find sufficiently large temporary storage and worry about the data getting corrupted along the way.

Keep in mind this is for users who have these requirements, so they pay a little inconvenience for the benefits they get; for those who don't need it, it won't make any difference. Also, for both use cases, people could write reusable scripts or programs to automate the related operations.

For data recovery, there is no on-disk format change, so scrub/resilver/... are not affected, apart from the shared I/O scheduler you mentioned.

If you mean that putting vdevs on the same physical drive makes data loss more likely: nothing prevents people from doing that today, so it doesn't make things worse; on the flip side, it reduces disk seeks, which potentially makes the drives last longer. Both use cases I gave tolerate a single disk failure, which is no less safe than a normal raidz setup, so with a proper setup it wouldn't harm data safety. Of course, when doing a replace you now need to run multiple replace commands, but that's not a huge pain IMO, and again it can be automated.

In fact, making ZFS aware of the underlying physical drives opens up other opportunities as well, e.g. warning users if their setup can't tolerate a single disk failure. Also, in the future, copies=N could avoid putting copies on the same physical drive, which makes data safer, but that is out of scope here.

Again, I am happy to contribute this feature, but I need a thumbs-up before I really spend hours on it. I don't want to spend many hours on it only to get a rejection.

amotin commented 1 week ago

Can you elaborate a little more about the management and data recovery

When one of the disks fails, somebody will need to recreate exactly the same configuration to replace it. If the replacement disk happens to have a different size, since time has passed, I bet you'll want some even weirder topology on top of the existing one. If after all this you have some bad accident and need to contact a data rescue company, it will be quite an explanation of where to look for your data, especially a few years after you set it up. Don't complicate your life, it is already complicated enough.

Harry-Chen commented 1 week ago

Also IIRC, zfs performs badly (e.g. fragmentation) when the space is close to full. So you can't wait for it to run out of space.

I must say it remains true for your proposal, i.e. using vdevs in a pre-defined order.

likan999 commented 1 week ago

Can you elaborate a little more about the management and data recovery

When one of the disks fails, somebody will need to recreate exactly the same configuration to replace it. If the replacement disk happens to have a different size, since time has passed, I bet you'll want some even weirder topology on top of the existing one. If after all this you have some bad accident and need to contact a data rescue company, it will be quite an explanation of where to look for your data, especially a few years after you set it up. Don't complicate your life, it is already complicated enough.

Let's take use case 1 as an example. First of all, there is no need to discuss the case where the replacement disk is smaller, as that would break other zpool setups as well.

If the disk being replaced with a bigger one is already among the largest, C or D, then the additional space can't be used in the zpool. This is the same as in unRAID, so I don't see it as a big problem.

Let's assume disk A gets replaced with a bigger one; it could be 10TB or 12TB. For any size in between, e.g. 11TB, the extra 1TB can't be added to the pool, which is fine. Let's take A 6TB -> 12TB as an example; other cases, such as A -> 10TB or B -> 12TB, are similar.

Recall the original configuration is A: 6TB, B: 6TB+4TB, C&D: 6TB+4TB+2TB, giving a 6TBx4 raidz, a 4TBx3 raidz and a 2TBx2 mirror. After the replacement it becomes A: 6TB+4TB+2TB, B: 6TB+4TB, C&D: 6TB+4TB+2TB. All we need to do is run a replace to recover the 6TBx4 raidz, then rely on the raidz expansion feature that will be released in 2.3 to turn the 4TBx3 raidz into a 4TBx4 raidz, and the 2TBx2 mirror into a 2TBx3 raidz.

In this way, the topology does not get worse over time. If there is sufficient free space, we can even remove the 4TBx3 vdev, resize the partitions, and rely on autoexpand=on to pick up the new space, and the pool becomes a 10TBx4 raidz + 2TBx3 raidz, which is simpler.

All of the above steps can be automated, and it would be fairly simple to write scripts to do that and share them with the community. A sketch of the main commands for the A 6TB -> 12TB case follows.
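
A minimal sketch of the replace and expansion steps, assuming the raidz expansion syntax planned for 2.3 (attaching a new device to an existing raidz vdev); the pool name, vdev names and partition names are placeholders, and the mirror conversion and vdev removal steps are left out.

```sh
# Replace the failed 6TB member of the first raidz with the new disk's
# 6TB partition and let it resilver.
zpool replace tank A1-old A1-new

# With raidz expansion (2.3), widen the 4TBx3 raidz to 4TBx4 by attaching
# the new disk's 4TB partition to that top-level vdev.
zpool attach tank raidz1-1 A2-new

# Let the pool pick up additional space automatically after any later
# partition resizing.
zpool set autoexpand=on tank
```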

likan999 commented 1 week ago

Also IIRC, zfs performs badly (e.g. fragmentation) when the space is close to full. So you can't wait for it to run out of space.

I must say it remains true for your proposal, i.e. using vdevs in a pre-defined order.

Not really. With your suggestion, the allocator is forced to find space in small holes in the existing pool, which causes fragmentation; with more vdevs added beforehand, when the existing vdevs don't have sufficiently large contiguous free space, it can allocate from the other vdevs. As I said, it treats the group as a single continuous space, so it is not forced to completely fill up the first vdev before going to the second.

In my suggestion I also mentioned an alternative, which still writes evenly to the vdevs but with large stripes; this would also work and would not suffer from the fragmentation issue.

Harry-Chen commented 1 week ago

Not really. With your suggestion, the allocator is forced to find space in small holes in the existing pool, which causes fragmentation; with more vdevs added beforehand, when the existing vdevs don't have sufficiently large contiguous free space, it can allocate from the other vdevs.

Yes -- but there will eventually be a threshold (either hardcoded or configurable) to decide when the allocator starts to use a previously empty vdev. And once a vdev is in use, whether it was attached manually or via your linear scheme, online shrinking is not easy.

likan999 commented 1 week ago

Not really. With your suggestion, the allocator is forced to find space in small holes in the existing pool, which causes fragmentation; with more vdevs added beforehand, when the existing vdevs don't have sufficiently large contiguous free space, it can allocate from the other vdevs.

Yes -- but there will eventually be a threshold (either hardcoded or configurable) to decide when the allocator starts to use a previously empty vdev.

And once a vdev is in use, whether it was attached manually or via your linear scheme, online shrinking is not easy.

Well, if I fill up 90% of my 10TBx3 of space, I deserve the fragmentation; that's my problem. But I don't deserve the fragmentation when I only have 90% of 1TBx3 worth of data.

robn commented 1 week ago

I do enough recovery work on pools where people tried to be too clever, so I appreciate this:

This feels like a hell for management and data recovery. Don't complicate your life, it is already complicated enough.

That said, I think your original description lays out some of the moving parts required and provides a guide to getting here:

To implement this feature, no change needs to be made to the on-disk format, except for adding some pool-level metadata that remembers these groups. At allocation time, treat all vdevs in a group as a single, linear space; or alternatively, write a sufficiently large amount of data to one vdev before moving to the next. The latter approach sounds similar to setting metaslab_aliquot to a large value, but that module parameter is global, and I'd like this to apply only within the group, or at least on a per-pool basis.

Breaking this down, it seems like you want:

  1. a way to declare that a set of vdevs belong to the same group
  2. a way to set a policy that applies to all members of that group
  3. implementation of a metaslab_aliquot-like vdev group policy

1 and 2 are both wanted in other contexts. We can already set arbitrary properties on a vdev, but we haven't yet settled on how to do property inheritance. "Groups" are one possibility (I have a prototype of something similar, that I call "templates", but it's not ready to show and I wouldn't care if something different came along).

These facilities, built in good faith with the involvement of the rest of the community, will give you the tools you need to implement 3, and probably along the way you'll gain the knowledge you need to know whether or not it's something that makes sense, or should be done differently.

likan999 commented 6 days ago

Thanks, Rob, for your input.

  1. a way to declare that a set of vdevs belong to the same group
  2. a way to set a policy that applies to all members of that group
  3. implementation of a metaslab_aliquot-like vdev group policy

And according to Alexander, there is a fourth:

  4. a shared I/O scheduler for vdevs in the same group; this is optional and can be deferred to future versions.

1 and 2 are both wanted in other contexts. We can already set arbitrary properties on a vdev, but we haven't yet settled on how to do property inheritance. "Groups" are one possibility (I have a prototype of something similar, that I call "templates", but it's not ready to show and I wouldn't care if something different came along).

Could you give some examples of the "other contexts" you mentioned? The solution I proposed was very specific to my use cases, e.g. the property is only meaningful for top-level vdevs, so inheritance is not a problem. I'd like to hear the requirements from other contexts so we can come up with a uniform and future-proof approach that covers as many use cases as possible, to avoid adding more chaos to the project.