stratis-storage / stratisd

Easy to use local storage management for Linux.
https://stratis-storage.github.io
Mozilla Public License 2.0

Use XFS quotas to prevent out of space #3030

Open DemiMarie opened 2 years ago

DemiMarie commented 2 years ago

Using XFS quotas, it is possible to limit the amount of storage that an XFS filesystem can use. This can reduce the likelihood of the underlying thin volumes running out of space.
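
For concreteness, this is roughly what that looks like with XFS project quotas; a sketch, where the device path, mount point, project ID, and project name are all placeholders:

```sh
# Mount with project quota enforcement enabled.
mount -o prjquota /dev/stratis/pool/fs /mnt/fs

# Define a project rooted at the mount point and initialize it.
echo "42:/mnt/fs" >> /etc/projects
echo "myproj:42" >> /etc/projid
xfs_quota -x -c 'project -s myproj' /mnt/fs

# Cap the project at 10GiB, well below the filesystem's nominal size.
xfs_quota -x -c 'limit -p bhard=10g myproj' /mnt/fs

# Report usage against the limit.
xfs_quota -x -c 'report -p' /mnt/fs
```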

jbaublitz commented 2 years ago

Hi @DemiMarie, we've discussed this rather extensively. The problem here is snapshots. Even if we use XFS quotas, this does not handle the case where snapshots get involved. While quotas would solve the problem of writing to the filesystem, they do not solve the CoW problem when overwriting blocks shared between a filesystem and a snapshot. This could easily fill up a pool and quotas will not be able to detect/stop this. We are having discussions about how to best handle this long term, but for now, we've added a strict no-overprovisioning mode that has the drawback of less efficient snapshot storage, but guarantees that the thin pool will not run out of space. Have you taken a look at that? Does that not meet your requirements?
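
For reference, the mode is driven from the CLI roughly as follows; a sketch based on stratis-cli 3.x syntax, so please check `stratis pool --help` on your version:

```sh
# Create a pool with overprovisioning disallowed from the start.
stratis pool create --no-overprovision mypool /dev/sdb

# Or toggle the mode on an existing pool.
stratis pool overprovision mypool no
```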

DemiMarie commented 2 years ago

Have you taken a look at that? Does that not meet your requirements?

I have not, but I believe it does not. I would like there to be a distinction between read-only snapshots (which only use the space required for the snapshot) and writable snapshots (which might use up to the full space allotted for them). I would also like to be able to reserve some space for critical volumes, such as the root and home filesystems.

The reason for this complexity is that the use cases I have for Stratis are generally on desktop-class hardware operated by ordinary users, not sysadmins. They need something that can pop up a “You ran out of space! What would you like to do?” prompt, not something that requires proactive monitoring. “Grow the thin pool” will generally require purchasing new hardware, and the ability to do so is by no means guaranteed. In fact, I know of very few situations where “grow the thin pool” is actually a viable option, as it assumes that the user magically has some spare storage they are not using and have not allocated to the pool. Outside of large enterprises, I suspect this is actually rare.

This could easily fill up a pool and quotas will not be able to detect/stop this.

There are two solutions I can think of to this (admittedly annoying) problem. The first is to make XFS aware that the underlying storage is thinly provisioned, and the second is deduplication + compression.

jbaublitz commented 2 years ago

Have you taken a look at that? Does that not meet your requirements?

I have not, but I believe it does not. I would like there to be a distinction between read-only snapshots (which only use the space required for the snapshot) and writable snapshots (which might use up to the full space allotted for them). I would also like to be able to reserve some space for critical volumes, such as the root and home filesystems.

With dm-thin, my understanding is that read-only snapshots can actually take up the full space allotted to them if the thin device of which it is a snapshot is writable. Shared references can still be broken and require new allocations if either the snapshot or the source changes. Say, for example, there's a read-only snapshot of a writable thin device and they share 50% of their data. If the shared 50% of the writable thin device is changed, the shared blocks will diverge, and both the read-only snapshot and the writable thin device will then require enough storage for all allotted blocks in both. Unless you are enforcing that the origin of the snapshot is also read-only, this will underestimate the amount of storage needed, which seems dangerous.
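
To make the divergence concrete, here is a sketch using LVM's dm-thin frontend (volume group and LV names are made up):

```sh
# A 1GiB thin pool with a 500MiB thin volume in it.
lvcreate --type thin-pool -L 1G -n tpool vg0
lvcreate --thin -V 500M -n origin vg0/tpool

# Map ~400MiB of the origin, then snapshot it: all 400MiB are shared,
# so pool usage stays at ~400MiB.
dd if=/dev/urandom of=/dev/vg0/origin bs=1M count=400 conv=fsync
lvcreate -s -n snap vg0/origin

# Overwrite the origin: sharing breaks block by block, and pool usage
# climbs toward ~800MiB even though the snapshot never changed.
dd if=/dev/urandom of=/dev/vg0/origin bs=1M count=400 conv=fsync
lvs -a vg0    # watch Data% of tpool rise
```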

The reason for this complexity is that the use cases I have for Stratis are generally on desktop-class hardware operated by ordinary users, not sysadmins. They need something that can pop up a “You ran out of space! What would you like to do?” prompt, not something that requires proactive monitoring. “Grow the thin pool” will generally require purchasing new hardware, and the ability to do so is by no means guaranteed. In fact, I know of very few situations where “grow the thin pool” is actually a viable option, as it assumes that the user magically has some spare storage they are not using and have not allocated to the pool. Outside of large enterprises, I suspect this is actually rare.

I may be misunderstanding, but I think that's exactly the use case for no overprovisioning. We designed this specifically for desktop-relevant use cases like upgrading the operating system and having a snapshot from before the upgrade to roll back to if the upgrade fails. It removes the need for users to monitor thin pool usage and does not expect users to add more space when the thin pool fills up, because the worst-case scenario, where the filesystem happily accepts writes that the thin pool cannot accommodate, is prevented at the filesystem level with ENOSPC. Thin pool usage is bounded by the total size of all filesystems and snapshots, so physical space can be exhausted but not exceeded by the filesystem. We could even theoretically add a D-Bus signal to notify you when a filesystem has run out of space, so that you could listen for it and prompt the user for what they'd like to do. The major drawback is the strict requirement of having enough space for the filesystem and its snapshots to diverge 100%, even though full divergence is unlikely in many cases. However, this does provide you with benefits like not needing to monitor the thin pool's physical space usage, and protection against cases that could lead to corruption if not handled correctly.

This could easily fill up a pool and quotas will not be able to detect/stop this.

There are two solutions I can think of to this (admittedly annoying) problem. The first is to make XFS aware that the underlying storage is thinly provisioned, and the second is deduplication + compression.

We're discussing trying to implement this interface for the block layer and the filesystem to talk to each other upstream, but this will likely take a long time to implement as versions of this have been attempted before.

DemiMarie commented 2 years ago

Have you taken a look at that? Does that not meet your requirements?

I have not, but I believe it does not. I would like there to be a distinction between read-only snapshots (which only use the space required for the snapshot) and writable snapshots (which might use up to the full space allotted for them). I would also like to be able to reserve some space for critical volumes, such as the root and home filesystems.

With dm-thin, my understanding is that read-only snapshots can actually take up the full space allotted to them if the thin device of which it is a snapshot is writable. Shared references can still be broken and require new allocations if either the snapshot or the source changes. Say, for example, there's a read-only snapshot of a writable thin device and they share 50% of their data. If the shared 50% of the writable thin device is changed, the shared blocks will diverge, and both the read-only snapshot and the writable thin device will then require enough storage for all allotted blocks in both. Unless you are enforcing that the origin of the snapshot is also read-only, this will underestimate the amount of storage needed, which seems dangerous.

Because a read-only snapshot cannot change, it will never use more space than the origin volume was using at the time it was created. If one has a volume that could hold 30GiB but only has 5GiB currently mapped, then a read-only snapshot will never use more than 5GiB, even if the origin volume later uses the entire 30GiB and all sharing is broken. This is important for read-only snapshots of very sparse volumes, which is an extremely common situation (see below).

The reason for this complexity is that the use cases I have for Stratis are generally on desktop-class hardware operated by ordinary users, not sysadmins. They need something that can pop up a “You ran out of space! What would you like to do?” prompt, not something that requires proactive monitoring. “Grow the thin pool” will generally require purchasing new hardware, and the ability to do so is by no means guaranteed. In fact, I know of very few situations where “grow the thin pool” is actually a viable option, as it assumes that the user magically has some spare storage they are not using and have not allocated to the pool. Outside of large enterprises, I suspect this is actually rare.

I may be misunderstanding, but I think that's exactly the use case for no overprovisioning. We designed this specifically for desktop-relevant use cases like upgrading the operating system and having a snapshot from before the upgrade to roll back to if the upgrade fails. It removes the need for users to monitor thin pool usage and does not expect users to add more space when the thin pool fills up, because the worst-case scenario, where the filesystem happily accepts writes that the thin pool cannot accommodate, is prevented at the filesystem level with ENOSPC. Thin pool usage is bounded by the total size of all filesystems and snapshots, so physical space can be exhausted but not exceeded by the filesystem. We could even theoretically add a D-Bus signal to notify you when a filesystem has run out of space, so that you could listen for it and prompt the user for what they'd like to do. The major drawback is the strict requirement of having enough space for the filesystem and its snapshots to diverge 100%, even though full divergence is unlikely in many cases. However, this does provide you with benefits like not needing to monitor the thin pool's physical space usage, and protection against cases that could lead to corruption if not handled correctly.

There are a few problems with this approach:

This could easily fill up a pool and quotas will not be able to detect/stop this.

There are two solutions I can think of to this (admittedly annoying) problem. The first is to make XFS aware that the underlying storage is thinly provisioned, and the second is deduplication + compression.

We're discussing trying to implement this interface for the block layer and the filesystem to talk to each other upstream, but this will likely take a long time to implement as versions of this have been attempted before.

I hope this can be done, and that it will be suitable for exposing to untrusted VM guests over virtio and/or xen-blkback.

jbaublitz commented 2 years ago

Have you taken a look at that? Does that not meet your requirements?

I have not, but I believe it does not. I would like there to be a distinction between read-only snapshots (which only use the space required for the snapshot) and writable snapshots (which might use up to the full space allotted for them). I would also like to be able to reserve some space for critical volumes, such as the root and home filesystems.

With dm-thin, my understanding is that read-only snapshots can actually take up the full space allotted to them if the thin device of which it is a snapshot is writable. Shared references can still be broken and require new allocations if either the snapshot or the source changes. Say, for example, there's a read-only snapshot of a writable thin device and they share 50% of their data. If the shared 50% of the writable thin device is changed, the shared blocks will diverge, and both the read-only snapshot and the writable thin device will then require enough storage for all allotted blocks in both. Unless you are enforcing that the origin of the snapshot is also read-only, this will underestimate the amount of storage needed, which seems dangerous.

Because a read-only snapshot cannot change, it will never use more space than the origin volume was using at the time it was created. If one has a volume that could hold 30GiB but only has 5GiB currently mapped, then a read-only snapshot will never use more than 5GiB, even if the origin volume later uses the entire 30GiB and all sharing is broken. This is important for read-only snapshots of very sparse volumes, which is an extremely common situation (see below).

It appears you're talking about a very specific case: sparse volumes. You're right that no additional space can be mapped in a read-only snapshot, so whatever is mapped in the sparse volume is the maximum capacity, not the total size of the volume. But I think I'm actually talking about a different case that we would need to handle if we added support, because we need to make sure it works in the general case too. You originally said that read-only snapshots only require the space needed for the snapshot. I think what you're getting at (correct me if I'm wrong) is that you'd like a special case for sparse volumes.

In the case where the snapshot is read-only and fully mapped, our approach is the only way to really enforce that filesystem/snapshot usage doesn't exceed the physical space available. Say all sectors are mapped and shared in a read-only snapshot. Once that sharing is broken, it doesn't matter that the snapshot is read-only; the thin pool will still require enough space to hold all of the space for the read-only snapshot and all of the space for the diverging writable thin device. This is the case we handle now. I think you're specifically talking about read-only snapshots of sparse volumes, which is a case we do not keep track of. If this is something you're interested in, I can discuss it with the team. We currently support neither read-only snapshots nor sparse-volume mapping detection, and perhaps that could be beneficial. I'm happy to look into it at the very least!
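
For what it's worth, the mapped size of a thin device is at least observable from the kernel today; a sketch, reusing the made-up names from the LVM example above:

```sh
# For a dm-thin target, `dmsetup status` reports:
#   <start> <length> thin <mapped sectors> <highest mapped sector>
dmsetup status vg0-origin
# e.g. "0 1024000 thin 819200 1023999" -> 819200 sectors (~400MiB) mapped
```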

The reason for this complexity is that the use cases I have for Stratis are generally on desktop-class hardware operated by ordinary users, not sysadmins. They need something that can pop up a “You ran out of space! What would you like to do?” prompt, not something that requires proactive monitoring. “Grow the thin pool” will generally require purchasing new hardware, and the ability to do so is by no means guaranteed. In fact, I know of very few situations where “grow the thin pool” is actually a viable option, as it assumes that the user magically has some spare storage they are not using and have not allocated to the pool. Outside of large enterprises, I suspect this is actually rare.

I may be misunderstanding, but I think that's exactly the use case for no overprovisioning. We designed this specifically for desktop-relevant use cases like upgrading the operating system and having a snapshot from before the upgrade to roll back to if the upgrade fails. It removes the need for users to monitor thin pool usage and does not expect users to add more space when the thin pool fills up, because the worst-case scenario, where the filesystem happily accepts writes that the thin pool cannot accommodate, is prevented at the filesystem level with ENOSPC. Thin pool usage is bounded by the total size of all filesystems and snapshots, so physical space can be exhausted but not exceeded by the filesystem. We could even theoretically add a D-Bus signal to notify you when a filesystem has run out of space, so that you could listen for it and prompt the user for what they'd like to do. The major drawback is the strict requirement of having enough space for the filesystem and its snapshots to diverge 100%, even though full divergence is unlikely in many cases. However, this does provide you with benefits like not needing to monitor the thin pool's physical space usage, and protection against cases that could lead to corruption if not handled correctly.

There are a few problems with this approach:

* XFS cannot (currently) be shrunk.  Therefore, if a volume needs a large amount of space _now_, but then frees up most of its space, the space it frees is effectively leaked.

This seems like the most challenging part to address, and I don't think we could necessarily resolve it without XFS shrink support. I'll discuss this with the team! This seems like something we'd like to address somehow.

* XFS does not have good support for large amounts of growth.  A filesystem that is created at 5GiB will not perform well if it is later extended to 1TiB.  Quotas could allow for creating the filesystem at a larger initial size, while ensuring it will not actually use all of the addressable space.

We're aware of this limitation with XFS, but it seems a little bit like a circular requirement. You've stated that you're targeting desktop use, so there's no additional space to allocate to the pool. With our no-overprovisioning mode, I don't think it really matters that the filesystem starts at a small size: based on your requirements, there isn't additional space to give the thin pool to let it grow in the future, so the filesystem will never grow and you'd never hit this performance issue. If the filesystem size is bounded by the physical size available on the thin pool and that doesn't change, I'm not really seeing how quotas are an improvement on this.

* In Qubes OS (which is where all of my experience with thin provisioning comes from), it is very common to have a huge number of sparse volumes that are rarely used.  If the full potential size of such volumes were counted against the storage quota, users would run out of space very quickly.  On my own system, around half of all VMs use less than a fifth of the space assigned to them.  On the other hand, Qubes always creates a new volume on VM startup and commits it after shutdown, so it would be easy for these volumes (except for critical ones) to be marked read-only.

Yeah, this is definitely a limitation of our snapshots. Having exclusively writable snapshots exposed as filesystems in stratisd is very useful for some of the use cases we're aiming for, but I can see how that would break down in your case. It would probably be a fair amount of work to support read-only snapshots and all of the handling we'd need for that case, but it seems like it could absolutely be helpful for you if we can address the other issues you have.
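
As a point of comparison, LVM's dm-thin frontend can already create a snapshot that is read-only from birth, which is roughly the primitive being asked for at the Stratis level (names are again made up):

```sh
# Thin snapshot created with read-only permission.
lvcreate -s vg0/origin -n ro-snap --permission r
```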

This could easily fill up a pool and quotas will not be able to detect/stop this.

There are two solutions I can think of to this (admittedly annoying) problem. The first is to make XFS aware that the underlying storage is thinly provisioned, and the second is deduplication + compression.

We're discussing trying to implement this interface for the block layer and the filesystem to talk to each other upstream, but this will likely take a long time to implement as versions of this have been attempted before.

I hope this can be done, and that it will be suitable for exposing to untrusted VM guests over virtio and/or xen-blkback.

DemiMarie commented 2 years ago

* XFS does not have good support for large amounts of growth. A filesystem that is created at 5GiB will not perform well if it is later extended to 1TiB. Quotas could allow for creating the filesystem at a larger initial size, while ensuring it will not actually use all of the addressable space.

We're aware of this limitation with XFS, but it seems a little bit like a circular requirement. You've stated that you're targeting desktop use, so there's no additional space to allocate to the pool. With our no-overprovisioning mode, I don't think it really matters that the filesystem starts at a small size: based on your requirements, there isn't additional space to give the thin pool to let it grow in the future, so the filesystem will never grow and you'd never hit this performance issue. If the filesystem size is bounded by the physical size available on the thin pool and that doesn't change, I'm not really seeing how quotas are an improvement on this.

Quotas allow creating the volume as highly sparse, with a filesystem size much greater than the volume size.
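
In other words, the sketch would be something like the following, with placeholder paths and illustrative sizes (project setup as in the earlier xfs_quota example):

```sh
# Make the filesystem large up front, so the problematic big grow
# never happens, but thinly backed so no physical space is committed.
mkfs.xfs /dev/stratis/pool/fs
mount -o prjquota /dev/stratis/pool/fs /mnt/fs

# Hold actual usage to 10GiB for now...
xfs_quota -x -c 'limit -p bhard=10g myproj' /mnt/fs

# ...and later raise the cap instead of growing the filesystem.
xfs_quota -x -c 'limit -p bhard=200g myproj' /mnt/fs
```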

DemiMarie commented 2 years ago

This seems like the most challenging part to address, and I don't think we could necessarily resolve it without XFS shrink support. I'll discuss this with the team! This seems like something we'd like to address somehow.

Could this be solved by adding XFS shrink support? That seems like the simplest and best option.

jbaublitz commented 2 years ago

I apologize for the delay. I've put some thought into this, and while XFS quotas are potentially a valuable feature that we'll consider, they cannot solve the problem of overprovisioning. I've discussed this with @bmr-cymru and the XFS team, and quotas cannot handle the case where snapshots are in use. CoW can still fill up the thin pool and quotas will not prevent this. I've talked with the XFS/devicemapper teams about an interface which could allow XFS to be aware of how much space the block layer has available, but there has been a lot of discussion about the difficult corner cases in implementing this so it will likely take a while. Until this is accomplished, quotas are more of a band-aid on the problem, unfortunately.

Quotas allow creating the volume as highly sparse, with a filesystem size much greater than the volume size.

I understand why someone would want this if they are going to add more physical storage later. That allows them to grow the filesystem to a larger final size when more physical storage is added. Creating it at a smaller size when the final size will be much larger can cause performance issues. However, you're targeting a desktop use case where more physical storage won't be added, so I don't see how this helps you and your use case. Do you see what I'm getting at?

Could this be solved by adding XFS shrink support? That seems like the simplest and best option.

I know XFS has worked on shrink support in the past, but there appear to be other things that XFS is attending to right now that are more urgent. You can definitely let them know that you'd want this feature, but I reached out to them, and it seems like it's unlikely to be implemented soon.

One thing we've discussed and would be happy to implement is the ability to prevent filesystems from growing past a certain point temporarily, either through quotas or just through our handling in stratisd. This seems like it may address some of what you want, but we are still having conversations about how to accomplish the longer term solution.

DemiMarie commented 2 years ago

Quotas allow creating the volume as highly sparse, with a filesystem size much greater than the volume size.

I understand why someone would want this if they are going to add more physical storage later. That allows them to grow the filesystem to a larger final size when more physical storage is added. Creating it at a smaller size when the final size will be much larger can cause performance issues. However, you're targeting a desktop use case where more physical storage won't be added, so I don't see how this helps you and your use case. Do you see what I'm getting at?

I am referring to the case where the user has (say) 1TiB of storage, but creates the volume at only 10GiB because that is all they need right now. Over time, the volume gets resized more and more until eventually it uses (say) 200GiB and there are performance problems.
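
That is, the workflow being described is something like this (names illustrative; the first step repeats many times over the volume's life):

```sh
# Grow the thin volume a little each time more space is needed...
lvextend -L +10G vg0/vol
# ...and grow XFS into it. The allocation-group geometry chosen by
# mkfs.xfs for the original 10GiB is kept, and fits 200GiB poorly.
xfs_growfs /mnt/vol
```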

I've talked with the XFS/devicemapper teams about an interface which could allow XFS to be aware of how much space the block layer has available, but there has been a lot of discussion about the difficult corner cases in implementing this so it will likely take a while.

That would be awesome! Is there anywhere I can find this discussion?

bmr-cymru commented 2 years ago

There have been a few different discussions over several years - there was a new thread just last week on dm-devel that's discussing changes to fallocate and the provisioning primitives available when working with thinly provisioned storage.

Although it's at an early stage (and there has been quite a bit of discussion so far between various block and fs folks), this is the sort of work that may eventually lead to improved integration between thinly provisioned block devices and the filesystem layer.
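
As a rough illustration of the gap being discussed there: fallocate(2) can reserve blocks within the filesystem today, but nothing guarantees the thin pool underneath actually has them (path is a placeholder):

```sh
# Reserved in XFS's accounting, but not provisioned in the thin pool;
# a later write to these blocks can still run the pool out of space.
fallocate -l 1G /mnt/fs/reserved
```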

jbaublitz commented 2 years ago

Quotas allow creating the volume as highly sparse, with a filesystem size much greater than the volume size.

I understand why someone would want this if they are going to add more physical storage later. That allows them to grow the filesystem to a larger final size when more physical storage is added. Creating it at a smaller size when the final size will be much larger can cause performance issues. However, you're targeting a desktop use case where more physical storage won't be added, so I don't see how this helps you and your use case. Do you see what I'm getting at?

I am referring to the case where the user has (say) 1TiB of storage, but creates the volume at only 10GiB because that is all they need right now. Over time, the volume gets resized more and more until eventually it uses (say) 200GiB and there are performance problems.

I want to make sure I'm understanding your concern appropriately. Are you saying in the no overprovisioning case, you would have to accurately guess your eventual data usage (which can be challenging) to avoid running into performance problems?

DemiMarie commented 2 years ago

I want to make sure I'm understanding your concern appropriately. Are you saying in the no overprovisioning case, you would have to accurately guess your eventual data usage (which can be challenging) to avoid running into performance problems?

That is correct.

jbaublitz commented 2 years ago

After discussing this with the team, we are going to draft up some ideas for how quotas could potentially be helpful in the case where overprovisioning is enabled. It seems like without XFS shrink, disabling overprovisioning probably won't suit your use case, so I can see quotas being a helpful tool. Just please be aware that this will not help with out-of-space issues in the snapshot case. That will likely have to be addressed by the discussion @bmr-cymru linked to.

I can see quotas potentially being helpful in the same way partitions can be helpful, with the added flexibility of being able to easily change the quota. For example, they will not be able to prevent CoW from filling up the thin pool, but they can stop a runaway process from eating up all of the space on the thin pool.
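
A sketch of that second case, with the project quota from the earlier example in place:

```sh
# A runaway writer is stopped at the quota with EDQUOT, long before
# it can exhaust the thin pool (output line is illustrative).
dd if=/dev/zero of=/mnt/fs/runaway bs=1M
# dd: error writing '/mnt/fs/runaway': Disk quota exceeded
```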

Does that sound like it could be helpful for you?

DemiMarie commented 2 years ago

Does that sound like it could be helpful for you?

Yes, it does.

Just please be aware that this will not help with out-of-space issues in the snapshot case.

Some form of quota support at the thin pool level would be amazing, not least because of raw block device support.

MrPippin66 commented 1 year ago

Could this be solved by adding XFS shrink support? That seems like the simplest and best option.

I know XFS has worked on shrink support in the past, but there appear to be other things that XFS is attending to right now that are more urgent. You can definitely let them know that you'd want this feature, but I reached out to them, and it seems like it's unlikely to be implemented soon.

Is there a GitHub repo for XFS for requesting "shrink" support, or just a mailing list?

I'll jump on the bandwagon for shrink support. It's a semi-major deficiency, since ext4 has this (though not supported online).

jbaublitz commented 1 year ago

@MrPippin66 I would recommend talking to them on the mailing list.