openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Interblock Efficiency Techniques #15685

Open NicholasRush opened 6 months ago

NicholasRush commented 6 months ago

For efficient data storage, ZFS should implement interblock storage efficiency techniques such as compression, deduplication, data compaction and zero detection.

Interblock level means that these techniques do not work at the recordsize or volblocksize level.

Data compaction: This technique combines multiple physical blocks that are not fully used into one physical block. For example, you have blocks like 4x 1KB that would normally occupy four separate 4KB blocks on physical media; during the write they would be combined to fit into one physical 4KB block (a rough sketch of this follows after the data path below).

Deduplication: Dedup is implemented in ZFS, but it works at the stripe width, i.e. the recordsize or volblocksize. From a space-saving perspective this is far from efficient, because deduplication works best with small blocks, and limiting the recordsize (stripe size) to 4K is not an option because it wastes space.

Compression: Compression works very well in ZFS; like deduplication, it relies on the recordsize or volblocksize (stripe size) to compress data. The problem is the same as with deduplication: if you want a good compression ratio, you need a large stripe size, but that interferes with the current deduplication implementation. Compression should also work with variable compression group sizes to maximize data savings on disk.

Zero detection: Zero detection would also be a great enhancement, because ZFS should not write data to disk that is empty. This would be a big improvement when working with VM disks or other sparse files that contain many zeros to reserve space.

The data path with inline compression and deduplication: Incoming Data ---> detect zeros and strip them ---> Compression ---> Deduplication ---> Compaction ---> Build Stripe with Checksum ---> Disk Write
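A minimal sketch of what I mean by compaction, just to illustrate the idea (this is not existing ZFS code; the compact_slot and compact_block names are invented): several sub-4K chunks get packed into one physical block, and a small slot directory records where each chunk starts so that it can be found again.

```c
#include <stdint.h>
#include <string.h>

#define PHYS_BLOCK 4096		/* physical block size, e.g. ashift=12 */
#define MAX_SLOTS  8

/* Hypothetical slot directory: where each small logical chunk lives
 * inside the shared physical block. Names are invented for this sketch. */
typedef struct compact_slot {
	uint16_t offset;	/* byte offset within the physical block */
	uint16_t length;	/* length of the packed chunk */
} compact_slot_t;

typedef struct compact_block {
	uint8_t        data[PHYS_BLOCK];	/* one physical 4K block being filled */
	compact_slot_t slots[MAX_SLOTS];	/* directory of packed chunks */
	int            nslots;
	uint16_t       used;			/* bytes consumed so far */
} compact_block_t;

/* Try to pack a small chunk; returns the slot index, or -1 if it does not
 * fit, in which case the caller writes it out as a normal block. */
static int
compact_pack(compact_block_t *cb, const void *buf, uint16_t len)
{
	if (cb->nslots >= MAX_SLOTS || cb->used + len > PHYS_BLOCK)
		return (-1);
	memcpy(cb->data + cb->used, buf, len);
	cb->slots[cb->nslots].offset = cb->used;
	cb->slots[cb->nslots].length = len;
	cb->used += len;
	return (cb->nslots++);
}
```

Four 1KB chunks would then occupy one physical block instead of four; the slot directory is the extra metadata that has to be tracked.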

All the techniques I have described here could be paired with dynamic data analytics in the write path that scans the data for compressible or deduplicable content and monitors the CPU usage of ZFS. If the in-RAM analysis decides that incoming data is not compressible or deduplicable within a given amount of time, the data should be written to disk without deduplication or compression.

Finally, these techniques could be even more effective if ZFS supported internal storage tiering, so that data could later be recompressed with a different, stronger algorithm based on its access watermark when the system is not busy. But that would require block pointer rewrite.


To get the most out of deduplication and zero detection, it is necessary to have a separate deduplication table per dataset and zvol. Every dataset and zvol should work like a standalone zpool, which requires a new abstraction layer in ZFS. In conclusion, if these techniques were implemented in ZFS, the on-disk format would change. Migrating to the new format would require a new pool and a zfs send and receive; an in-place migration is not possible.

But there are currently many features, implemented or on the way, that are not solved very well, like block cloning, raidz expansion or encryption. And every new feature needs fixes in existing features, because they all have to work with each other.

Maybe it could be a chance for OpenZFS version 4.0 to get all these features natively in a completely rewritten and rethought codebase that has nothing more to do with the initial ZFS version 28 from Sun. That would also be a chance to get away from the license of the codebase and to implement block pointer rewrite from the ground up. The only thing ZFS 4.0 would then share with the initial version 28 is the name, nothing more.

amotin commented 6 months ago

@NicholasRush Could you please focus on something achievable in this lifetime, rather than proclaiming revolutionary theses "for everything good, against everything bad"?

ZFS already does zero detection as the first pass of the compression stage, so you can mark that as done. Most of the rest of your points sound unrealistic in a random-write context: you cannot compress/dedup/compact multiple small blocks together and then be able to randomly access/overwrite them. And if you are not going to randomly overwrite them, then just increase your recordsize/volblocksize and close this topic.
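For reference, the check itself is conceptually trivial (this is a simplified illustration, not the actual OpenZFS code path): if a block is entirely zero, no data block is allocated at all and the block pointer simply describes a hole.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified illustration only, not the actual OpenZFS implementation. */
static bool
block_is_all_zeros(const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (buf[i] != 0)
			return (false);
	}
	return (true);
}
```

With compression enabled this happens before any allocation, which is why the zero-detection point can be considered done.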

NicholasRush commented 6 months ago

@amotin There isn't anything unrealistic in my feature request. The things below the line may be an unrealistic future, but what I describe above it can definitely be done.

It is a process that could make ZFS more efficient; it doesn't reflect the current state.

But the current state is, for me, like an old house that wants to look like a new house. That works on one side, but the foundation stays the same, and if you want to add a new bathroom, you have to rework all the parts you renovated before. The same goes for ZFS. Maybe you can reuse older parts of the ZFS on-disk format, but every feature becomes a problem if the on-disk format doesn't change from time to time. If you implement feature after feature and never rethink the whole thing you have built, reusing fields or functions that were used by other features in the past, you end up with software that consumes more and more development time. And you limit other functions that worked in the past, because they are part of the foundation.

Compaction, for example, is done at the physical layer, when the write allocator would otherwise write out a stripe with blocks that are not completely full on disks with a physical block size of 4K. Normally you have a lot of padding, because you have to fill up the physical block. That is where compaction comes in and combines these blocks to reduce the padding. Of course this needs to be reflected in the metadata. Compaction is not a form of compression: it combines blocks that would not completely fill up the physical block size, and this has to be referenced in the resulting metadata.

The process I have described for how compression and deduplication could work better is really a summary of a technical analysis of techniques that are in production today in the storage systems of the big players in the storage market. ZFS could be the best of them, but to get there it has to learn from its past and rethink its future.

The problem is that ZFS doesn't have the features that the big storage vendors have in their filesystems. ZFS is open source and those filesystems are not, but on the storage market ZFS has the fewest features. Those are the problems I am trying to solve with this feature request.

It's very sad that a feature request gets so much rejection. But you should actually be grateful that many people want to get involved in a project like this.

amotin commented 6 months ago

@NicholasRush I am not rejecting any ideas, merely trying to be realistic and apply my years of experience. Do you plan to work on these improvements personally? Have you thought them through in depth, or do you propose that somebody else do it, assuming nobody ever thought about it before?

"For example you have blocks like 4x 1KB" -- not a very good example. 1KB blocks may realistically appear only when storing enormous quantities of extremely small objects/files. I am sure there are some workloads of that kind, but I doubt much.

"because deduplication work at its best with small blocks and the limitation of the recordsize (stripe size) to 4k is no option" -- deduplication is extremely expensive on so small blocks. Deduplication works on the level of blocks since that is where checksums are calculated, pool space is allocated and later copy-on-write happen. How exactly do you propose to deduplicate several blocks if they appear non-consecutive on disks? How do you propose to handle modification of one of those blocks in one of duplicates, etc.

And generally ZFS becomes very slow and inefficient at small blocks. I would not recommend blocks smaller than at the very least 64KB for deduplication, or smaller than 16KB in general for anything. And at that level compression is already efficient without any multi-block magic.

Haravikk commented 5 months ago

Is what you're thinking of for compression something like what I've described in #13107?

ZVOLs can already see substantially better compression than the same files in a dataset, because a ZVOL operates in blocks of volblocksize so there's a minimum amount of data to compress, which usually means better savings, whereas files in a regular dataset only have a minimum of the physical block-size. So a large ZVOL volblocksize can give great compression, but it can also cost some performance, not to mention the overhead of a second filesystem on top.

My proposal was intended to describe a possible method for achieving the same thing for regular datasets by having small records stored together in larger "group records", but with similar trade-offs (you have to load the entire group record to read something within it, and you have to somehow retire group records if they end up with too many "holes", i.e. deleted sub-records).
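To make the shape of that a little more concrete, a group record might carry a small directory of sub-records, something like the sketch below (purely illustrative, all names invented for this comment):

```c
#include <stdint.h>

/* Purely illustrative sketch of a "group record": several small logical
 * records stored (and compressed) together, with a directory describing
 * where each sub-record sits inside the group's payload. */
typedef struct group_sub {
	uint64_t object;	/* which object/file the sub-record belongs to */
	uint64_t offset;	/* logical offset of the sub-record in that object */
	uint32_t goffset;	/* where it starts inside the group payload */
	uint32_t glength;	/* its length inside the payload */
	uint8_t  freed;		/* a "hole": the sub-record was deleted/overwritten */
} group_sub_t;

typedef struct group_record {
	uint32_t    nsubs;	/* number of directory entries */
	uint32_t    nfreed;	/* how many are holes; retire the group when too high */
	group_sub_t subs[32];	/* sub-record directory */
	/* compressed payload would follow on disk */
} group_record_t;
```

Reading any sub-record means loading (and decompressing) the whole group, and once nfreed crosses some threshold the surviving sub-records would have to be rewritten into a new group, which is exactly the trade-off mentioned above.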

It could definitely allow for improved storage efficiency with the right content, and mixed performance (with highly compressible data you could actually see a boost by reading/writing less from/to disk, but otherwise it could be a bit worse).

NicholasRush commented 5 months ago

@Haravikk

It's not exactly the same but a precursor to it.

What I mean exactly are techniques like these: https://community.netapp.com/t5/Tech-ONTAP-Articles/A-Look-Inside-NetApp-Inline-Data-Compaction/ta-p/122362

Haravikk commented 4 months ago

Ah, I see, so this is more about the block level (clue was in the name wasn't it, d'oh!).

That's interesting: while it technically still has the same fragmentation issue as doing it at a higher level, since ashift already limits the minimum size of writes (resulting in tiny unused gaps for very small files), the worst case would be that you end up with the same kind of unused gaps as before, so we can just ignore them until they're completely empty.

So the worst case is "same as now", compared to grouping/combining at the record level which can result in some losses to fragmentation over time without a way to detect and recreate the wasteful ones.

I do wonder how easy that would be to implement in ZFS, though, as I think it's pretty heavily built around the assumption of ashift being the smallest unit, and a lot of features operate on records rather than blocks, which could complicate things.

NicholasRush commented 4 months ago

@Haravikk Here is a spreadsheet where you can better see what it does and how it works.

https://www.flackbox.com/wp-content/uploads/2017/03/9a.webp

Here is the full article: https://www.flackbox.com/netapp-deduplication-compression

All of these technologies make the content of the written stripes more efficient. Data compaction only takes effect at the ashift disk layer.

There are many technologies in WAFL from which ZFS could only benefit.

Likewise adaptive and secondary compression, or, for ZFS, the variable grouping of chunks.

Deduplication would also benefit from this, because ZFS unfortunately only deduplicates the entire RAID stripe (recordsize or volblocksize in ZFS), which is not really efficient. Here, the path that NetApp took with data deduplication would be just as good for ZFS.

For ZFS, inline deduplication currently means that only data in main memory is compared and deduplicated, with a small amount of main memory reserved for a deduplication table holding checksums of previously written blocks. Unfortunately, ZFS does not support block pointer rewrite, so it is not possible to scan the data in the pool in the background for duplicate blocks and release them.
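As a simplified illustration of what such an in-memory table keyed by checksum could look like (my own sketch, not how the ZFS DDT is actually implemented):

```c
#include <stdint.h>
#include <string.h>

/* Simplified sketch of a small, fixed-size in-memory dedup table keyed by
 * block checksum; not how the ZFS DDT is actually implemented. */

#define DEDUP_SLOTS 4096

typedef struct dedup_entry {
	uint8_t  cksum[32];	/* e.g. SHA-256 of the written block */
	uint64_t dva;		/* location of the block already on disk */
	uint32_t refcount;	/* how many block pointers reference it */
	uint8_t  valid;
} dedup_entry_t;

static dedup_entry_t dedup_table[DEDUP_SLOTS];

/* Look up a checksum; on a hit, bump the refcount and return the existing
 * location so the new write can just reference it instead of allocating. */
static dedup_entry_t *
dedup_lookup(const uint8_t cksum[32])
{
	uint32_t slot;

	memcpy(&slot, cksum, sizeof (slot));	/* cheap hash: first 4 bytes */
	slot %= DEDUP_SLOTS;

	if (dedup_table[slot].valid &&
	    memcmp(dedup_table[slot].cksum, cksum, 32) == 0) {
		dedup_table[slot].refcount++;
		return (&dedup_table[slot]);
	}
	return (NULL);
}
```

A miss would insert the new checksum (evicting an old entry if necessary); and without block pointer rewrite, anything that falls out of this table simply means future duplicates of those blocks go undetected, because the pool cannot be rescanned and rewritten in the background.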

Haravikk commented 4 months ago

I read the article, but the problem is that ZFS's pointers currently only have precision down to the physical block (ashift) level, so they have no ability to address a location within a block, which would be required to reference records that were compacted. NetApp presumably was designed with, or changed to add, the ability to address these somehow?

To do this in a way that is "invisible" to the rest of ZFS, as you say "at the ashift layer", would mean building and maintaining some kind of table so that different logical blocks stored in the same physical block can be separated at that level. Not impossible, but it also needs to be stored in a way that maintains ZFS' guarantees on integrity, atomicity etc., and it would mean a table lookup for every single block accessed.

The other alternative I can see is adding some kind of optional offset to block pointers, so pointers for different logical blocks can point to the same physical block, but with different offsets for where their data starts within it. This way the data is stored within existing ZFS structures with all the usual guarantees etc.
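Very roughly, the pointer-side change might be as small as something like this (illustrative only, this is not the real blkptr_t layout):

```c
#include <stdint.h>

/* Illustrative only, not the real blkptr_t layout. The idea is just that a
 * pointer gains an intra-block offset, so several logical blocks can share
 * one physical (ashift-sized) block. */
typedef struct packed_ptr {
	uint64_t vdev;		/* which device */
	uint64_t phys_block;	/* physical block address (ashift granularity) */
	uint32_t intra_offset;	/* byte offset of this logical block's data */
	uint32_t length;	/* its length within the physical block */
} packed_ptr_t;

/* Two logical blocks packed into the same 4K physical block could then be:
 *   { .vdev = 0, .phys_block = 1000, .intra_offset = 0,    .length = 1024 }
 *   { .vdev = 0, .phys_block = 1000, .intra_offset = 1024, .length = 3072 }
 * and the physical block can only be freed once every pointer into it is
 * gone, which is the tracking problem described next. */
```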

But in both cases you also have the added problem of tracking which parts of which sub-divided blocks are in use, so they can be cleared only when all data within them is cleared, and re-used either only once they're completely empty, or somehow used for new sub-blocks that might fit within them. It's possible in the "offset" case reflinking (as for de-duplicated blocks) might be good enough to handle the "freed when empty" case?

What would really help is some kind of data on how much of a benefit this feature might actually provide. I did a quick search and couldn't find anyone discussing how much space you might expect to save as a result of NetApp's version of data compaction. Are you a NetApp user? It would be helpful to know what space a good sample filesystem occupies both with and without the feature enabled, as if it's only going to result in small savings here and there it may not be worth it.

Don't get me wrong, some of the same issues exist for my group record proposal, albeit implemented at a higher level, though it would in a way solve the same basic problem (as multiple records being compacted together would mean they are also, incidentally, compacted at the physical block layer). It's a lot easier to demonstrate the benefits of that as we have ZVOLs already: the same filesystem as a dataset and as a ZVOL, with volblocksize set to match recordsize and the same compression settings, will show the ZVOL achieving a much better compression ratio.

NicholasRush commented 4 months ago

Here I have picked out several articles on how WAFL handles the on-disk format.

https://www.usenix.org/system/files/fast19-kesavan.pdf
https://www.usenix.org/system/files/atc19-kesavan.pdf
https://www.usenix.org/system/files/conference/osdi16/osdi16-curtis-maury.pdf
https://www.usenix.org/system/files/conference/fast17/fast17-kesavan.pdf
https://www.usenix.org/system/files/conference/fast18/fast18-kesavan.pdf
https://www.usenix.org/system/files/conference/hotdep14/hotdep14-jaffer.pdf
https://www.netapp.com/media/23892-sw-WAFL.pdf

I have been using Data ONTAP with the WAFL file system in a professional environment for over 15 years. The changes needed in ZFS to enable more efficient data storage would be that, in addition to zvols and datasets, additional volume types such as a "Flexvol" and a "Flexdataset" could be added. I can't determine exactly how many layers would need to be added to ZFS to make this possible. However, when I look at the ZFS source code for the individual areas, each area seems to need adjustments to incorporate this technology.

In WAFL, block addressing works via PBNs and VBNs. The PBN (physical block number) references the physical block on the disk; the VBN (virtual block number) is the logical, virtual block. At least that's how I understood the documentation.
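As I understand it, that is essentially one level of indirection between what the file system references and where the data physically sits, roughly like this (my own simplified sketch of the concept, not WAFL's actual structures):

```c
#include <stdint.h>

/* Simplified sketch of the VBN -> PBN idea as I understand it from the
 * papers; these are not WAFL's actual data structures. */
typedef uint64_t vbn_t;		/* virtual block number used by the file system */
typedef uint64_t pbn_t;		/* physical block number on a specific disk */

typedef struct vbn_map_entry {
	vbn_t    vbn;
	pbn_t    pbn;
	uint16_t disk;		/* which disk in the aggregate */
} vbn_map_entry_t;

/* Because the file system only ever references VBNs, blocks can be moved,
 * compacted or rewritten on disk by updating this mapping, without touching
 * the file system's own block pointers. */
```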

The information about NetApp's storage guarantee is very close to reality. In an all-flash system where aggregate-wide deduplication is active, you can easily achieve a 4:1 ratio when storing virtual machines. This is currently not possible at all with ZFS. In addition, by design there is no configurable stripe size in WAFL aggregates, so no volblocksize or recordsize, as it is simply not needed. The stripe size in the WAFL file system is set automatically and is just as variable as in ZFS, but the smallest possible inode is 4K, and these 4K inodes are then written to the aggregate in a stripe of variable length, with 4K blocks per disk. It's not wrong to do it that way either; conceptually it is better solved than in ZFS.

There is also an explanation of how this was implemented in WAFL for RAID-DP:

Expandable Arrays: [...] When we underpopulate an array, we are taking advantage of the fact that given fewer than p − 1 data disks, we could fill the remainder of the array with unused disks that contain only zeros. [...] This allows us to expand the array later by adding a zero-filled disk, and adjusting parity as we later write data to that disk.

Source: https://www.usenix.org/legacy/events/fast04/tech/corbett/corbett.pdf
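The reason this works is simply the arithmetic of parity: a disk that contains only zeros contributes nothing to the row parity, so it can be added without recalculating anything. A tiny illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Tiny illustration: XOR with zero is a no-op, so adding a zero-filled
 * disk leaves the existing row parity unchanged until real data is
 * written to the new disk (at which point parity is updated as usual). */
int
main(void)
{
	uint8_t d1 = 0xA5, d2 = 0x3C, d3 = 0x0F;	/* existing data disks */
	uint8_t p_old = d1 ^ d2 ^ d3;			/* current row parity */

	uint8_t d4 = 0x00;				/* newly added, zero-filled disk */
	uint8_t p_new = d1 ^ d2 ^ d3 ^ d4;		/* parity including the new disk */

	printf("parity unchanged: %s\n", p_old == p_new ? "yes" : "no");
	return (0);
}
```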

I know that at first it sounds like it has nothing to do with the format described above, but it is interesting that you can expand a RAID-DP array like this without having to rewrite all the data, unlike RAID-Z1/2/3. But it is also a reason why you don't need to specify a stripe size in WAFL.