Combined/Grouped Records to Maximise Compression

openzfs / zfs

OpenZFS on Linux and FreeBSD

https://openzfs.github.io/openzfs-docs

Other

10.68k stars 1.75k forks source link

Combined/Grouped Records to Maximise Compression #13107

Open Haravikk opened 2 years ago

Haravikk commented 2 years ago

Describe the feature would like to see added to OpenZFS

I would like to see the ability to store "combined" records, whereby multiple smaller records are stored as if they were a single larger one, in order to make more effective use of filesystem compression in a similar manner to ZVOLs.

How will this feature improve OpenZFS?

It will allow datasets to achieve higher compression ratios when storing smaller records that compress less well on their own, but compress well when treated as a single large "block".

It may also allow for improved performance in raidz configurations, as grouping records can result in a wider grouped record that can be more efficiently split across disks. This improvement would be most visible when dealing with smaller records that currently result in sub-optimal writes to these devices.

Additional context

The inspiration for this request came after I migrated the contents of a plain ZFS dataset to a ZVOL while attempting to debug an unrelated issue. While my compressratio on the plain dataset was an okay 1.19, on the ZVOL the exact same content achieved a ratio of nearly 2.0 with all else being equal.

This makes sense, as ZVOLs effectively have a large minimum "record" size owing to their volblocksize (in my case 128k), whereas in an ordinary dataset records can be as small as the minimum physical block size determined by the ashift value (4k in my case). Most compression algorithms can only achieve limited savings on smaller amounts of data, and they tend to work a lot better the more compressible data you can give them, and that is the case here.

I would propose that the feature works something like the following:

With an appropriate setting enabled (e.g- grouprecordsize=128K), ZFS will delay final writes to disk so that smaller records can be grouped together into a larger record of the specified size (up to, but no larger than recordsize). The final combined record size may still be smaller than the target, as it is only a best effort using available written data.
Another setting may determine whether this will apply to data, metadata or both (e.g- grouprecordtype=all|data|metadata). The default would be both, but this allows tuning for performance vs. size on metadata records.
Once enough small records are buffered they will be written out as a single combined/group record as if they were one larger record, and compressed accordingly as a group (rather than individually).
Metadata for the records will point to the location of the group record and an offset within it where the individual specific record can be found.
To read a single sub-record within a group, the entire group record is retrieved and decompressed so that the smaller record can be retrieved (same as happens for a small file within a ZVOL block, except that the awareness of the sub-record is handled by ZFS itself, rather than a secondary filesystem).
As old individual sub-records are freed this will create "holes" within group records. Once the amount freed exceeds some limit (e.g- grouprecordrewrite=64K) then all of its remaining individual records would be queued for writing as if they had been updated, allowing the old group record to be freed once a new one has been written out.

In essence the idea is to give ZVOL-like storage performance in ordinary datasets, with some of the same basic caveats; i.e- additional data needs to be read to access a single small record, and more data may need to be written. However, in the latter case by only rewriting when old records become too fragmented this should be less common than it is for ZVOLs so some of the reduced write performance would be mitigated.

While this feature should not be enabled by default for a variety of reasons, properly tuned its performance shouldn't be any worse than a ZVOL would be, and in some cases write performance should be better than a ZVOL, since there should be less unnecessary copying as a result of small writes (compared to an actual ZVOL where ZFS has no awareness of the contents), and there would be no need for an entire secondary filesystem to be involved.

In fact it may be possible to leverage existing zvol code to implement this, i.e- a grouped record would simply be a zvol block, and extracting data would function the same as reading only part of the block. The main differences would be that writing may prefer to create new blocks rather than updating old ones (though this is not a requirement) and there would need to be a special metadata flag or format so that metadata can be stored in zvol blocks, and reference locations within other blocks.

GregorKopka commented 9 months ago

Default recordsize (zfs internal logical file block size) for filesystems is 128k and ZFS already works exactly like you're asking for on reads or writes that are smaller than the recordsize of a file was created with.

The way compression works in zfs is identical between data stored in a file and a zvol, a filesystem dataset (containing only one file holding the same data as a zvol) showing a worse compression ratio is to be expected, as a filesystem has additional metadata (directory, permissions, modification time, ...) that is factored into the compression ratio of the whole dataset.

Please close as this asked for existing functionaliy.

Haravikk commented 9 months ago

Default recordsize (zfs internal logical file block size) for filesystems is 128k and ZFS already works exactly like you're asking for on reads or writes that are smaller than the recordsize of a file was created with.

No it doesn't; recordsize is the maximum size of a single record, smaller records can be stored at a smaller size but they are still single, individual records (either a complete file, or part of a single file), which is compressed and stored in isolation, there is no grouping or shared compression between them.

ZVOLs can have much greater compression gains because ZFS is almost always provided with volblocksize of data to compress since it has no awareness of what's inside (could be part of a single file, could be many small files).

So this is in no way existing functionality, except in the sense that ZVOLs already do this using an entire secondary filesystem on top (with additional inefficiencies of its own).

I think perhaps you're confusing the fact that records can be smaller than recordsize with them being grouped together; just because you could, for example, store four 32k records in the same space as a single 128k record, is not the same as grouping them as one record – ZFS still handles them as four separate records. As each of those 32k records was compressed, and encrypted, separately.

This proposal is that the four 32k records would be combined into a single 128k record, then compressed, encrypted etc. as a 128k unit. This means that to read one of them back requires loading the entire 128k (or smaller, after compression) record and extracting part of it, so it still involves an extra step, as a ZVOL does, but without an entirely separate filesystem on top, having to mount it separately etc., and with greater awareness of when a grouped record needs to be rewritten (as a ZVOL block may be full of holes that ZFS is unaware of, and can't do anything about, and is always written out as a complete new unit, whereas a grouped record wouldn't always need to be if only some of the sub-records changed).

GregorKopka commented 9 months ago

The logical block size for compression to work on is set using recordsize for filesystems (new files inherit this on creation) and volblocksize for zvols (at creation time of the volume, can not be changed afterwards). Only the last block of a file can be smaller than the recordsize inherited from the filesystem at file creation.

A partial record (smaller than recordsize) write to a file will do a read/modify/write cycle on the affected record, this is also true for the last (potentially only partly filled) record of a file.

Compression for data (regardless of it being stored in volumes or files) uses the same codepath. What you propose is already there.

Please read up on how ZFS stores data on-disk, https://www.giis.co.in/Zfs_ondiskformat.pdf could help, before beginning to think about in what way I might be confused.

Haravikk commented 9 months ago

A partial record (smaller than recordsize) write to a file will do a read/modify/write cycle on the affected record, this is also true for the last (potentially only partly filled) record of a file.

Again, this is not what the proposal is requesting; what you're describing is the read/modify/write cycle of a single discreet record (up to recordsize), i.e- when you access part of a file you read the corresponding record, but that's not what I'm asking for here at all. I'm not sure how else to explain it to make this clearer to you; I know how ZFS handles files.

I've already set out the reasoning for this proposal as clearly as I can in the proposal itself; volblocksize is an effective minimum record size (except in rare cases where the volume block isn't "complete" yet), whereas recordsize is a maximum record size, this is why the compression performance can vary, as ZVOLs containing a lot of smaller files are able to compress in much larger "blocks" as multiple files can be contained in a single volblocksize of space.

To try and make this clearer; let's say you set volblocksize to 1M, that means 1 megabyte of data will be stored in most volume blocks, while that data could be a 1 megabyte chunk from a single file (within the secondary filesystem), it could also be sixteen different 64 kilobyte files. If those same sixteen files were stored in a ZFS dataset they would each consist of a single 64 kilobyte record (plus metadata), and each record will be individually compressed, meaning the maximum amount of data available for the compression algorithm to work with is 64 kilobytes.

By comparison, those same sixteen 64 kilobyte files stored in a 1 megabyte volume block are compressed as a 1 megabyte chunk of data (since ZFS treats this as a single record) meaning the compression algorithm has up to 1 megabyte of data to work with. Since compression algorithms typically work better the more (compressible) data they receive, this will lead to much bigger savings. For example, if those 64 kilobyte files are all related XML files, there will be substantial compression savings to make around the structure text (XML tags) that the files have in common, which isn't possible when they're compressed individually.

Now this works inconsistently for volume blocks, since ZFS has no control over the contents; each block could contain large or small files, a mix of compressible and uncompressible data etc., but by creating "group" records ZFS has total control, so it could avoid grouping records for which there is no benefit to grouping, and larger records can be ignored entirely.

amotin commented 9 months ago

As I understand, you are asking for ability to store multiple unrelated logical blocks in one physical blocks. As result, multiple logical blocks would receive almost identical block pointers, including the same DVAs, checksums, etc, but some different offsets within that physical blocks. It does not look impossibly difficult for writing and reading, but becomes much more problematic on delete -- you'd need some reference counter, that should be possible to modify each time logical block is freed to free the physical block when all the logical blocks are freed. In many cases you may get many partial blocks that can not be freed, that would kill any space benefits you likely get from better compression. In case of partial logical block rewrite you would have to create a new physical blocks for it and leave a hole in the previous physical blocks, since you may not be able to rewrite the old physical block in place, and you can not update all the logical block pointers to the new location since ZFS does not have a back pointers to know what file(s) use the specific physical block.

Haravikk commented 9 months ago

It does not look impossibly difficult for writing and reading, but becomes much more problematic on delete -- you'd need some reference counter, that should be possible to modify each time logical block is freed to free the physical block when all the logical blocks are freed.

I suggested a grouprecordrewrite property in the proposal as a way to set when a group record should be re-created, basically if it contains more than a certain amount in freed space (though maybe a percentage would be better?).

It's much the same problem ZVOLs experience, except that in that case ZFS has no knowledge of the contents so doesn't know if a volume block is fully utilised, or is full of unused holes, as it only really knows when the entire block is freed (TRIM'ed?) at which point it can discard it (unless a snapshot references it). So as the ZVOL's own guest filesystem becomes fragmented ZFS suffers the same, it just doesn't know it.

In the group record case ZFS can at least be aware of the holes, so it can do something about it when they become wasteful, and we should be able to tune accordingly (make it more or less aggressive about replacing underutilised records.

In case of partial logical block rewrite you would have to create a new physical blocks for it and leave a hole in the previous physical blocks, since you may not be able to rewrite the old physical block in place

That's the idea, a new group record would be written out for any data in the old group record that's being retired, alongside any new writes taking place at the same time (basically the old sub-records are grouped with new ones to create a new group record).

Since ZFS currently doesn't support block pointer rewrite, this will likely mean recreating the individual records as well. In this sense this re-create/defragment operation behaves like a new copy of everything, and the fragmented old record(s) are tidied up as the references expire, same as if you copied manually.

Over time a lot of fragmentation could result in leftover group records holding a noticeable amount of space they don't need, but this is true of ZVOLs now, and is why I wouldn't suggest this become a feature enabled by default. It's intended more for datasets where content is known to contain a high volume of generally compressible small files (either exclusively, or in a broad mix).

There is also the possibility of adding defragmentation of group records to scrubs (similar to #15335) to perform rewriting of less fragmented group records periodically to reclaim space. I don't think it's critical that this be added immediately, but it's a good long term feature to clean up the holes.

GregorKopka commented 9 months ago

volblocksize is an effective minimum record size (except in rare cases where the volume block isn't "complete" yet),

No. volblocksize is a fixed logical record size for a volume. There can be, by definition, no "incomplete blocks" in a volume.

whereas recordsize is a maximum record size

No. recordsize is the fixed logical record size for all blocks of a file, sans the last one (which can be partly filled).

I know how ZFS handles files.

🤔

The functionality you ask for already exists, via backing whatever filesystem with a volume with a high volblocksize, which also has the benefit of neither needing BPR, nor more code (that does not grow on trees).

Haravikk commented 9 months ago

No. recordsize is the fixed logical record size for all blocks of a file, sans the last one (which can be partly filled).

…meaning it's a maximum record size, because anything that doesn't fill recordsize results in a smaller record. You know, like how a maximum size works in literally any situation?

Look, it's clear I lack the skills to explain this feature in a way you are capable of understanding, or more accurately willing to understand, as your response to being told you've misunderstood is to double down on being wrong.

This may seem harsh, but this is far from the first time you have done this, so you are no longer eligible for any benefit of the doubt from me, if you can't be bothered learning what the proposal is, then don't waste everybody's time responding to it.

GregorKopka commented 9 months ago

Your idea is to save on-disk space for storing loads of small files by squashing data from multiple, unrelated files together, in the unfounded hope that a bigger compression window might be able to find more redundancy.

To present why this would be a good idea you not only misrepresent the built-in defaults for recordsize and volblocksize to end up with a higher compression ratio for volumes, make irrational claims (like volumes not knowing what's inside of the data they store, "thus they're better at compression", which is implying that filesystems would know) and make up stuff like

volblocksize is an effective minimum record size (except in rare cases where the volume block isn't "complete" yet), whereas recordsize is a maximum record size,

on the way, completely ignoring corrective input from others while having the audacity to state that these are stupid and/or evil

Look, it's clear I lack the skills to explain this feature in a way you are capable of understanding, or more accurately willing to understand, as your response to being told you've misunderstood is to double down on being wrong.

(emphasis mine), while blissfully ignoring how ZFS structures and adresses data on-disk and guarantees data integrity (which would all need to be touched), to finally end up trying to patch one of the most obvious shortcomings of your idea to save space by suggesting this gem:

Since ZFS currently doesn't support block pointer rewrite, this will likely mean recreating the individual records as well. In this sense this re-create/defragment operation behaves like a new copy of everything, and the fragmented old record(s) are tidied up as the references expire, same as if you copied manually.

Which basically boils down to inflating the storage space required, which is the direct opposite of what you're trying to achieve.

🤦‍♂️

Back to the drawing board, please, feel free to come back when you have an idea that does not involve a full rewrite of DMU, SPA and dataset layer, ending in a backward incompatible on-disk format change, and requires the existence of BPR to not make things worse.

Haravikk commented 9 months ago

Your idea is to save on-disk space for storing loads of small files by squashing data from multiple, unrelated files together, in the unfounded hope that a bigger compression window might be able to find more redundancy.

It's not an "unfounded hope" it's a provable benefit; I have literally talked about why ZVOLs see greater compression gains than datasets do for the exact same files. This is not some theoretical fantasy I've plucked out of nowhere.

All of this is in the proposal you pretend to understand when you clearly still don't.

To present why this would be a good idea you not only misrepresent the built-in defaults for recordsize and volblocksize to end up with a higher compression ratio for volumes, make irrational claims (like volumes not knowing what's inside of the data they store, "thus they're better at compression", which is implying that filesystems would know) and make up stuff like

I have done no such thing, as I haven't "misrepresented" any of this; so now you not only don't understand the proposal, you're trying desperately to pretend that I've lied about basic concepts when I've done nothing of the sort.

I mentioned the recordsize as the maximum size of a record, because it is. This is a simple, basic fact in ZFS.
I mentioned the volblocksize is the effective minimum size of a "record" when compressing with ZVOLs, because it is. Again, another simple, basic fact.
I mentioned that ZFS lacks awareness of how a filesystem is storing files in a ZVOL, because it is. A ZVOL is a block device, ZFS is storing blocks, it doesn't know what those blocks represent, how data is structured within them etc., it only knows what it was asked to write and where. Again, another simple, basic fact.

You got caught having misunderstood the proposal, and instead of apologising, going back to give it another look, and coming back in good faith, you've gone out of your way to perform mental gymnastics in a desperately pathetic attempt to try and cover your mistake.

on the way, completely ignoring corrective input from others while having the audacity to state that these are stupid and/or evil

Your input wasn't "corrective"; you claimed the feature being described already exists in ZFS when it doesn't, you were completely and fundamentally wrong about everything you said, and even worse you demanded that I close the issue, so you have been hostile from your first comment.

And when I pointed out that you seemed to have misunderstood you became belligerent and insulting, exactly as you have on every other issue you've butted into without bothering to read it properly first.

(emphasis mine), while blissfully ignoring how ZFS structures and adresses data on-disk and guarantees data integrity (which would all need to be touched), to finally end up trying to patch one of the most obvious shortcomings of your idea to save space by suggesting this gem:

Now you're making mutually exclusive statements! Either I don't understand how ZFS stores files, or I do, you can't accuse me of both and expect both to stick.

I know full well how ZFS stores files, and I have been completely up-front about the limitations and requirements of the feature from the start, which again, you would know if you'd ever read it in the first place.

I set out how it could cope with fragmentation as sub-records are updated, and I set out the performance trade-offs this, and the feature in general, would entail. I have been completely up front with how it would need to operate, I have made no effort to hide it from anyone. So you can take your accusations of dishonesty and go fuck yourself with them.

My stated aim is literally to gain some of the incidental benefits of ZVOLs into datasets themselves, which means that comes with similar drawbacks. I've covered all of this already, which you would know if you'd read it.

Once again, if you can't even be bothered to read a proposal, then don't bother to reply to it.

GregorKopka commented 9 months ago

I mentioned the recordsize as the maximum size of a record, because it is. This is a simple, basic fact in ZFS.

I mentioned the volblocksize is the effective minimum size of a "record" when compressing with ZVOLs, because it is. Again, another simple, basic fact.

It's just that recordsize and volblocksize define the logical block sizes ZFS compression works on, there is no minimum or maximum about them except possibly the last, partly filled block of a file. Which is among the things I try to point out, for quite some time.

you were completely and fundamentally wrong about everything you said. ... You got caught having misunderstood the proposal, and instead of apologising, going back to give it another look, and coming back in good faith, you've gone out of your way to perform mental gymnastics in a desperately pathetic attempt to try and make me out to be some kind of malicious psychopath. ... So you can take your accusations of dishonesty and go fuck yourself with them. ... So once again; if you can't be bothered reading proposals, don't fucking reply to them with your bad-faith trolling bullshit.

A best-of of the ad hominem from the mail notification and the edited version above.

Haravikk commented 9 months ago

It's just that recordsize and volblocksize define the logical block sizes ZFS compression works on

Which is exactly what this proposal is about; volblocksize sets an effective minimum amount of compression data, and recordsize sets an effective maximum, because in the latter case smaller files produce smaller records and don't compress as well, compared to ZVOL blocks which always have volblocksize data to work with.

You know, exactly what I said in the first place, which you'd know if you'd read it.

A best-of of the ad hominem from the mail notification and the edited version above.

None of these are ad hominem, as I'm merely referring to what you yourself have done here. You have lied, twisted and accused, all to cover your initial mistake, which you refuse to acknowledge or apologise for.

Literally your first comment was to completely misunderstand what was asked for, insisting the feature exists when it doesn't, and to demand the issue be closed. I could have responded with hostility then, and am very much now of the opinion I should have, but instead I gave the benefit of the doubt, and pointed out that you had misunderstood.

Your response? To double down on false statements of increasing irrelevance to the topic in a desperate attempt to avoid admitting you were wrong, and to become increasingly insulting towards me with accusations of stupidity, misrepresenting things when I haven't, and of some kind of comprehensive conspiracy.

But if you'd like ad hominen; you are either a toxic malicious coward or a toxic feckless idiot. Feel free to print that out and frame it if you like, but whatever you do with it, make sure it involves fucking right off at the same time you worthless troll.