openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Content/file addressed zfs dedupe layer #10552

Open zenaan opened 4 years ago

zenaan commented 4 years ago

A zfs content-addressing layer, not too dissimilar to git's content addressing, would be a simple, hero-tier plugin enhancement for zfs filesystems.

It is simple because Git provides a well-proven, and simple, content-addressing layer design.

It is hero-tier for the hopefully obvious reason that a content-addressed filesystem is naturally deduplicating, with close to zero RAM overhead.

The basic design, as demonstrated by the many re-implementations of the Git content-storage model, is simple to implement, at least for a trivial implementation. In the case of ZFS, other features such as compression do not need to be added, since zfs already optionally compresses at the block layer.

To briefly hint at how content addressing works: the content of a file to be stored in the filesystem is "hashed", and that hash must be known to read the file back. To make this usable for the end user, a very simple map from "end-user filename" to "content hash" is maintained. This map could readily be a simple hierarchy of directories, say 4 levels deep, where each level is named by the next character of the hash and the bottom level contains the remaining characters of the file's hash.
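
For illustration only, here is a minimal C sketch of that hash-to-path layout, in the spirit of git's .git/objects directory; the root path, example hash, and function name are made up for the sketch, not anything ZFS provides:

```c
/* Sketch only: map a hex content hash onto a 4-level directory
 * hierarchy, git-objects style. All names and paths are illustrative. */
#include <stdio.h>
#include <string.h>

/* Build "<root>/a/b/c/d/<rest-of-hash>" using the first four hex
 * characters of the hash as directory levels. */
static int object_path(char *out, size_t outlen,
                       const char *root, const char *hex_hash)
{
    if (strlen(hex_hash) < 5)
        return -1;   /* need four shard chars plus a remainder */
    int n = snprintf(out, outlen, "%s/%c/%c/%c/%c/%s",
                     root, hex_hash[0], hex_hash[1],
                     hex_hash[2], hex_hash[3], hex_hash + 4);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

int main(void)
{
    char path[4096];
    /* Example hash; a real layer would derive it from the file content. */
    if (object_path(path, sizeof(path), "/tank/.objects",
                    "9f86d081884c7d659a2feaa0c55ad015") == 0)
        printf("%s\n", path);   /* /tank/.objects/9/f/8/6/d081... */
    return 0;
}
```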

To mostly avoid the extra hashing this might imply: since ZFS already hashes every block for checksum/verification, simply reuse those hashes, e.g. XOR all of a file's existing block checksums together to create the file's "content address" or "content hash" - there's simply no point adding a whole new layer of otherwise unneeded content hashing.
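
A rough sketch of that XOR folding follows; the struct below only mimics the shape of a 256-bit ZFS block checksum (it is not the real zio_cksum_t), and note that a plain XOR is order-insensitive, so a production design would probably also want to mix in block offsets:

```c
/* Sketch only: fold a file's per-block checksums into one 256-bit
 * "content address" by XOR, reusing hashes ZFS has already computed.
 * file_cksum_t just mimics the shape of a ZFS block checksum (four
 * 64-bit words); it is not the real ZFS type. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef struct file_cksum {
    uint64_t word[4];
} file_cksum_t;

static void file_content_hash(file_cksum_t *out,
                              const file_cksum_t *block_cksums,
                              size_t nblocks)
{
    for (int w = 0; w < 4; w++)
        out->word[w] = 0;
    for (size_t i = 0; i < nblocks; i++)
        for (int w = 0; w < 4; w++)
            out->word[w] ^= block_cksums[i].word[w];
    /* Caveat: XOR is commutative, so reordered or repeated blocks can
     * collide; a real design would likely also mix in block offsets. */
}

int main(void)
{
    /* A two-block file with made-up block checksums. */
    file_cksum_t blocks[2] = {
        { { 0x1111, 0x2222, 0x3333, 0x4444 } },
        { { 0xaaaa, 0xbbbb, 0xcccc, 0xdddd } },
    };
    file_cksum_t h;
    file_content_hash(&h, blocks, 2);
    printf("%016llx %016llx %016llx %016llx\n",
           (unsigned long long)h.word[0], (unsigned long long)h.word[1],
           (unsigned long long)h.word[2], (unsigned long long)h.word[3]);
    return 0;
}
```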

In most cases where it's useful, this may supplant the need for the existing zfs dedupe layer.

This zfs "git" backend would be a good first project for someone wanting to ease into C language programming and to get up to speed with zfs internals.

See also:

snajpa commented 4 years ago

Maybe I'm missing something, but in the case of Git, you are actually relying on the filesystem to resolve where a given hash is stored on disk, i.e. where the drive should read from to get the data.

If we transfer the concept down onto ZFS, it would still have to keep an index of the hashes, not dissimilar to the dedup tables already in existence.

Or have I missed something? Maybe I didn't get the basics right here...

zenaan commented 4 years ago

On Fri, Jul 10, 2020 at 03:28:35PM +0000, Pavel Snajdr wrote:

Maybe I'm missing something, but in the case of Git, you are actually relying on the filesystem to resolve where a given hash is stored on disk, i.e. where the drive should read from to get the data.

If we transfer the concept down onto ZFS, it would still have to keep an index of the hashes, not dissimilar to the dedup tables already in existence.

Or have I missed something? Maybe I didn't get the basics right here...

The correct analogy is this: you are relying on the filesystem for at least two steps (more like three):

There are plenty of links on the web about the design and implementation of git. I am a user of git, not one of its developers.

The extra layer/backend that git provides (content addressing) could be implemented on something other than a filesystem, such as a database, direct block storage (raw disk access), etc. It's just an extra lookup layer, and quite a simple one at that. You could also think of it like gluster or nfs or a fuse or other loopback filesystem running locally on top of a local filesystem - again, it's just an extra layer of indirection.
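
To make the "extra lookup layer" point concrete, here is a toy sketch using in-memory tables with made-up names and hashes: the first lookup maps a user-visible name to a content hash, the second maps the hash to the stored object, and two names holding identical content resolve to the same object:

```c
/* Toy sketch only: the content-addressing layer seen as two plain
 * lookups stacked on top of whatever actually stores the bytes.
 * The tables, names and hashes below are made up for illustration. */
#include <stdio.h>
#include <string.h>

struct ns_entry  { const char *name; const char *hash;  };  /* name -> hash   */
struct obj_entry { const char *hash; const char *where; };  /* hash -> object */

static const struct ns_entry ns[] = {
    { "/home/a/report.txt",         "9f86d081" },
    { "/home/b/copy-of-report.txt", "9f86d081" },  /* same content, same hash */
};
static const struct obj_entry objs[] = {
    { "9f86d081", "objects/9/f/86d081" },
};

/* Lookup 1: name -> hash; lookup 2: hash -> stored object. */
static const char *resolve(const char *name)
{
    for (size_t i = 0; i < sizeof(ns) / sizeof(ns[0]); i++) {
        if (strcmp(ns[i].name, name) != 0)
            continue;
        for (size_t j = 0; j < sizeof(objs) / sizeof(objs[0]); j++)
            if (strcmp(objs[j].hash, ns[i].hash) == 0)
                return objs[j].where;
    }
    return NULL;
}

int main(void)
{
    /* Both names resolve to the same stored object: dedupe for free. */
    printf("%s\n", resolve("/home/a/report.txt"));
    printf("%s\n", resolve("/home/b/copy-of-report.txt"));
    return 0;
}
```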

Good luck,

shodanshok commented 4 years ago

I'm not familiar with the internals of git, but if I understand your proposal correctly, it would dedup identical files only. While that is good in itself, a more general block-level dedup approach (the one currently implemented in ZFS) seems superior.

For fast & cheap block-level dedup, one need look no further than vdo. It would be great if ZFS implemented something similar.

For file-level dedup, I think the correct long-term solution would be to support reflink and let the user trigger dedup by simply searching for identical files and reflinking them.
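
For what it's worth, the reflink half of that already has a standard Linux interface: the FICLONE ioctl, which is what `cp --reflink=always` uses. A minimal sketch, assuming the underlying filesystem supports block cloning (minimal error handling):

```c
/* Minimal sketch: clone (reflink) src into dst via Linux's FICLONE
 * ioctl, the same call `cp --reflink=always` uses. Only works if the
 * underlying filesystem supports block cloning. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FICLONE */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0 || ioctl(dst, FICLONE, src) < 0) {
        perror("reflink");
        return 1;
    }
    close(src);
    close(dst);
    return 0;   /* dst now shares src's blocks until either is modified */
}
```

A user-space dedup tool along these lines would then hash files, group identical ones, and replace the duplicates with clones, taking care to swap them in atomically.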

zenaan commented 3 years ago

On Wed, Jul 15, 2020 at 12:56:22AM -0700, shodanshok wrote:

I'm not familiar with the internals of git, but if I understand your proposal correctly, it would dedup identical files only. While that is good in itself, a more general block-level dedup approach (the one currently implemented in ZFS) seems superior.

It depends on what is being stored. E.g. a VM hosting provider may deploy VMs as images (apparently very common), and then yes, you may well want block-level decomposition of the image, and deduplication such that only the changed blocks of each VM are treated as actually different.

This may always be a useful operating mode, so one can imagine a "file type dedupe" plugin which treats some files as plain files and subjects others to "block-level dedupe".

In this case, the git-style content-addressing design can still apply and may well still be a big win, and not only for "normal or small" files, which are treated as a single "object of content" for addressing purposes. The principle of the content-addressing layer being "similar to a symlink, only it's a $HASH of some sort" holds as an indirection layer: the extra lookup(s) that such an indirection layer necessarily (in any design) introduces are directory/link (HASH) lookups, and therefore do not need the RAM-heavy "all blocks at all times" lookup mechanism that the existing ZFS dedupe implementation imposes on all its users.
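
To sketch what that cheaper lookup could look like (everything here - the flat layout, paths and names - is hypothetical, not a proposed ZFS interface): the dedupe decision becomes a single existence check against the on-disk object store at write time, rather than a pool-wide per-block table held in RAM:

```c
/* Sketch only: the write path of a per-file content-addressed store.
 * The dedupe "lookup" is just an existence check on a hash-named path
 * (directory sharding omitted here), so nothing has to be pinned in
 * RAM. Paths and names are hypothetical, not a proposed ZFS interface. */
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if the content already existed (only the name->hash map
 * would need updating), 0 if a new object was written, -1 on error. */
static int store_file(const char *objroot, const char *hex_hash,
                      const void *data, size_t len)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/%s", objroot, hex_hash);

    if (access(path, F_OK) == 0)
        return 1;                        /* duplicate content */

    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t n = fwrite(data, 1, len, f);
    fclose(f);
    return n == len ? 0 : -1;
}

int main(void)
{
    const char msg[] = "hello";
    /* The hash is a made-up example; a real layer would derive it from
     * the block checksums as described above. */
    printf("%d\n", store_file("/tmp", "9f86d081.obj", msg, sizeof(msg)));
    printf("%d\n", store_file("/tmp", "9f86d081.obj", msg, sizeof(msg)));
    return 0;   /* the second call is deduplicated and prints 1 */
}
```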

In any case, such a "break these particular files down into per-block objects" plugin can be added later as an enhancement.

Further, it is quite conceivable that, if the simple per-file implementation is as effective and as RAM-inexpensive as this superficial view imagines, VM images, which are "just a bunch of files" at heart, may be able to be deployed literally as "just a bunch of files" rather than as an image. Put differently, with an appropriate file/IO plugin for your VM, a VM's files could exist as actual files on the host and therefore take advantage of the simple per-file HASH indirection layer. This would arguably provide better dedupe for small files, or file parts, which some filesystems would otherwise compress into blocks that vary over time.

I am unsure whether, in this situation, there are still circumstances which could benefit from "break a file down into blocks" dedupe, although theoretically it may still be useful.

For fast & cheap block-level dedup, one need look no further than vdo. It would be great if ZFS implemented something similar.

For file-level dedup, I think the correct long-term solution would be to support reflink and let the user trigger dedup by simply searching for identical files and reflinking them.

Indeed. It may be that the "best" implementation of reflinks is with a Git-styled "content addressing" layer.

This is an important point - there's no point doing content-addressed dedupe and NOT providing reflinks - Bonus! :D

jittygitty commented 2 years ago

@zenaan @shodanshok During my research for my issue #13349, I just came upon your discussion at https://zfsonlinux.topicbox.com/groups/zfs-discuss/Tfa22fbf65c5411f0 where @zenaan said: "Even more reason to use a content addressed map layer (on top of zfs' existing block etc backend) for dedupe :)

To make the content addresses require minimum cpu overhead, simply XOR the block checksums that zfs already must calculate, and voilà, "free" dedupe."

And I thought: hmm, what you said seems similar to my thoughts on leveraging Linux fiemap and the existing Linux reflink code to implement cp --reflink, offline dedupe, etc.
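
On the offline-dedupe side, Linux also exposes the FIDEDUPERANGE ioctl (available since kernel 4.5; it's what tools such as duperemove use), where the kernel verifies the two ranges match before sharing blocks. A minimal sketch, assuming the filesystem supports it:

```c
/* Sketch only: offline dedupe of two files' contents via Linux's
 * FIDEDUPERANGE ioctl. The kernel verifies the ranges really are
 * identical before sharing blocks; support depends on the filesystem's
 * reflink / block-cloning capabilities. Minimal error handling. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>   /* FIDEDUPERANGE, struct file_dedupe_range */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src-file> <dst-file>\n", argv[0]);
        return 1;
    }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_RDWR);
    struct stat st;
    if (src < 0 || dst < 0 || fstat(src, &st) < 0) {
        perror("open/stat");
        return 1;
    }

    struct file_dedupe_range *r =
        calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
    r->src_offset = 0;
    r->src_length = st.st_size;      /* try to dedupe the whole source */
    r->dest_count = 1;
    r->info[0].dest_fd = dst;
    r->info[0].dest_offset = 0;

    if (ioctl(src, FIDEDUPERANGE, r) < 0) {
        perror("FIDEDUPERANGE");
        return 1;
    }
    printf("status=%d bytes_deduped=%llu\n", r->info[0].status,
           (unsigned long long)r->info[0].bytes_deduped);

    free(r);
    close(src);
    close(dst);
    return 0;
}
```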

zenaan commented 2 years ago

Implementing a content-addressed map layer (on top of zfs' existing block etc. backend) for dedupe, which would open up a world of git-like possibilities to boot, has to be conceptually one of the simplest sigma-/god-tier enhancements possible - for anyone wanting a quick start to ego-boosting stardom in the FS/kernel world :)

jittygitty commented 2 years ago

Implementing a content-addressed map layer (on top of zfs' existing block etc. backend) for dedupe, which would open up a world of git-like possibilities to boot, has to be conceptually one of the simplest sigma-/god-tier enhancements possible - for anyone wanting a quick start to ego-boosting stardom in the FS/kernel world :)

I was afraid to say that out loud! But after 12 days of banging my head and torturing myself poring over code and patches all the way from 2008 to now, I was starting to come to that conclusion as well. Which raised some uncomfortable questions as to why these features, which the "community" has been begging and crying for all over the internet for the past TWELVE YEARS, have seemingly gotten very little serious attention from the project contributors/leaders etc. If you read my #13349 issue you'll see I quoted someone in a Phoronix thread who accused openzfs of refusing to give the "real reason", which they claimed was openzfs's fear of license incompatibility; I didn't think that was the reason. But my ticket and my questions have been ignored by the leadership so far - of course they might just be busy, so I'll have to be a little patient. Yet if my questions keep getting ignored, sadly I'll have to conclude the Phoronix guy was mouthing off about a conspiracy that would turn out to be true.

Personally, I thought the reason for the lack of progress, given that these features should be very doable on Linux, was that most of the development was driven by companies working with illumos kernels or BSD; the fact that it would be easy to do on LINUX but not on their OSes meant the developers weren't going to get paid by the companies they worked for to work on it.

I plan to post in #7545 and https://github.com/openzfs/zfs/pull/9554 to see whether the implementation they were attempting had to do extra work/workarounds around any GPL-only exports (i.e. EXPORT_SYMBOL_GPL), or whether they didn't hit any of those.

That's because my questions in https://github.com/openzfs/zfs/issues/11357 haven't gotten a reply from anyone for twelve days. But again, I'm trying to be patient, since hey, maybe "everyone" has just been too busy and just needs a few reminders, etc.

zenaan commented 2 years ago

tl;dr: Volunteers be volunteerin', so give 'em gratitude, not disdain.

"Which raised some uncomfortable questions, as to the reason why these features that the "community" has been begging and crying for all over the internet for the past TWELVE YEARS have seemingly gotten very little serious attention from the project contributors/leaders"

There is no conspiracy, let's be clear on this!

What we do have is a bunch of mostly volunteers doing what they can when they can, to improve things for themselves and others.

In this world, unless I can say "I have money to plonk on the table and pay to get exactly A, B and C implemented, pronto", I have no leg to stand on.

We can't tell volunteers what to do.

OZFS is a community of mostly helpful people, not "a dictatorship people are trying to escape".

In such a "do"-ocracy, the only relevant question is "what can I do in the next few days or months?". If you can answer that with "implement this cool new feature", then that's great. In this particular case of the "content/file addressed ozfs layer" feature suggestion, I was merely highlighting the concept, as well as its (IMSEHO) conceptual (at least) simplicity, for any budding rock stars.

(Oh and btw, if you are personally up for the coding, just work in your own fork/branch and invite the world to have a look-see when you have something worth looking at - and if it works as well and as simply as it should, based on the conceptual analysis above, then the feature should take off like wildfire.)

Ok ok, off my high horse - hopefully this is now excessively clear :)


jittygitty commented 2 years ago

@zenaan Hey, if you read my other post you'd see I thought the Phoronix post was not true, and that nobody was conspiring to hide the real reason as he was saying. In my opinion it was simply that contributors had "other priorities" - maybe not "LINUX" but BSD- or illumos-kernel-based distributions - and if a feature was ONLY EASY to do on "LINUX" and "NOT" on the other distributions, then naturally they would work on their own distributions instead, regardless of whether the feature would be easier to do on Linux. That said, if I never do get responses to my simple questions, like those in https://github.com/openzfs/zfs/issues/11357, then and only then might I start to believe maybe I was "wrong", and that perhaps somebody does think the licensing issue with Linux is the reason why some things that should be easier on Linux than on all the other distros were not done, etc.

I've always been grateful for everyone's contributions, and I even respect their wish to work on the distributions important to them, even if it's not "Linux".

Anyway, not sure if you've seen my recent post here about crowd-funding issues/features etc.: https://github.com/openzfs/zfs/issues/13397