openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Feature request for file based copy on write #3020

Open pavel-odintsov opened 9 years ago

pavel-odintsov commented 9 years ago

Hello, folks!

I have been going through the rich feature set of ZFS, and it fits my container tasks perfectly.

But I haven't found one very important feature: I want to be able to use one file in multiple places until it is changed somewhere.

I want to create a read-only volume (a "template" in OpenVZ terms) with a set of binary files representing the root hierarchy of Debian Wheezy. After that, I want to create hundreds of volumes for customers with a file hierarchy completely identical to the "template". Every /usr/bin/apache2 for every customer would then have identical content and the same inode, and would be stored in the filesystem only once. If a customer wants to change or remove /usr/bin/apache2, it is "unlinked" from the "template" and behaves like a standard file. This approach would improve system performance, prevent double caching, and free up much more space without very costly deduplication.

If that is still unclear: I want something like fork behavior in Linux, where the child's memory is not actually allocated until it is touched.

Something like this was implemented in VServer (http://linux-vserver.org/util-vserver:Vhashify) as a patch for ext4. The same approach was used in Parallels Virtuozzo vzfs (http://download.swsoft.com/virtuozzo/virtuozzo4.0/docs/en/lin/VzLinuxUG/209.htm and http://www.montanalinux.org/openvz-kir-interview.html ), but it was very buggy and hard to maintain. They tried to do this on a non-CoW filesystem, though, and that was not a good idea.

This feature can provide the following benefits compared with "copy this template 100 times":

  • Very large disk space savings
  • Lower I/O load, because only the files' metadata is copied and the file contents are not touched
  • Lower memory usage, because these files are cached only once
  • Faster operation than a direct call of: zfs send debian-template@11022014 | zfs receive client-container-42

As I understand ZFS internals, this feature can be implemented reliably on ZFS. And every user of VMs or containers (with haystacks of identical binary files) would be very pleased by this feature!

Thank you!

lkateley commented 9 years ago

What you are looking for is very easily done with snapshot and clone.

I have a short video showing how at http://kateleyco.com/?page_id=783

pavel-odintsov commented 9 years ago

Linda, thank you so much! I found 8 videos on this page. Could you tell me which one I should watch?

lkateley commented 9 years ago

There is one just about snapshot and clone. If you snapshot a dataset (snapshots apply to whole datasets, not individual files), the snapshot is a read-only copy of the filesystem. You can then clone it to make an identical read-write version. The clone's blocks point back to the original blocks.

lkateley commented 9 years ago

I know that you want this for a single file, but that can be done through links too. Snapshot and clone work well on a template of files in a directory.

pavel-odintsov commented 9 years ago

Thank you again!

It's exactly what I want! I really appreciate your help!

Clones

A clone is a writable volume or file system whose initial contents are the same as another dataset. As with snapshots, creating a clone is nearly instantaneous, and initially consumes no additional space.

Clones can only be created from a snapshot. When a snapshot is cloned, it creates an implicit dependency between the parent and child. Even though the clone is created somewhere else in the dataset hierarchy, the original snapshot cannot be destroyed as long as a clone exists. The origin property exposes this dependency, and the destroy command lists any such dependencies, if they exist.

The clone parent-child dependency relationship can be reversed by using the promote subcommand. This causes the "origin" file system to become a clone of the specified file system, which makes it possible to destroy the file system that the clone was created from.
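The workflow that excerpt describes (snapshot the template once, clone the snapshot per container, optionally promote a clone) might look roughly like the sketch below. It only builds and prints the `zfs snapshot`, `zfs clone` and `zfs promote` command lines. The pool name `tank` and the helper names `clone_commands`, `promote_command` and `run` are hypothetical, and the dataset and snapshot names simply reuse the ones from the opening post.

```python
import subprocess


def clone_commands(template, snapshot, containers):
    """Build argv lists that snapshot a template once and clone it per container."""
    snap = f"{template}@{snapshot}"
    commands = [["zfs", "snapshot", snap]]
    for name in containers:
        # Each clone starts out sharing all of its blocks with the snapshot.
        commands.append(["zfs", "clone", snap, name])
    return commands


def promote_command(clone):
    """Build the argv that reverses the parent-child dependency for one clone."""
    return ["zfs", "promote", clone]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # "tank" is an assumed pool name; nothing is executed in dry-run mode.
    run(clone_commands("tank/debian-template", "11022014",
                       ["tank/client-container-42"]))
```

Keeping the command construction separate from execution makes it possible to review the exact commands before running anything against a real pool.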

pavel-odintsov commented 9 years ago

Yes, zfs clone can solve the initial part of my issue. I can create containers/VMs without copying the same data multiple times.

But what about the ability to relink a dataset to another template, as implemented in VServer http://linux-vserver.org/util-vserver:Vhashify ?

I want it for the following case:

  • I installed a container with Debian 7.0.1
  • The customer upgraded Debian 7.0.1 to 7.0.2 manually
  • The links to the original template snapshot are now broken, and I want to link the container to the Debian 7.0.2 template again.

Is it possible?

pavel-odintsov commented 9 years ago

I have been thinking hard about another approach, used in OpenVZ pfcache http://wiki.openvz.org/Pfcache/API

It's a very interesting approach to deduplicating binary and library files in memory.

They process /usr/bin, /usr/sbin, /bin and /sbin on every container disk and generate a SHA-1 checksum for each file. They then store the SHA-1 in a special xattr field (trusted.pfcache).

As the next step, they build a lookup table to identify unique files. Once all unique binary files are found, they are copied to the /vz/pfcache folder under names built from their SHA-1 checksums, and the original files are replaced with links to the files in this folder.

This approach provides the following features:

Is it possible to do something like this natively with ZFS?

lkateley commented 9 years ago

It sounds like you could use rollback of a snapshot for this.
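For reference, the pfcache-style scheme described in the thread (checksum each binary, keep one copy named after its SHA-1, replace the originals with links) can be approximated in user space as follows. This is only a sketch of the idea as it is described in the comment, not OpenVZ's actual implementation: it uses plain hard links, skips the trusted.pfcache xattr step, ignores differences in ownership and permissions, and requires all files to live on the same filesystem as the cache directory. The names `sha1_of` and `dedup_into_cache` are made up for the example.

```python
import hashlib
import os


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_into_cache(paths, cache_dir):
    """Hard-link files with identical content to one cached copy.

    Returns the number of files that were replaced by a link.
    """
    os.makedirs(cache_dir, exist_ok=True)
    relinked = 0
    for path in paths:
        cached = os.path.join(cache_dir, sha1_of(path))
        if not os.path.exists(cached):
            # First file with this content: it becomes the cached copy.
            os.link(path, cached)
            continue
        if os.path.samefile(path, cached):
            continue  # already linked to the cache entry
        # Create the new link under a temporary name, then swap it in atomically.
        tmp = path + ".pfcache-tmp"
        os.link(cached, tmp)
        os.replace(tmp, path)
        relinked += 1
    return relinked
```

Note that nothing here protects the shared copy from writes; a container that modifies a linked file changes it for everyone, which is why the link-breaking behavior discussed in this thread matters.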

gordan-bobic commented 9 years ago

Clones, snapshots and deduplication are NOT equivalent to this. Please stop suggesting they are close enough, because they really, really aren't. There is a major difference, and there are additional inconveniences with cloning snapshots.

1) CoW hard-link breaking would mean the inodes are the same. This is extremely important because it means the mmap() of the binary will result in only one in-memory copy, no matter how many different chroots mmap() different hard-links of it. This means a massive saving in memory for hosts running VServer, LXC or OpenVZ (or jails on FreeBSD, or any other similar feature on an OS that supports ZFS).

2) Clones are grandfathered. Once you have cloned a file system you cannot delete it. You have to keep it as long as any clones based on it exist - even if every last block has changed. Clones are also not rebasable.

Also, unlike memory deduplication done by a separate subsystem, CoW hard-link breaking would make deduplication free at the point of consumption. You hard-link the files periodically, and after that the deduplication of memory is implicit and completely free.

CoW hard-link breaking is a feature that simply cannot be meaningfully approximated using the existing features.
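The inode argument in point 1 can be checked from user space. The sketch below (the helper `inode_report` and the file names are invented for the illustration) creates a file, a hard link to it and a plain copy, and reports their inode numbers: the hard link shares the original's inode, while the copy gets its own. It says nothing about how ZFS clones behave; it only shows the hard-link property the argument relies on.

```python
import os
import shutil
import tempfile


def inode_report(directory):
    """Create a file, a hard link and a copy; return their inode numbers."""
    original = os.path.join(directory, "apache2")
    with open(original, "wb") as handle:
        handle.write(b"placeholder binary contents")

    linked = os.path.join(directory, "apache2.hardlink")
    os.link(original, linked)        # second name for the same inode

    copied = os.path.join(directory, "apache2.copy")
    shutil.copy(original, copied)    # independent file with its own inode

    return {
        "original": os.stat(original).st_ino,
        "hard link": os.stat(linked).st_ino,
        "copy": os.stat(copied).st_ino,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        for name, inode in inode_report(scratch).items():
            print(f"{name}: inode {inode}")
```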

pavel-odintsov commented 9 years ago

Thank you for the detailed explanation, Gordan Bobic!

"Clones are grandfathered. Once you have cloned a file system you cannot delete it. You have to keep it as long as any clones based on it exist - even if every last block has changed. Clones are also not rebasable."

We could address this with the zfs promote command, which detaches the clone from the original template and makes it a stand-alone volume. But once we promote, we no longer get any of the benefits...

gordan-bobic commented 9 years ago

Indeed, hence my remark that a clone cannot be rebased.

Ideally we want to set a ZFS option, e.g.:

zfs set cowhardlink=1 pool/fs

which would subsequently cause any file on that filesystem opened with open() for writing (but not for reading) to be copied first if the refcount for its inode is > 1.

In pseudocode, something along the lines of the following:

    if (refcount > 1 && ((flags & O_ACCMODE) == O_WRONLY || (flags & O_ACCMODE) == O_RDWR)) {
        /* copy the entire file and return a file handle to the new copy */
    }
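A user-space approximation of that check, written as a wrapper around `os.open`, might look like the sketch below. It is an illustration of the proposed semantics only: `cow_open` is a hypothetical helper, the `cowhardlink` property does not exist in ZFS, and a real implementation would have to live in the filesystem so that a guest cannot bypass it. The sketch is also not race-free and does not preserve file ownership.

```python
import os
import shutil
import tempfile


def cow_open(path, flags, mode=0o644):
    """Open a file, breaking its hard link first if it is opened for writing."""
    # O_RDONLY is 0, so any of these bits being set means write access.
    wants_write = bool(flags & (os.O_WRONLY | os.O_RDWR))
    if wants_write and os.path.exists(path) and os.stat(path).st_nlink > 1:
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        os.close(fd)
        shutil.copy2(path, tmp)   # private copy of the contents and mode
        os.replace(tmp, path)     # this name now refers to a new inode
    return os.open(path, flags, mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        template = os.path.join(scratch, "template-apache2")
        client = os.path.join(scratch, "client-apache2")
        with open(template, "wb") as handle:
            handle.write(b"v1")
        os.link(template, client)

        fd = cow_open(client, os.O_WRONLY | os.O_TRUNC)
        os.write(fd, b"v2")
        os.close(fd)

        with open(template, "rb") as handle:
            print("template:", handle.read())
        with open(client, "rb") as handle:
            print("client:", handle.read())
```

Opening for reading leaves the link intact, so readers keep sharing a single inode; only the first open for writing pays for the copy.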

pavel-odintsov commented 9 years ago

Yep, rebase is absolutely impossible for zfs/clone :(

pavel-odintsov commented 9 years ago

Issue https://github.com/zfsonlinux/zfs/issues/405 and thread https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/mvGB7QEpt3w will be useful for everyone interested in this feature.

gordan-bobic commented 9 years ago

While there may be some functionality overlap, these are really different features, both functionally and semantically. What torn5 was asking for is actually much more similar to using zvols and clones of zvols - it's just that he wanted to use files rather than zvols for his own reasons, valid or otherwise. The CoW hard-link breaking is much more like FL-COW that I mentioned on that thread, only it needs to be implemented at a level below what the guest chroot can control for security reasons.
