[rfc] OCIv2 implementation

cyphar commented 6 years ago

I have some proposal ideas for the OCIv2 image specification (it would actually be OCIv1.1 but that is a less-cool name for the idea) and they primarily involve swapping out the lower levels of the archive format to better designed (along the same lines as restic or borgbackup).

We need to implement this as a PoC in umoci before it's proposed to the image-spec proper so that we don't get stuck in debates over whether it has been tested "in the wild" -- which is something that I imagine any OCI extension is going to go through.

cyphar commented 6 years ago

As an aside, it looks like copy_file_range(COPY_FR_DEDUP) wasn't merged. But you can use ioctl(FICLONERANGE) or ioctl(FIDEDUPERANGE) (depending on which is the most correct way of doing it -- I think FICLONERANGE is what we want). If it isn't enough we can always revive the patch, as one of the arguments against it was that nobody needed partial-file deduplication -- but we need this now for OCIv2 to have efficient deduplicated storage.

cyphar commented 6 years ago

FICLONERANGE needs to be block-aligned (unsurprisingly) but unfortunately the block alignment is for both source and destination. This means that if we have different block sizes which are out-of-alignment of the filesystem block size we will have very few alignments.

On the plus side, for small files we can just use reflinks.

cyphar commented 5 years ago

Some things that should be tested and discussed:

How bad is the Merkle tree hit? Should each individual file be linked from a map (or a packfile) of some kind to avoid really tall trees? How deep can a normal distribution's filesystem go? Each de-reference can be quite expensive (especially if it involves a pull -- but I would hope that HTTP/2 server push would resolve this somewhat).
What sort of chunk size is optimal?
How should we implement that canonical representation checking? This is something that should be a hard failure when trying to use an image, to avoid incompatible tools from doing something wrong.
As a point of comparison, looking at how much transfer-deduplication gain we can get from content-defined-chunking would be interesting.
Do we need to define a new rootfs type other than layered for this change? Layers are something we should potentially drop -- but maybe we should structure it as a "snapshot" concept in case people still want snapshots.

vbatts commented 5 years ago

like your inspiration from https://github.com/restic/restic, I think there is a good argument that the chunks and content addressable storage ought to be compatible with https://github.com/systemd/casync too.

cyphar commented 5 years ago

I will definitely look into this, though it should be noted (and I think we discussed this in-person in London) that while it is very important for fixed chunking parameters to be strongly recommended in the standard (so that all image builders can create compatible chunks for inter-distribution chunking) I think they should be configurable so that we have the option to transition to different algorithms in the future.

Is there a paper or document that describes how casync's chunking algorithm works? I'm looking at the code and it uses Buzhash (which has a Go implementation apparently) but it's not clear to me what the chunk boundary condition is in shall_break (I can see that it's (v % c->discriminator) == (c->discriminator - 1) but I don't know what that means).

I'm also quite interested in the serialisation format. Lennart describes it as a kind of random-access tar that is also reproducible (and contains all filesystem information in a sane way). I will definitely take a look at it. While I personally like using a Merkle tree because it's what git does and is kind of what makes the most sense IMO (plus it is entirely transparent to the CAS), I do see that having a streamable system might be an improvement too.

cyphar commented 5 years ago

As an aside, since we are creating a new serialisation format (unless we reuse casync) we will need to implement several debugging tools because now you will no longer be able to use tar for debugging layers.

giuseppe commented 5 years ago

I've already talked with @cyphar about it, but I'll comment here as well so to not lose track of it. The deduplication could also be done only locally (for example on: XFS with reflinks support). So that network deduplication and local storage deduplication could be done separately.

I've played a bit with FIDEDUPERANGE here: https://github.com/giuseppe/containers-dedup

flx42 commented 5 years ago

@cyphar what was the argument against doing simply file-level deduplication? I don't claim to know the typology of all docker images, but on our side (NVIDIA) we have a few large libraries (cuDNN, cuBLAS, cuFFT) which are currently duplicated across multiple images we publish:

The files are duplicated if you redo a build even if nothing has changed, since it creates a new layer with the same content.
The files are duplicated across CUDA images with different distros: the same library is shipped for our CentOS 6/7, and Ubuntu 14.04/16.04/18.04 tags.

@giuseppe @cyphar it is my understanding that when deduplicating files/blocks at the storage level, we decrease storage space but the two files won't be able to share the same page cache entry. Is that accurate? Is that an issue that can be solved at this level too? Or will users still need to layer carefully to achieve this sharing?

vbatts commented 5 years ago

@flx42 overlayfs has best approach for reusing page cache, since it's the same inode on the same maj/min device

flx42 commented 5 years ago

@vbatts right, and that's what we use today combined with careful layering. I just wanted to clarify if there was a solution at this level, for the cases where you do have the same file but not from the same layer.

AkihiroSuda commented 5 years ago

The deduplication could also be done only locally (for example on: XFS with reflinks support). So that network deduplication and local storage deduplication could be done separately.

I think we (at least I) have put focus on registry-side storage & network deduplication.

Runtime-side local deduplication is likely to be specific to runtimes and out of scope of OCI Image Spec & Dist Spec?

cyphar commented 5 years ago

@AkihiroSuda

A few things:

It depends how you define "runtime". If you include everything about the machine that pulls the image, extracts the image, and the runs a container as the "runtime" then you're correct that it's a separate concern. But I would argue that most image users would need to do both pulling and extraction -- so it's clearly an image-spec concern to at least consider it.
Ignoring (or punting on) storage deduplication (when we have the chance to do it) would likely result in suboptimal storage deduplication -- which is something that people want! I would like OCIv2 images to actually replace OCIv1 and if the storage deduplication properties are worse or no better, then that might not happen.

Given that CDC (and separation of metadata, merkle tree or some similar filesystem representation) already solves both "registry-side storage & network deduplication" I think that considering whether it's possible to take advantage of the same features for storage deduplication is reasonable...

cyphar commented 5 years ago

@flx42

what was the argument against doing simply file-level deduplication?

Small modifications of large files, or files that are substantially similar but not identical (think man pages, shared libraries and binaries shipped by multiple distributions, and so on) would be entirely duplicated. So for the image format I think that using file-level deduplication is flawed, for the same reasons that file-level deduplication in backup systems is flawed.

But for storage deduplication this is a different story. My main reason for wanting to use reflinks is to be able to use less disk space. Unfortunately (as I discovered above) this is not possible for variable-size chunks (unless they are all multiples of the chunk size).

Using file-based deduplication for storage does make some sense (though it does naively double your storage requirement out-of-the gate). My idea for this would be that when you download all of the chunks and metadata into your OCI store, you set up a separate content-addressed store which has files that correspond to each file represented in your OCI store. Then, when constructing a rootfs, you can just reflink (or hardlink if you want) all of the files from the file store into a rootfs (overlayfs would have to be used to make sure you couldn't touch any of the underlying files). Of course, it might be necessary (for fast container "boot" times) to pre-generate the rootfs for any given image -- but benchmarks would have to be done to see if it's necessary.

My main interest in reflinks was to see whether it was possible to use them to remove the need for the copies for the "file store", but given that you cannot easily map CDC chunks to filesystem chunks (the latter being fixed-size) we are pretty much required to make copies I think. You could play with a FUSE filesystem to do it, but that is still slow (though some recent proposals to use eBPF could make it natively fast).

As for the page-cache I'm not sure. Reflinks work by referencing the same extents in the filesystem, so it depends on how the page-cache interacts with extents or whether the page-cache is entirely tied to the particular inode.

cyphar commented 5 years ago

@flx42

It should be noted that with this proposal there would no longer be a need for layers (because the practical deduplication they provide is effectively zero) though I think that looking into how we can use existing layered filesystems would be very useful -- because they are obviously quite efficient and it makes sense to take advantage of them.

Users having to manually finesse layers is something that doesn't make sense (in my view), because the design of the image format should not be such that it causes problems if you aren't careful about how images are layered. So I would hope that a new design would not repeat that problem.

flx42 commented 5 years ago

@cyphar Thanks for the detailed explanation, I didn't have a clear picture of the full process, especially on how you were planning to assemble the rootfs, but now I understand.

As for the page-cache I'm not sure. Reflinks work by referencing the same extents in the filesystem, so it depends on how the page-cache interacts with extents or whether the page-cache is entirely tied to the particular inode.

I found the following discussion on this topic: https://www.spinics.net/lists/linux-btrfs/msg38800.html I was able to reproduce their results with btrfs/xfs, indicating that the page cache was not shared. As you mentioned, the solution could be to hardlink files when assembling the final rootfs instead of reflinking. You would need an overlay obviously, but that means you won't be able to leverage the CoW mechanism from the underlying filesystem (which might be fined grain) and instead rely on copy_up which copies the full file AFAIK.

Not necessarily a big deal, but nevertheless an interesting benefit of layer sharing+overlay that would be nice to keep.

flx42 commented 5 years ago

FWIW, I wanted to quantify the difference with block-level vs file-level deduplication on real data, so I wrote a few simple scripts here: https://github.com/flx42/layer-dedup-test

It pulls all the tags from this list (minus the Windows tags that will fail). This was the size of the layer directory after the pull:

+ du -sh /mnt/docker/overlay2
822G    /mnt/docker/overlay2

Using rmlint with hardlinks (file-level deduplication):

+ du -sh /mnt/docker/overlay2
301G    /mnt/docker/overlay2

Using restic with CDC (block-level deduplication):

+ du -sh /tmp/restic
244G    /tmp/restic

This is a quick test, so no guarantee that it worked correctly. But this is a good first approximation. File-level deduplication performed better than I expected, block-level with CDC is indeed better but at the cost of extra complexity and possibly a two-level content store (block then file).

cyphar commented 5 years ago

Funnily enough, Go 1.11 has changed the default archive/tar output -- something that having a canonical representation would solve. See #269.

cyphar commented 5 years ago

@flx42

You would need an overlay obviously, but that means you won't be able to leverage the CoW mechanism from the underlying filesystem (which might be fined grain) and instead rely on copy_up which copies the full file AFAIK.

Does overlay share the page cache? It was my understanding that it didn't, but that might be an outdated piece of information.

flx42 commented 5 years ago

@cyphar yes it does: https://docs.docker.com/storage/storagedriver/overlayfs-driver/#overlayfs-and-docker-performance

Page Caching. OverlayFS supports page cache sharing. Multiple containers accessing the same file share a single page cache entry for that file. This makes the overlay and overlay2 drivers efficient with memory and a good option for high-density use cases such as PaaS.

Also a while back I launched two containers, one pytorch and one tensorflow, using the same CUDA+cuDNN base layers. Then using /proc/<pid>/maps on both containers I was able to verify they loaded the same copy of one library (the same inode).

cgwalters commented 5 years ago

My idea for this would be that when you download all of the chunks and metadata into your OCI store, you set up a separate content-addressed store which has files that correspond to each file represented in your OCI store. Then, when constructing a rootfs, you can just reflink (or hardlink if you want) all of the files from the file store into a rootfs (overlayfs would have to be used to make sure you couldn't touch any of the underlying files).

This is exactly what libostree is, though today we use a read-only bind mount since we don't want people trying to persist state in /usr. (It's still a mess today how in the Docker ecosystem / is writable by default but best practice is to use Kubernetes PersistentVolumes or equivalent). Though running containers as non-root helps since that will at least deny writes to /usr.

cyphar commented 5 years ago

Blog post on the tar issues is up. https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar

vbatts commented 5 years ago

And so much conversation on The Twitter

-------- Original Message -------- On Jan 21, 2019, 14:37, Aleksa Sarai [see §317C(6)] wrote:

Blog post on the tar issues is up. https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Toasterson commented 4 years ago

Interesting Discussion on the image proposal based on files. If you want to see a production grade example have a look at the Image Packaging System (IPS) from illumos. Its based originally on the concept to be used as a package manager inside a Container but one can easily leave dependencies out of a manifest and thus create an image layer so to speak. Manifests are also merged ahead of time and so you only need to download what is needed. Additionally, by using text files to encode any metadata one can simply encode any attributes needed later in the spec. I was thinking on extending the server with a registry API so that one can download a dynamically generated tarfile and use the file based storage in the background.

While it has a few pythonisms in it, I made a port of the server and manifest code to golang some time ago. Let me know if any of this is interesting to you I can give detailed insights and information about challenges we stumbled upon in the field in the last 10 years.

Original Python Implementation (In use today on OpenIndiana and OmniOS) https://github.com/OpenIndiana/pkg5 Personal port to golang (server side only atm) https://git.wegmueller.it/Illumos/pkg6

safinaskar commented 1 year ago

So here is my own simplistic parallel casync/desync alternative, written in Rust, which uses fixed sized chunking (which is great for VM images): https://github.com/borgbackup/borg/issues/7674#issuecomment-1654175985 . You can also see there benchmark, which compares my tool to casync, desync and other alternatives. And my tool is way faster than all them. (But I cheat by using fixed sized chunking). See whole issue for context and especially this comment https://github.com/borgbackup/borg/issues/7674#issuecomment-1656787394 for comparison between casync, desync and other CDC-based tools

safinaskar commented 1 year ago

Okay, so here is list of Github issues I ~spammed~ wrote in last few days on this topic (i. e. fast fixed-sized and CDC-based deduplication). I hope they provide great insight to everyone interested in fast deduplicated storage. https://github.com/borgbackup/borg/issues/7674 https://github.com/systemd/casync/issues/259 https://github.com/folbricht/desync/issues/243 https://github.com/ipfs/specs/issues/227 https://github.com/dpc/rdedup/discussions/222 https://github.com/opencontainers/umoci/issues/256

ariel-miculas commented 1 year ago

I'm working on puzzlefs which shares goals with the OCIv2 design draft. It's written in Rust and it uses the FastCDC algorithm to chunk filesystems. Here's a summary of the saved space compared to the traditional OCIv1 format. I will also present it at upcoming Open Source Summit Europe in September.

safinaskar commented 1 year ago

@ariel-miculas, cool! Let me share some thoughts.

This thread https://groups.google.com/a/opencontainers.org/g/dev/c/icXssT3zQxE may be of interest. In particular, Sarai said "I am suggesting that a new filesystem would be a good optional way of optimising usage of OCI images" and Greg KH answered "No, never create a new filesystem unless you have 5-10 years to focus exclusivly on it before you can rely on it". I think Greg's words should not be taken too seriously
When I did my benchmark, I saw very strange behavior of another dedupper written in Rust: rdedup ( https://github.com/dpc/rdedup ). rdedup uses fastcdc as its default chunking method. My benchmark ( https://github.com/borgbackup/borg/issues/7674#issuecomment-1656787394 ) shows that for unknown reasons rdedup becomes x10 slower (!!!), when chunk size changes from 4096K to 64K. This may be some special property of my data or of fastcdc. Or bug in Rust's fastcdc implementation. Make sure you have no this bug
FUSE badly interacts with suspend. See this thread https://lore.kernel.org/lkml/CAPnZJGDWUT0D7cT_kWa6W9u8MHwhG8ZbGpn=uY4zYRWJkzZzjA@mail.gmail.com/ . So make sure that system can be suspended while your fs is mounted. If you see this bug, then, I think, it can be fixed by always using timeouts

ariel-miculas commented 1 year ago

Thanks for your feedback!

Interesting issue with rdedup, puzzlefs is using the fastcdc crate which I've noticed it's not used by rdedup. Some benchmarks with puzzlefs would sure be useful.
I wasn't aware of the FUSE issue with suspend. However, I'm also working on a kernel driver for puzzlefs, see version 1, version 2 and also this github issue.

opencontainers / umoci

[rfc] OCIv2 implementation #256