Proposal: Global Image/Layer Namespace

alex-aizman commented 9 years ago

1. Terms

Global Namespace: often refers to the capability to aggregate remote filesystems via unified (file/directory) naming while at the same time supporting unmodified clients. Not to be confused with LXC pid etc. namespaces

2. sha256

Docker Registry V2 introduces content-addressable globally unique (*) digests for both image manifests and image layers. The default checksum is sha256.

Side note: sha256 covers a space of more than 10 \ 77 unique random digests, which is about as much as the number of atoms in the observable universe. Apart from this unimaginable number sha256 has all the good crypto-qualities including collision resistance, avalanche effect for small changes, pre-image resistance and second pre-image resistance.

The same applies to sha512 and SHA-3 crypto-checksums, as well as, likely, Edon-R and Blake2 to name a few.

Those are the distinct properties that allows us to say the following: two docker images that have the same sha256 digest are bitwise identical; the same holds for layers and manifests or, for that matter, any other sha256 content-addressable "asset".

This simple fact can be used not only to self-validate the images and index them locally via Graph’s in-memory index. This can be further used to support global container/image namespace and global deduplication. That is:

Global Namespace Global Deduplication

for image layers. Hence, this Proposal.
3. Docker Cluster

Rest of this document describes only the initial implementation and the corresponding proof-of-concept patch:

The setup is a number (N >= 2) of hosts or VMs, logically grouped in a cluster and visible to each other through, for instance, NFS. Every node in the cluster runs docker daemon. Each node performs a dual role: it is NFS server to all other nodes, with NFS share sitting directly on the node’s local rootfs. Simultaneously, each node is NFS client, as per the diagram below:

docker-namespace-federated

Blue arrows reflect actual NFS mounts.

There are no separate NAS servers: each node, on one hand, shares its docker (layers, images) metadata and, separately, driver-specific data. And vice versa, each node mounts all clustered shares locally, under respective hostnames as shown above.

Note: hyper-convergence

Often times this type of depicted clustered symmetry, combined with the lack of physically separate storage backend is referred to as storage/compute "hyper-convergence". But that's another big story outside this scope..

Note: runtime mounting

As far as this initial implementation (link above) all the NFS shares are mounted statically and prior to the daemon’s startup. This can be changed to on-demand mount and more..

Back to the diagram. There are two logical layers: Graph (image and container metadata) and Driver (image and container data). This patch patches them both - the latter currently is done for aufs only.

4. Benefits

An orchestrator can run container on an image-less node, without waiting for the image to get pulled
Scale-out: by adding a new node to the cluster, we incrementally add CPU, memory and storage capacity for more docker images and containers that, in turn, can use the aggregated resource
Deduplication: any image or layer that exists in two or more instances can be, effectively, deduplicated. This may require pause/commit and restart of associated containers; this will require reference-counting (next)
5. Comments

It's been noted in the forums and elsewhere that mixing images and containers in the Graph layer is probably not a good idea. From the clustered perspective it is easy to see that it is definitely not a good idea - makes sense to fork /var/lib/docker/graph/images and /var/lib/docker/graph/containers, or similar.

6. What’s Next

The patch works as it is, with the capability to “see” and run remote images. There are multiple next steps, some self-evident others may be less.

The most obvious one is to un-HACK aufs and introduce a new multi-rooted (suggested name: namespace) driver that would be in-turn configurable to use the underlying OS aufs or overlayfs mount/unmount.

This is easy but this, as well as the other points below, requires positive feedback and consensus.

Other immediate steps include:

graph.TagStore to tag all layers including remote
rootNFS setting via .conf for Graph
fix migrate.go accordingly

Once done, next steps could be:

on demand mounting and remounting via distributed daemon (likely etcd)
node add/delete runtime support - same
local cache invalidation upon new-image-pulled, image-deleted, etc. events (“cache” here implies Graph.idIndex, etc.)
image/layer reference counting, to correctly handle remote usage vs. ‘docker rmi’ for instance
and more

And later:

shadow copying of read-only layers, to trade local space for performance
and vice versa, removal of duplicated layers (the “dedup”)
container inter-node migration
container HA failover
object storage as the alternative backend for docker images and layers (which are in fact immutable versioned objects, believe it or not).

Some of these are definitely beyond just the docker daemon and would require API and orchestrator (cluster-level) awareness. But that’s, again, outside the scope of this proposal.

7. Instead of Conclusion

In the end the one thing that makes it – all of the above - doable and feasible is the immutable nature of image layers and their unique and global naming via crypto-content-hashes.

thaJeztah commented 9 years ago

ping @stevvooe @dmcgowan (I think)

dmcgowan commented 9 years ago

I agree on the direction of the proposal but we currently have different plans for how to get there. Although we are still trying to plan out what the immediate steps (for Docker 1.9) are related to the graph driver. We want both a significant code refactor to separate more cleanly graph store from tag store and to break down the graph store into an object store and layer store. I could see such a broken out layer interface supporting implementation for clustering based on NFS.

I would love to include you in these discussions as it is very common for code in this area to slip release due to focus being shifted elsewhere. It is becoming more and more a focal point though for distribution related problems. @stevvooe is the right person to continue to discussion with. I would also take a look at https://github.com/docker/blobber which addresses the problem from a different angle.

thaJeztah commented 9 years ago

@dmcgowan I think the https://github.com/docker/blobber is currently "private", because I get a 404 there

dmcgowan commented 9 years ago

Ahh yeah its private, a late night oversight. Thanks for keeping me honest @thaJeztah :smile:. I just wanted to show the design objectives in the README, let me see if we can get that into another document.

stevvooe commented 9 years ago

@alex-aizman What exactly are you proposing and what specific problems does this solve?

From an initial reading, it sounds like the proposal is to integrate NFS mounts into docker image storage to leverage better content sharing. We would likely never require people to configure NFS as part of a docker install. Aside from being a leaky abstraction, NFS is a nasty single point of failure without a lot of caveats and spotty support.

This seems like an interesting operational layout that can be supported by providing a sane path layout under /var/lib/docker.

It's been noted in the forums and elsewhere that mixing images and containers in the Graph layer is probably not a good idea.

This current constraint makes a lot of this work much harder. If we can externalize actual image storage from the graph driver, we make a lot of these problems easier. We are working on a project to make this easier.

LK4D4 commented 8 years ago

@stevvooe @dmcgowan did we implement this differently? Is this still actual?

stevvooe commented 8 years ago

@LK4D4 This is ambitious proposal. If we could divide the problems and solutions, we may be able to make it a little more actionable. There is still likely work to be done to allow a cluster of machines to share "at-rest" image storage.

alex-aizman commented 8 years ago

The steps on the tech side of things are very clear. There's this key concept, call it "centralized repository of immutable layers". The stuff can be designed around this concept, and 'docker pull', 'docker run' and friends will have to be changed accordingly. NFS of course must be one of the transport choices, etc. The works.

stevvooe commented 8 years ago

@alex-aizman In the patches provided, I don't really see anything NFS specific except for a reference to some sort of shared root folder. I think separating the pull-cache, artifact storage and other paths carefully would have the same affect.

However, there is something to be said about data locality. If one starts up the same image simultaneously across several cluster nodes, the NFS server will have to serve up that hot set nearly every time (depending on cache configuration). I can't see this performing much better than pulling from a central registry. In such scenarios, repeating data across disks gives you much better IO scaling (ie broadcast or p2p).

A static artifact store (just a filesystem path, really) that can be shared via arbitrary protocol would probably be a more scalable approach. This could be shared with NFS or bittorrent or anything.

alex-aizman commented 8 years ago

The patch is more than 1 year old, and I'd suggest to maybe move beyond this particular patch at this point - to the original motivation that caused this patch in the first place. Which back when and today remains the same - duplication. It is a shame to keep duplicating the same immutable bits. As far as NFS, see e.g. my text at storagetarget.com. NFS is just a standard and ubiquitous storage transport - today. Nobody in the right mind will say it is optimal for the docker layer images, etc. But NFS exists, it is totally prevalent in the world of file storage, and it is therefore must be designed in...

moby / moby