project-machine / puzzlefs

A next-generation container filesystem
Apache License 2.0
393 stars 17 forks source link

puzzlefs: a rough draft of serialization #6

Closed tych0 closed 3 years ago

tych0 commented 3 years ago

Here's some initial code that explores the ideas in the puzzlefs specification.

There are essentially two interesting parts: the actual wire format for puzzlefs and the content defined chunking algorithm. There are a couple other packages (oci for writing the OCI image, and exe for the actual binary) which are not very interesting for the purposes of puzzlefs beyond needing to exist :). The one thing we might want to change here is that we use the OCI v1.1 "blobs/sha256/" path, without doing the git-style "blobs/sha256/$first_byte/$second_byte" to keep the directory listings small. This seems like an opportune time to change that if there ever was one :)

format

Mostly we're using serde to serialize things to a cbor format for now. Perhaps it is reasonable to keep using this format for most objects for the forseeable future, but one place where we really can't use it is the inode table, since that requires inodes to be fixed size according to the spec (so we can do a binary search when resolving inodes on the mount/extract end of things quickly).

For this POC, I've just implemented fixed sized inodes by bit-banging things together in a gross way. However, we need to do this better if we want to generalize it at all, but perhaps it's okay to use serde for everything that's not the fixed-with inode bit.

There are a few other drawbacks to serde, though:

  1. it (reasonably) does not like "extra" data at the end of the stream it is de-serializing. This is a problem for techniques like our "blob ref" technique, where you just seek() to a particular part of the blob and start parsing, because the whole point is to not de-serialize the stream all at once. We could work around this by either 1. serde hacks to ignore extra data or 2. adding a length to everything. but if it's cbor encoded already, it knows its length...

  2. it seems non-good to use one serialization primitive for part of a format, and a different one for another part.

However, it is "free" and gets us off the ground, so we can keep using it for now.

builder

The builder module is the thing that walks the filesystem and actually builds a puzzlefs image. Right now, it only supports walking a single filesystem and generating the base image, and does not support adding deltas, although that's the goal.

There are various content defined chunking algorithms out there. The obvious prior art open source implementation is the one that is used in casync, which uses a buzhash. Around the time casync was being developed, the FastCDC paper was released, which is supposedly much better than most other things out there according to the paper's results (and various other googlable results, see the comment in fastcdc_fs.rs for details). So, we choose this algorithm for now.

The best-looking fastcdc implementation for rust does not exactly implement an API that is friendly for us. Really, we'd like to get a callback as soon as a chunk is generated, so that we can take the data stream and end it, thus only writing it once to the OCI image. However, fastcdc-rs doesn't offer us that API, so we end up just reading chunks into memory. We only process these chunks after every file is written to the stream, so may end up reading single large files into memory. This is fine for now, but presumably we'll want to fix this at some point.

Finally, we need to choose good parameters for FastCDC. We should do some study of this to see what works well (and if we want to allow user-specified sizing, which I think we do not).

For now, I have chosen 10M for the lower bound on chunk size, 40M for the average size, and 256M for the maximum size, based on the assumption that the Ubuntu rootfs is 40M. If we choose average chunk sizes any larger, we will likely not get any sharing across updates of the base rootfs, since the whole thing will always be one chunk.

Of course there is a tradeoff between how many total chunks there are and thus how many are needed to represent larger files. Some study about good parameters would be useful here.

TODO

The next steps are:

  1. implement the "real" wire format (aka use a real parser/unparser) serde doesn't have any fixed size emitters, so we'd end up fighting it most of the way, I think.
  2. implement a reader of the format once the above is mostly reasonable
  3. implement a version of FastCDC that doesn't gobble memory like a hungry teenager. the primitives from the paper are streaming primitives, so I don't think it will be that hard.
  4. implement a delta generator so we can begin to layer (piece? :D) puzzlefs images on top of each other.

Finally, I have given almost no thought to the no_std case for compiling this code, which is likely what we'll want for compiling this code in the kernel. But presumably we'll also want to think about that when choosing libraries, as some offer good no_std support, and some do not.

Signed-off-by: Tycho Andersen tycho@tycho.pizza

tych0 commented 3 years ago

/cc @giuseppe as well.

rchincha commented 3 years ago

Just curious, is there something like a "umociv2". Maybe this code gets organized that way?

tych0 commented 3 years ago

No, there is nothing like "umociv2" right now. There is no ociv2 right now, so nothing for a reference implementation to implement :)

rchincha commented 3 years ago

No, there is nothing like "umociv2" right now. There is no ociv2 right now, so nothing for a reference implementation to implement :)

What I meant was to split/organize this codebase into a umociv2 and its user. oci/lib/src is that I suppose.

tych0 commented 3 years ago

What I meant was to split/organize this codebase into a umociv2 and its user. oci/lib/src is that I suppose.

I believe it is already organized like this...

tych0 commented 3 years ago

Ok, I'm going to go ahead and merge this for now. I'm working on a fuse implementation right now.