project-machine / puzzlefs

A next-generation container filesystem
Apache License 2.0

OCI manifest format of PuzzleFS #55

Open ariel-miculas opened 1 year ago

ariel-miculas commented 1 year ago

The current puzzlefs manifest format is as follows:

$ target/debug/puzzlefs build ../test-puzzlefs/simple_rootfs /tmp/oci-simple first_try
$ cat /tmp/oci-simple/index.json | jq .
{
  "schemaVersion": -1,
  "manifests": [
    {
      "digest": "sha256:ddf711c6a55e0f90d6b85d487cc0f202a2189cf12ffb15851b27984dda74e414",
      "size": 55,
      "media_type": "application/vnd.puzzlefs.image.rootfs.v1",
      "annotations": {
        "org.opencontainers.image.ref.name": "first_try"
      }
    }
  ],
  "annotations": {}
}
$ file /tmp/oci-simple/blobs/sha256/ddf711c6a55e0f90d6b85d487cc0f202a2189cf12ffb15851b27984dda74e414
/tmp/oci-simple/blobs/sha256/ddf711c6a55e0f90d6b85d487cc0f202a2189cf12ffb15851b27984dda74e414: data
$ hexdump -C /tmp/oci-simple/blobs/sha256/ddf711c6a55e0f90d6b85d487cc0f202a2189cf12ffb15851b27984dda74e414
00000000  a1 69 6d 65 74 61 64 61  74 61 73 81 58 29 00 00  |.imetadatas.X)..|
00000010  00 00 00 00 00 00 01 00  ea 22 b7 85 11 76 3a 07  |........."...v:.|
00000020  97 70 7b e5 3a 52 f8 69  93 44 b5 02 7a bf 0f 6d  |.p{.:R.i.D..z..m|
00000030  f2 69 b8 0b 6b 44 26                              |.i..kD&|
00000037
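
As a side note on the hexdump above: the bytes appear to be CBOR-framed (that is a reading of the dump, not something the output states). Below is a minimal Python sketch that decodes only the outer framing by hand. `read_head` is a hypothetical helper written for this illustration, and the 41-byte payload is deliberately left uninterpreted.

```python
# Bytes copied from the hexdump of the 55-byte blob above.
blob = bytes.fromhex("""
    a1 69 6d 65 74 61 64 61 74 61 73 81 58 29 00 00
    00 00 00 00 00 00 01 00 ea 22 b7 85 11 76 3a 07
    97 70 7b e5 3a 52 f8 69 93 44 b5 02 7a bf 0f 6d
    f2 69 b8 0b 6b 44 26
""")


def read_head(buf, pos):
    """Decode one CBOR item head: return (major_type, argument, next_pos)."""
    initial = buf[pos]
    major, info = initial >> 5, initial & 0x1F
    pos += 1
    if info < 24:
        # Small arguments are stored directly in the initial byte.
        return major, info, pos
    if info <= 27:
        # 24..27 mean the argument follows in 1, 2, 4 or 8 big-endian bytes.
        width = 1 << (info - 24)
        return major, int.from_bytes(buf[pos:pos + width], "big"), pos + width
    raise ValueError("indefinite-length items are not handled in this sketch")


major_map, n_pairs, pos = read_head(blob, 0)      # major type 5: map
major_key, key_len, pos = read_head(blob, pos)    # major type 3: text string
key = blob[pos:pos + key_len].decode("utf-8")
pos += key_len
major_arr, n_items, pos = read_head(blob, pos)    # major type 4: array
major_bytes, blen, pos = read_head(blob, pos)     # major type 2: byte string
payload = blob[pos:pos + blen]
end = pos + blen

print(key, n_items, len(payload))  # prints: metadatas 1 41
```

Read this way, the blob is a one-entry map whose `metadatas` key holds an array with a single 41-byte byte string, and nothing else follows it.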

Whereas for oci v1, the manifest has the following format:

$ cat oci/index.json | jq .
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:38d1071460074ff45300379f7d88d1057071c4348ab9819fa59c6083d159eba1",
      "size": 589,
      "annotations": {
        "org.opencontainers.image.ref.name": "first"
      }
    }
  ]
}

$ file oci/blobs/sha256/38d1071460074ff45300379f7d88d1057071c4348ab9819fa59c6083d159eba1
oci/blobs/sha256/38d1071460074ff45300379f7d88d1057071c4348ab9819fa59c6083d159eba1: JSON text data
$ cat oci/blobs/sha256/38d1071460074ff45300379f7d88d1057071c4348ab9819fa59c6083d159eba1 | jq .
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:c45108df90c1fceb9ce8b0d9b8aa3f09f1e7e34d29ae44928ae26e259c0282ce",
    "size": 1222
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:a1d0c75327776413fa0db9ed3adcdbadedc95a662eb1d360dad82bb913f8a1d1",
      "size": 83518086
    }
  ],
  "annotations": {
    "io.stackeroci.stacker.git_version": "v0.30.1-9-g5e775ca",
    "io.stackeroci.stacker.stacker_yaml": "first:\n  from:\n    type: docker\n    url: docker://centos:latest\n"
  }
}
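
For comparison, here is a rough Python sketch of how a generic tool could collect every blob that an OCI v1 layout like the one above references, by following descriptors from index.json to the manifest and then to its config and layers. The function names are made up for this illustration, and nested indexes are not handled.

```python
import json
from pathlib import Path


def blob_path(layout, digest):
    """Map a digest such as 'sha256:abc...' to blobs/sha256/abc... in the layout."""
    algorithm, hex_digest = digest.split(":", 1)
    return Path(layout) / "blobs" / algorithm / hex_digest


def referenced_digests(layout):
    """Return the digests reachable from index.json, in discovery order.

    Assumption: every entry in index.json points directly at an image
    manifest (no nested image indexes).
    """
    index = json.loads((Path(layout) / "index.json").read_text())
    digests = []
    for manifest_desc in index["manifests"]:
        digests.append(manifest_desc["digest"])
        manifest = json.loads(blob_path(layout, manifest_desc["digest"]).read_text())
        digests.append(manifest["config"]["digest"])
        digests.extend(layer["digest"] for layer in manifest.get("layers", []))
    return digests
```

A walk like this only works if the manifest blob is JSON. In the current puzzlefs layout the blob behind index.json is binary, so a tool of this kind would stop at the first descriptor.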

hallyn commented 1 year ago

The OCIv1 manifest format is specified at https://github.com/opencontainers/image-spec/blob/main/manifest.md . I think we should stick to something closer to that.

Perhaps:

$ cat oci/index.json | jq "."
{
  "schemaVersion": 3,
  "manifests": [
    {
      "digest": "sha256:6b7980a6390ed4614465ec87388856583313cf0125deab02be0256c23a3cb006",
      "size": 55,
      "media_type": "application/vnd.puzzlefs.image.manifest.v1",
      "annotations": {
        "org.opencontainers.image.ref.name": "firstimage"
      }
    }
  ],
  "annotations": {}
}
$ cat oci/blobs/sha256/6b7980a6390ed4614465ec87388856583313cf0125deab02be0256c23a3cb006 | jq "."
{
  "schemaVersion": 3,
  "mediaType": "application/vnd.puzzlefs.image.manifest.v1",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:c45108df90c1fceb9ce8b0d9b8aa3f09f1e7e34d29ae44928ae26e259c0282ce",
    "size": 1222
  },
  "config": {
    "mediaType": "application/vnd.puzzlefs.image.metadata.v1",
    "digest": "sha256:c45108df90c1fceb9ce8b0d9b8aa3f09f1e7e34d29ae44928ae26e259c0282ce",
    "size": 55
  },
  "files": [
    {
      "mediaType": "application/vnd.puzzlefs.image.filedata.v1",
      "digest": "sha256:a1d0c75327776413fa0db9ed3adcdbadedc95a662eb1d360dad82bb913f8a1d1",
      "size": 403
    },
    {
      "mediaType": "application/vnd.puzzlefs.image.filedata.v1",
      "digest": "sha256:a1d0c75327776413fa0db9ed3adcdbadedc95a662eb1d360dad82bb913f8a1d1",
      "size": 797
    },
    {
      "mediaType": "application/vnd.puzzlefs.image.filedata.v1",
      "digest": "sha256:a1d0c75327776413fa0db9ed3adcdbadedc95a662eb1d360dad82bb913f8a1d1",
      "size": 679
    },
    {
      "mediaType": "application/vnd.puzzlefs.image.filedata.v1",
      "digest": "sha256:a38ced06f7cf1d2b235ffa81f165924cecddac544c0d915d13cffbe47ea29b56",
      "size": 561
    }
  ]
}

Explanation:

  1. Config is the runtime container config. We should still ship that.
  2. The application/vnd.puzzlefs.image.metadata.v1 descriptor points to what we currently call our 'manifest'.
  3. The files[] array lists all the blobs, so that a higher-level tool or script can easily tell which files it needs from this OCI layout in order to copy the image. puzzlefs itself wouldn't need it, since it can derive the list from its own manifest, but that won't help puzzlefs if the needed files/chunks aren't all there :)

ariel-miculas commented 1 year ago

It does seem a little weird to duplicate the information in both a JSON format and a custom capnproto format. What's more, the notion of a layer in OCIv1, which is contained in a single file, doesn't map well onto the puzzlefs concept of having a metadata file and multiple data files for a single layer. We could abuse the format and make each metadata/data file a single layer. That might work for getting the existing tools to copy the files, but it doesn't seem like a good design decision.

ariel-miculas commented 1 year ago

@hallyn what do you think?

hallyn commented 1 year ago

Well, failing a good idea for an alternative, let's leave it as is for now and re-open if we come up with something.

ariel-miculas commented 1 month ago

Now that I'm working on the stacker support for building PuzzleFS images, I think it's time to revisit this issue and the delta generation. Should we stick to the original OCI image manifest specification? We would need new image media types, but I'm wondering whether staying close to the OCI spec would make it easier for existing tools to work with PuzzleFS images.

For mounting the PuzzleFS image in the kernel, we would still need to have the manifest and layers in capnp format; maybe the two could coexist. We could add a new media type for the PuzzleFS layer which would point to the PuzzleFS metadata, and then add support for parsing this new media type (e.g. making sure we add all the chunks pointed to by the metadata layer to the OCI data store, i.e. blobs/sha256). Not sure how well the existing tools would deal with this, since there would be no references to these chunks/blobs from the usual json content descriptors, all the references would be only stored in the PuzzleFS metadata file, which is in capnp format. This approach is also hinted at by Aleksa Sarai at the end of his blog post. Or we could try this model proposed by Serge.

Another reason to stick close to the OCI format is to keep the OCI configuration format, which holds information such as architecture, OS, environment variables etc., none of which changes if we generate a PuzzleFS image.

On the other hand, the OCI format is tightly coupled with the notion of layering, which we don't want to do with PuzzleFS. Deduplication is achieved by splitting the filesystem into chunks with the CDC algorithm, and sufficiently similar images should end up sharing most of the chunks. Since PuzzleFS doesn't fit the OCI model, we might as well not care about being compatible with it. This would, however, complicate the addition of other features, such as support for running a PuzzleFS container. Besides, we would need to take care of generating all the relevant OCI (or OCI-inspired) metadata bits and pieces.
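
To illustrate the deduplication point (content-defined chunks are shared between similar inputs), here is a toy Python sketch. It is not the chunker PuzzleFS uses: it is a simplified gear-style rolling hash with no minimum or maximum chunk size, and the table seed, cut mask and sample data are all arbitrary choices made for this example.

```python
import hashlib
import random

# Hypothetical parameters for the toy chunker.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]
CUT_MASK = 0xFF  # cut when the low 8 hash bits are zero: ~256-byte chunks on average


def chunk(data):
    """Split data at positions chosen by a rolling hash of the recent bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # The hash is never reset, so it depends on at most the last 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & CUT_MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def digests(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}


data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(65536))
edited = original[:5000] + b"INSERTED" + original[5000:]

old, new = digests(chunk(original)), digests(chunk(edited))
print(len(old), "chunks before;", len(old - new), "of them are missing after the edit")
```

Because cut points depend only on nearby bytes, an insertion disturbs the chunks around the edit and the boundaries line up again shortly afterwards. A fixed-size splitter would instead shift every chunk after the insertion point.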

@tych0, @hallyn do you have any thoughts on this?

tych0 commented 1 month ago

Hey, sorry for the delay.

Not sure how well the existing tools would deal with this, since there would be no references to these chunks/blobs from the usual json content descriptors, all the references would be only stored in the PuzzleFS metadata file, which is in capnp format.

I ran into this problem a bunch with tools when I did stacker's squashfs support, and filed stuff like https://github.com/opencontainers/image-spec/pull/816 in support of it. I got it all plumbed through, and hopefully did it in a way that future-proofed it for puzzlefs, so I think a new mime type is a good path forward, especially since stuff like storage and hosting (i.e. the distribution spec) makes it so that you don't have to build tooling for those parts.

Or we could try this https://github.com/project-machine/puzzlefs/issues/55#issuecomment-1355484466.

I think it's reasonable in a vacuum, but you would have to teach other tools (skopeo, dist spec) about this new format, which is kind of annoying.

On the other hand, the OCI format is tightly coupled with the notion of layering

There are two explicit mentions of layering: descriptors and history.

I think that for History, we'll still have this concept: users will build puzzlefs images by individual mutations to them (apt-get install python3, curl https://sh.rustup.rs | sh, cargo build myapp, etc.), which are still "layers". It's just that the underlying fs representation won't be 1:1 with that any more, because it's more efficient. But this idea of "here's the step that generated this delta" is still reasonable, IMO.

So what's left is Descriptors, which, while called Layers in the manifest, could be "just" a list of BlobRefs. Admittedly they're not layers, but the delta is so small, and the amount of work to generate the rest of the tooling is so great, that I would lean towards just re-using the OCI spec here. Maybe we can send some clarifying PRs that "not all OCI images need be layer based" or something?

Thank you for continuing to push on this, it's awesome!

ariel-miculas commented 1 month ago

Thanks for your input, @tych0. So what you're saying is that we should abuse the OCI image manifest specification so that the existing tools will copy the BlobRefs that we need for PuzzleFS. It would look something like this:

  "layers": [
    {
      "mediaType": "application/vnd.puzzlefs.image.rootfs.v1",
      "digest": "sha256:a1d0c75327776413fa0db9ed3adcdbadedc95a662eb1d360dad82bb913f8a1d1",
      "size": 83518086
    },
    {
      "mediaType": "application/vnd.puzzlefs.image.inodes.v1",
      "digest": "sha256:a1d0c75327776413fa0db9ed3adcdbadedc95a662eb1d360dad82bb913f8a1d1",
      "size": 83518086
    },
    {
      "mediaType": "application/vnd.puzzlefs.image.filedata.v1",
      "digest": "sha256:a1d0c75327776413fa0db9ed3adcdbadedc95a662eb1d360dad82bb913f8a1d1",
      "size": 83518086
    }
  ],

where

When mounting the image, PuzzleFS will parse the list of layers, extract the application/vnd.puzzlefs.image.rootfs.v1 manifest, and then use the information provided there to mount the image. Optionally it could compare the list of BlobRefs from the OCI Image manifest to the list of BlobRefs from the PuzzleFS manifest and metadata layers.
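
The mount-time lookup and the optional cross-check described above could look roughly like this in Python. It is only a sketch: the function names are hypothetical, it assumes exactly one rootfs descriptor per manifest, and decoding the capnp metadata to obtain the PuzzleFS-side digest list is out of scope and taken as an input.

```python
ROOTFS_MEDIA_TYPE = "application/vnd.puzzlefs.image.rootfs.v1"


def find_rootfs_descriptor(manifest):
    """Return the single rootfs descriptor from an OCI image manifest dict."""
    matches = [d for d in manifest["layers"] if d["mediaType"] == ROOTFS_MEDIA_TYPE]
    if len(matches) != 1:
        # Assumption of this sketch: exactly one PuzzleFS manifest per image.
        raise ValueError(f"expected one rootfs descriptor, found {len(matches)}")
    return matches[0]


def compare_blob_refs(manifest, puzzlefs_digests):
    """Compare the OCI-side blob list with the digests PuzzleFS itself refers to.

    puzzlefs_digests is whatever a real implementation extracts from the
    PuzzleFS manifest and metadata; here it is just an iterable of strings.
    Returns (missing_from_oci, unknown_to_puzzlefs), both sorted.
    """
    listed = {d["digest"] for d in manifest["layers"]
              if d["mediaType"] != ROOTFS_MEDIA_TYPE}
    wanted = set(puzzlefs_digests)
    return sorted(wanted - listed), sorted(listed - wanted)
```

Two empty lists from `compare_blob_refs` would mean that the two descriptions of the image agree, which is the consistency check that the duplication makes necessary.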

The main advantages would be compatibility with existing tools and decoupling the PuzzleFS Merkle tree structure from the OCI Image Manifest. The disadvantage is that we are duplicating the information in two places and formats: once in the OCI Image manifest, and once in the PuzzleFS manifest and PuzzleFS metadata layers.

Did I get this right? @mikemccracken @raharper @rchincha any thoughts on this?

tych0 commented 1 month ago

Did I get this right?

Heh, I don't think I quite got it right, I had forgotten that you needed mime types for the layers. It seems like a bit of a hack, but yes, that's what I had in mind.

(Is there a reason inodes is not part of rootfs?)

ariel-miculas commented 1 month ago

I think this was the original design even when we had cbor serialization. And we do have layers in PuzzleFS right now, and that's another thing to consider when designing the OCI format of PuzzleFS.

We could include the entire PuzzleFS metadata in one single capnp file, that way we'll only have application/vnd.puzzlefs.image.rootfs.v1 and application/vnd.puzzlefs.image.filedata.v1.
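
With only those two media types, assembling the OCI image manifest could be sketched in Python as below. The helper names, the choice to list the rootfs descriptor first, and the reuse of the standard OCI config media type are assumptions of this example, not decisions recorded in this thread.

```python
import hashlib

ROOTFS = "application/vnd.puzzlefs.image.rootfs.v1"
FILEDATA = "application/vnd.puzzlefs.image.filedata.v1"


def descriptor(media_type, blob):
    """Build an OCI content descriptor for an in-memory blob."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def build_manifest(config_blob, rootfs_blob, chunk_blobs):
    """Return an OCI image manifest dict that lists every PuzzleFS blob."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor("application/vnd.oci.image.config.v1+json", config_blob),
        # Assumption: the single metadata blob comes first, then the data chunks.
        "layers": [descriptor(ROOTFS, rootfs_blob)]
        + [descriptor(FILEDATA, chunk) for chunk in chunk_blobs],
    }
```

Every blob the image needs then appears as an ordinary descriptor, which is what lets existing tools copy it without understanding the capnp metadata.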

tych0 commented 1 month ago

I think this was the original design even when we had cbor serialization.

Definitely a mistake then :).

And we do have layers in PuzzleFS right now, and that's another thing to consider when designing the OCI format of PuzzleFS.

Yeah, it's a good point. It's almost as if OCI's "layers" is just transport for bits, and we want to allow images to have more than just the OCI's version of Metadata, Config, and Layers.

I suppose another option is that we could add pointers as Annotations on metadata, but then tools will not automatically transport them. IMO the way you have it above is probably the best because we can use existing tooling, even if it is slightly confusing.

We could include the entire PuzzleFS metadata in one single capnp file, that way we'll only have application/vnd.puzzlefs.image.rootfs.v1 and application/vnd.puzzlefs.image.filedata.v1.

that sounds reasonable to me.

ariel-miculas commented 2 weeks ago

We should add a skopeo copy integration test and then we can close this issue.