project-machine / puzzlefs

Apache License 2.0
378 stars 18 forks source link

Change the metadata serialization from cbor to capnproto #97

Closed ariel-miculas closed 1 year ago

ariel-miculas commented 1 year ago

Capnproto [1] is a fast data interchange format which uses zero-copy serialization, meaning there is no encoding/decoding step.

Advantages over cbor:

Disadvantages:

Other changes:

For a Ubuntu 20.04 distribution (single layer puzzlefs image) the amount of space taken by the metadata is: Cbor serialization: puzzlefs manifest: 869K puzzlefs metadata layer: 4.2M

Capnproto serialization: puzzlefs manifest: 662K puzzlefs metadata layer: 3.8M

https://capnproto.org/ [1] https://capnproto.org/encoding.html#structs [2] https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html [3] https://lore.kernel.org/rust-for-linux/20230609063118.24852-1-amiculas@cisco.com/T/#m22c3db8ba98f2905204d587563ef116a97ce0415 [4]

Fixes #18

ariel-miculas commented 1 year ago

@tych0 I thought you might have an input on this

tych0 commented 1 year ago

Big +1 from me, all of your disadvantages listed are just hacks I threw in to get something going. As Kent says, it's always possible the language/library ergonomics improve, so I don't think these conversion methods are a huge drawback right now.

The no-optional-fields thing seems a little weird... it feels like filesystems/databases/formats etc. could definitely have optional fields in some cases. But this can mostly be handled all in one place in your to_/from_ methods, so it doesn't seem too annoying. I'd want to keep the optional types actually Option-al in the Rust code, though.

BlobRef now always holds a digest, it never refers to an offset in the same file

It's been a while, but does this mean that BlobRef is redundant?

ariel-miculas commented 1 year ago

I think I wasn't clear when I said there is no support for optional fields. In a way, all the fields are optional because there is no way to mark a field as "required". This makes things a bit convoluted, because calling "get" on a missing field still returns its default value, even if it is actually missing. So the validation code has to call "has_" to actually check whether the field is present or not. See "How do I make a field “required”, like in Protocol Buffers?" and "How do I make a field optional?" in https://capnproto.org/faq.html for details.

BlobRef is used to address file data, using a digest to identify the "data chunk" (produced by the FastCDC algorithm) and an offset in this chunk. It was also used to refer to an offset in the same file, but this is no longer required since Capnproto handles this. So that was the part that I've removed from BlobRef.

tych0 commented 1 year ago

I see, makes sense, thanks!

ariel-miculas commented 1 year ago

@koverstreet I thought you might be interested in this PR which changes the metadata format from CBOR to Capnproto (as you suggested). This is a fuse implementation of puzzlefs and I'm now working on getting Capnproto into the Linux kernel. I expect the API related to Capnproto to be more or less the same in the kernel (except I won't regenerate the rust code using a build script, since that introduces a dependency on the capnp binary and I also don't expect frequent changes to the data format).

koverstreet commented 1 year ago

Really happy to see this!

Option<> is another thing I'd love to see in Cap'n Proto. It's got tagged unions which are pretty close; it should be possible to map these to an Option<>

ariel-miculas commented 1 year ago

Since a structure is optional in Capnproto, isn't it semantically an Option? For primitive fields you can't get an Option, since they can't be missing.

koverstreet commented 1 year ago

Well, structures are optional due to an implementation detail; they have to be behind pointers so they can have fields added. If we add frozen types (from Swift) to Cap'n Proto, we'll be able to have structs that aren't behind a pointer but can't be changed.

Better question: do we want to expose that aspect of structs, when we do language level types? We might have to, because you can't take a reference to something that doesn't exist, but I don't think we want to if we can avoid it - it's the NULL problem all over again; the user should be able to decide which fields are Options<> and which aren't when designing their types.

tych0 commented 1 year ago

Better question: do we want to expose that aspect of structs, when we do language level types? We might have to, because you can't take a reference to something that doesn't exist, but I don't think we want to if we can avoid it - it's the NULL problem all over again; the user should be able to decide which fields are Options<> and which aren't when designing their types.

Absolutely agree here. I'd be curious about lending a hand to do any of the implementation stuff to make it work, though I have no idea what's involved.

koverstreet commented 1 year ago

I think there's three main angles to work on.

There's doing up-front bounds checking so that get/set can't return errors - that can be done purely within the Rust capn'p library, so shouldn't be too hard.

For proper language-level types, we need to talk to Rust compiler people about computed properties and find out if this is already being explored or if there's someone working in that area of the compiler that could lend a hand. Miguel?

Lastly, I think we'll want to look at extending the Capn'p spec for frozen types, and perhaps also supporting real (ADT) enums better. If we get started on the first two I'll start talking to people and writing something up.

hallyn commented 1 year ago

So, this is a breaking change, right? Prior versions can't mount images created by new version, and vice versa?

ariel-miculas commented 1 year ago

Yes, it's a breaking change.

hallyn commented 1 year ago

So PUZZLEFS_IMAGE_MANIFEST_VERSION should be bumped?

ariel-miculas commented 1 year ago

In theory yes, and we should also bump the versions in Cargo.toml. In practice, I'm not sure, since we don't have any users of puzzlefs.

hallyn commented 1 year ago

In theory yes, and we should also bump the versions in Cargo.toml. In practice, I'm not sure, since we don't have any users of puzzlefs.

Well, that you know of :)

But, sounds good. Thanks.