secure-systems-lab / dsse

A specification for signing methods and formats used by Secure Systems Lab projects.
https://dsse.dev
Apache License 2.0

Reducing overhead for payload encoding #63

Open AdamZWu opened 11 months ago

AdamZWu commented 11 months ago

The payload field is currently defined as base64 encoded data, which is a reasonable choice for holding arbitrary data.

However, when the payload content is already a well-formed text string, the 33% size increase induced by the base64 encoding starts to feel a bit costly (see https://github.com/in-toto/attestation/issues/289).

Could DSSE offer an "unencoded" mode, where users can directly put raw text string in the payload?

Or are there other alternatives / recommendations?
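A minimal sketch of the overhead in question, using only the Python standard library. The statement below is a made-up, in-toto-style example (the file name and digest are placeholders), not data from a real attestation.

```python
import base64
import json

# Hypothetical in-toto-style statement; the name and digest are placeholders.
statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [{"name": "app.bin", "digest": {"sha256": "ab" * 32}}],
}

# The payload as it would be signed: compact JSON encoded as UTF-8.
raw = json.dumps(statement, separators=(",", ":")).encode("utf-8")

# The payload as it appears in a JSON DSSE envelope today.
encoded = base64.b64encode(raw)

# Base64 emits 4 output characters for every 3 input bytes (rounded up),
# so the overhead approaches one third for anything but tiny payloads.
overhead = len(encoded) / len(raw) - 1
print(f"raw={len(raw)} bytes, base64={len(encoded)} bytes, overhead={overhead:.1%}")
```

The overhead is independent of the payload content, which is why it is most noticeable when the payload was already valid text.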

MarkLodato commented 10 months ago

IMO this needs to be fleshed out more.

AdamZWu commented 10 months ago

Oh good call!

As another alternative, the bundle format selected by in-toto, JSON lines, also offers a compression mode.

trishankatdatadog commented 10 months ago

Cc @dstufft who has an interest in this subject and has run a bunch of experiments

dstufft commented 10 months ago

These are my unedited notes when I was looking into this previously for another project:

I’ve only skimmed the DSSE repository, but it looks like using DSSE turns all of the TUF metadata “opaque” anyways, by wrapping the signed data into a base64 blob that gets stuck in an envelope (could be json, could be something else).

If that’s the case, there’s very little reason to stick with JSON, since the primary benefit is human readability, and we should definitely explore more compact options, typically binary options. We could try CBOR, Protobuf, Smile, BSON, MessagePack, or Ion and see what kind of results we get.

We could also consider compression here as well.

TUF is defined in a serialization independent format, which allows particular applications to select the serialization format that makes the most sense for them, so we need to settle on a serialization format that makes sense for us.

Our Constraints and Goals:

  • Serialized size should be as small as reasonably possible.
  • Deserialization must be both memory and CPU efficient.
  • Deserialization must be able to be done in pure Python or using the standard library.
  • For convenience, we are ignoring how signatures themselves are constructed for this.
  • We cannot assume that we have control over the deployed version of either the serializer or the deserializer, and we need to maintain as much forwards and backwards compatibility as possible.
  • We can consider compression of the serialized content.

Options considered:

  • “Canonical JSON”, as the format initially used by TUF.
    • Pro: Can be implemented using nothing but the standard library.
    • Pro: Fully human readable.
    • Pro: No schema to synchronize between consumers and producers.
    • Con: Uses a canonicalization scheme for signing, which is more fragile than traditional signing envelopes.
  • DSSE + JSON
    • Pro: Can be implemented using nothing but the standard library.
    • Pro: No schema to synchronize between consumers and producers.
    • Pro: Traditional signing envelope that is misuse resistant.
    • Con: Envelope treats the inner payload as binary, which makes the non envelope content opaque until after opening the envelope.
  • DSSE + MessagePack
    • Pro: No schema to synchronize between consumers and producers.
    • Pro: Traditional signing envelope that is misuse resistant.
    • Pro: Binary serialization, can represent binary values without using base64 encoding.
    • Con: Requires a library with a C extension for speed, but it does have a Pure Python fallback available.
  • DSSE + Ion:
    • Pro: No schema to synchronize between consumers and producers.
    • Pro: Traditional signing envelope that is misuse resistant.
    • Pro: Binary and Text serialization are both available, allowing tuning between human readable or binary compactness.
    • Con: Requires a library with a C extension for speed, but it does have a pure Python fallback.
    • Con: Packaging appears to be fetching the library by downloading it from a URL in the setup.py.
  • DSSE + CBOR:
    • Pro: No schema to synchronize between consumers and producers.
    • Pro: Traditional signing envelope that is misuse resistant.
    • Pro: Binary serialization, can represent binary values without using base64 encoding.
    • Con: Requires a library with a C extension for speed, but it does have a Pure Python fallback available.
  • DSSE + Protobuf
    • Pro: Traditional signing envelope that is misuse resistant.
    • Pro: Binary serialization, can represent binary values without using base64 encoding.
    • Pro: Able to be backwards and forwards compatible.
    • Con: Compatibility requires being somewhat careful when evolving the schema.
    • Con: Requires distributing a schema (in the form of a .proto file) to producers and consumers.
    • Con: Requires a library with a C extension for speed, but it does have a Pure Python fallback available.

I was specifically looking at TUF, and I was looking primarily at the snapshot role since that role will almost certainly be the largest file in my particular use case (TUF on PyPI), and I created a sort of torture test with ~500k delegations. You can see the actual files of that at dstufft/tuf-serialization (requires git-lfs) but the results basically end up looking like this:

❯ ls -lhR output
output/root:
Permissions Size User    Group   Date Modified    Git Name
.rw-r--r--  3.5k dstufft dstufft 2023-10-17 10:52  --  root.canonical.json
.rw-r--r--  1.2k dstufft dstufft 2023-10-17 10:52  --  root.canonical.json.br
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.canonical.json.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.canonical.json.xz
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.canonical.json.zst
.rw-r--r--  3.0k dstufft dstufft 2023-10-17 10:52  --  root.dsse.cbor
.rw-r--r--  1.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.cbor.br
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.cbor.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.cbor.xz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.cbor.zst
.rw-r--r--  2.8k dstufft dstufft 2023-10-17 10:52  --  root.dsse.ionb
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.ionb.br
.rw-r--r--  1.5k dstufft dstufft 2023-10-17 10:52  --  root.dsse.ionb.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.ionb.xz
.rw-r--r--  1.5k dstufft dstufft 2023-10-17 10:52  --  root.dsse.ionb.zst
.rw-r--r--  4.0k dstufft dstufft 2023-10-17 10:52  --  root.dsse.iont
.rw-r--r--  2.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.iont.br
.rw-r--r--  2.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.iont.gz
.rw-r--r--  2.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.iont.xz
.rw-r--r--  2.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.iont.zst
.rw-r--r--  4.1k dstufft dstufft 2023-10-17 10:52  --  root.dsse.json
.rw-r--r--  2.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.json.br
.rw-r--r--  2.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.json.gz
.rw-r--r--  2.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.json.xz
.rw-r--r--  2.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.json.zst
.rw-r--r--  3.5k dstufft dstufft 2023-10-17 10:52  --  root.dsse.jsont
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.jsont.br
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.jsont.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.jsont.xz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.jsont.zst
.rw-r--r--  3.0k dstufft dstufft 2023-10-17 10:52  --  root.dsse.msgpack
.rw-r--r--  1.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.msgpack.br
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.msgpack.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.msgpack.xz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.msgpack.zst
.rw-r--r--  2.6k dstufft dstufft 2023-10-17 10:52  --  root.dsse.proto
.rw-r--r--  1.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.proto.br
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.proto.gz
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.proto.xz
.rw-r--r--  1.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.proto.zst
.rw-r--r--  2.7k dstufft dstufft 2023-10-17 10:52  --  root.dsse.sionb
.rw-r--r--  1.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.sionb.br
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.sionb.gz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.sionb.xz
.rw-r--r--  1.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.sionb.zst
.rw-r--r--  4.0k dstufft dstufft 2023-10-17 10:52  --  root.dsse.siont
.rw-r--r--  2.2k dstufft dstufft 2023-10-17 10:52  --  root.dsse.siont.br
.rw-r--r--  2.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.siont.gz
.rw-r--r--  2.4k dstufft dstufft 2023-10-17 10:52  --  root.dsse.siont.xz
.rw-r--r--  2.3k dstufft dstufft 2023-10-17 10:52  --  root.dsse.siont.zst

output/snapshot:
Permissions Size User    Group   Date Modified    Git Name
.rw-r--r--   75M dstufft dstufft 2023-10-17 10:52  --  snapshot.canonical.json
.rw-r--r--   32M dstufft dstufft 2023-10-17 10:54  --  snapshot.canonical.json.br
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:52  --  snapshot.canonical.json.gz
.rw-r--r--   33M dstufft dstufft 2023-10-17 10:53  --  snapshot.canonical.json.xz
.rw-r--r--   33M dstufft dstufft 2023-10-17 10:55  --  snapshot.canonical.json.zst
.rw-r--r--   65M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.cbor
.rw-r--r--   31M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.cbor.br
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.cbor.gz
.rw-r--r--   33M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.cbor.xz
.rw-r--r--   34M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.cbor.zst
.rw-r--r--   55M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.ionb
.rw-r--r--   30M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.ionb.br
.rw-r--r--   36M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.ionb.gz
.rw-r--r--   32M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.ionb.xz
.rw-r--r--   33M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.ionb.zst
.rw-r--r--   94M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.iont
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.iont.br
.rw-r--r--   45M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.iont.gz
.rw-r--r--   36M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.iont.xz
.rw-r--r--   40M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.iont.zst
.rw-r--r--  101M dstufft dstufft 2023-10-17 10:55  --  snapshot.dsse.json
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:58  --  snapshot.dsse.json.br
.rw-r--r--   45M dstufft dstufft 2023-10-17 10:55  --  snapshot.dsse.json.gz
.rw-r--r--   36M dstufft dstufft 2023-10-17 10:56  --  snapshot.dsse.json.xz
.rw-r--r--   40M dstufft dstufft 2023-10-17 10:59  --  snapshot.dsse.json.zst
.rw-r--r--   81M dstufft dstufft 2023-10-17 10:59  --  snapshot.dsse.jsont
.rw-r--r--   32M dstufft dstufft 2023-10-17 11:02  --  snapshot.dsse.jsont.br
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:59  --  snapshot.dsse.jsont.gz
.rw-r--r--   33M dstufft dstufft 2023-10-17 11:00  --  snapshot.dsse.jsont.xz
.rw-r--r--   33M dstufft dstufft 2023-10-17 11:03  --  snapshot.dsse.jsont.zst
.rw-r--r--   65M dstufft dstufft 2023-10-17 11:03  --  snapshot.dsse.msgpack
.rw-r--r--   31M dstufft dstufft 2023-10-17 11:06  --  snapshot.dsse.msgpack.br
.rw-r--r--   37M dstufft dstufft 2023-10-17 11:03  --  snapshot.dsse.msgpack.gz
.rw-r--r--   33M dstufft dstufft 2023-10-17 11:03  --  snapshot.dsse.msgpack.xz
.rw-r--r--   34M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.msgpack.zst
.rw-r--r--   57M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.proto
.rw-r--r--   31M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.proto.br
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.proto.gz
.rw-r--r--   32M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.proto.xz
.rw-r--r--   35M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.proto.zst
.rw-r--r--   55M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.sionb
.rw-r--r--   30M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.sionb.br
.rw-r--r--   36M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.sionb.gz
.rw-r--r--   32M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.sionb.xz
.rw-r--r--   33M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.sionb.zst
.rw-r--r--   94M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.siont
.rw-r--r--   37M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.siont.br
.rw-r--r--   45M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.siont.gz
.rw-r--r--   36M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.siont.xz
.rw-r--r--   40M dstufft dstufft 2023-10-17 10:26  --  snapshot.dsse.siont.zst

I've gone ahead and added what I think this issue is proposing, which is serializing the DSSE payload as a utf8 text string when it's JSON (that's .jsont). I've left the signatures themselves encoded as base64 in this case since they are proper binary data (though it's possible there's a more compact encoding that could be used than base64, but there's only a few signatures so it's pretty low value to find a better serialization for the signatures).

As you can see, the current DSSE serialization, when used from JSON, ends up producing a 101M TUF snapshot, but if you compress that it drops down to 36-45M. Without compression the Ion binary encoding is the smallest at 55M; it's also the smallest compressed, but by a much smaller margin.

The proposed utf8 encoding reduces the snapshot role from 101M down to 81M, and likewise there is a decrease in compressed size as well.
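A small sketch of the same comparison on synthetic data, assuming a JSON envelope where the text variant stores the payload as an ordinary JSON string. The field name `payloadUtf8`, the payload type, and the file list are all placeholders for illustration; this is not the code used to produce the numbers above.

```python
import base64
import gzip
import hashlib
import json

# Synthetic payload: a JSON document listing 200 made-up files and digests.
payload = json.dumps(
    {"files": [{"name": f"file-{i}.bin",
                "sha256": hashlib.sha256(f"file-{i}".encode()).hexdigest()}
               for i in range(200)]},
    separators=(",", ":"),
)
payload_type = "application/json"  # placeholder payload type

# Current DSSE JSON envelope: the payload is base64 of the UTF-8 bytes.
b64_envelope = json.dumps({
    "payloadType": payload_type,
    "payload": base64.b64encode(payload.encode("utf-8")).decode("ascii"),
    "signatures": [],  # omitted here; identical in both variants
})

# Text variant: the payload is a JSON string ("payloadUtf8" is a hypothetical name).
text_envelope = json.dumps({
    "payloadType": payload_type,
    "payloadUtf8": payload,
    "signatures": [],
}, ensure_ascii=False)

# Print uncompressed and gzip-compressed sizes for both variants.
for label, envelope in [("base64", b64_envelope), ("utf8", text_envelope)]:
    data = envelope.encode("utf-8")
    print(label, len(data), len(gzip.compress(data)))
```

The text variant still pays for JSON string escaping (one extra byte per quote or backslash in the payload), so the saving depends on how escape-heavy the payload is.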

Sorry for the brain dump, but hopefully this is useful in some way?

dstufft commented 10 months ago

Oh, one other note: a possibly interesting property here is that the proposed JSON + UTF8 encoding, when compressed, isn't the absolute smallest in the snapshot test, but it's closer than the current JSON + Base64 is, and on the smaller root role test it's basically tied for the smallest.

To me, that ends up representing a really nice trade-off, because the other serialization options, while generally available, are not nearly as ubiquitous as JSON is, and if you're worried about space constraints, the fact that wrapping the entire thing in any of the very ubiquitous compression algorithms brings this JSON + UTF8 encoding scheme in line with the smallest of the other options is a strong incentive to use it.

One thing of note is that my tests above assume you're using the same serialization scheme for both the payload and the DSSE envelope, but you're only compressing the final output of the DSSE. Arguably you might want to compress just the payload since it's bound to be the largest part of the DSSE output, and that would mean that you can validate the signatures prior to decompressing (decompressing first should be safe in DSSE, I think, but I'm always nervous around cryptography + compression).

trishankatdatadog commented 10 months ago

Excellent, thank you @dstufft! Before we decide on anything here, I'd love to see: (1) requirements, (2) constraints, and (3) numbers (like Donald produced) to back up findings.

AdamZWu commented 10 months ago

Arguably you might want to compress just the payload since it's bound to be the largest part of the DSSE output

@dstufft: maybe this is a bit out of scope strictly for DSSE, but in the in-toto thread I described a rationale for why whole-DSSE compression (deferred to the upper-level context) could practically work out better than payload-only compression: in-toto has selected JSON Lines as the attestation bundle format, and a bundle will likely contain multiple attestations of various kinds, all for the same set of artifacts. So every in-toto statement's "subject" array will contain a copy of identical content (though possibly not in the same presentation, since in-toto has no ordering requirement for subjects and JSON dict serialization does not guarantee ordering).

(And for some builds that generate 1000s of files or more, the "subject" array constitutes the major bulk of the attestation size.)

If we compress the payload, or compress the DSSE envelope piece-wise:

  1. The compression will not have the context/visibility of the "subject" duplication of other envelope payloads;
  2. The resulting compressed data may be scrambled such that the "subject" duplication is no longer evident at the bundle-level, hence deterring bundle-level compression from gaining effective reduction.

If we were to embrace bundle level compression, I think the best DSSE can do is to preserve the original data as much as possible. To that end, UTF8 encoding works better than Base64 because there is less "clobbering" (that is, when the payload is already UTF8-compatible).

For example, there are two attestations in a bundle, both for the same set of two files:

UTF8 encoding will allow a compression algorithm to easily discover the duplicated information; However, after Base64 encoding, the data will look like:

It would take a really smart compression algorithm to tell that the middle parts actually contain the same information. (I did a quick test throwing the original subjects and the base64 encoded ones at gzip, xz, and lzma; in all cases the compressed base64 is ~2x the size of the compressed original, indicating those algorithms failed to detect the underlying redundancy -- probably not a very realistic test, just to illustrate the problem 😄 ).
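The effect can be reproduced with a small sketch along these lines. Everything here is an assumption made for illustration: the 100 subjects are synthetic, the predicate type URIs are placeholders, `payloadUtf8` is a hypothetical field name, and the two statements list their subjects in the same order (the most favourable case for the compressor). The two predicate types differ in length by one byte, so the shared subject list lands at a different base64 alignment in each envelope.

```python
import base64
import gzip
import hashlib
import json

# 100 made-up subjects shared by both attestations.
subjects = [{"name": f"out/file-{i}.o",
             "digest": {"sha256": hashlib.sha256(f"file-{i}".encode()).hexdigest()}}
            for i in range(100)]

def statement(predicate_type: str) -> str:
    # predicateType is serialized first, so its length shifts the subject bytes.
    return json.dumps({"predicateType": predicate_type, "subject": subjects},
                      separators=(",", ":"))

payloads = [statement("https://example.com/build/v1"),
            statement("https://example.com/test/v1")]

def bundle(encode) -> bytes:
    # JSON Lines style bundle: one (unsigned) envelope per line.
    return "\n".join(json.dumps(encode(p)) for p in payloads).encode("utf-8")

as_utf8 = bundle(lambda p: {"payloadUtf8": p})
as_b64 = bundle(lambda p: {"payload": base64.b64encode(p.encode("utf-8")).decode("ascii")})

# Compress each bundle as a whole, as a bundle-level compressor would.
gz_utf8 = len(gzip.compress(as_utf8))
gz_b64 = len(gzip.compress(as_b64))
print(f"gzip(utf8 bundle)={gz_utf8}  gzip(base64 bundle)={gz_b64}")
```

In the text bundle the second copy of the subject list is a byte-for-byte repeat that an LZ77-style compressor can reference; in the base64 bundle the one-byte shift changes every encoded character, so the repeat is no longer visible.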

trishankatdatadog commented 10 months ago

@AdamZWu I would not conflate in-toto Bundle-level compression with in-toto Attestation-level compression, so if you're looking for the former, I argue that it is out of the scope of this project.

AdamZWu commented 10 months ago

@trishankatdatadog given the interactions between DSSE and in-toto, I think it is not completely out of scope.

Since in-toto attestation (and bundle) is a major user of DSSE, I think helping in-toto efficiently reduce bundle size is in both specs' best interest. And for that, my current thinking is to add a "utf-8 encoding" mode for payload, which allows minimum "clobbering" for payload that is already in utf-8 encoding (which is always the case for in-toto).

Doing so would allow bundle level compression to more effectively discover data duplication across multiple envelopes (see my above post), which is one of the big sources of bloat in an attestation bundle.

MarkLodato commented 10 months ago

OK, I'm fairly convinced by the fact that the compressed UTF-8-encoded JSON is only ~5% larger than the compressed proto encoding, while compressed base64-encoded JSON is ~20-80% larger than compressed proto.

root.dsse.json.gz    2.3k  (~75% overhead)
root.dsse.jsont.gz   1.4k  (~5% overhead)
root.dsse.proto.gz   1.3k

snapshot.dsse.json.gz   45M  (~20% overhead)
snapshot.dsse.jsont.gz  37M  (~0% overhead)
snapshot.dsse.proto.gz  37M

It would still be nice to gather a larger corpus of real-world data (not just two files) and do the comparison, but assuming it holds, then that's fairly compelling.

I'm assuming we'd add a new field like this?

message Envelope {
  // Message to be signed.
  // REQUIRED.
  oneof payload_encoding {
    // Raw bytes. In JSON, this is encoded as base64.
    bytes payload = 1;
    // Unicode string, where the signed byte stream is the UTF-8 encoding. In JSON, this is a regular unicode string.
    string payloadUtf8 = 4;
  }
}

Note: The signature algorithm does not change at all, and existing signatures could be re-encoded with this new field without invalidating them.
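A sketch of why the signature is unaffected, assuming a consumer that accepts either field. The signed bytes come from DSSE's pre-authentication encoding (PAE) of the payload type and the raw payload bytes, so how the envelope carries those bytes does not enter into the signature. `payload_bytes` is a hypothetical helper, and `payloadUtf8` is the proposed field, not part of the accepted spec.

```python
import base64

def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the byte string that is actually signed."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)

def payload_bytes(envelope: dict) -> bytes:
    """Recover the signed payload bytes from either envelope encoding (hypothetical helper)."""
    if "payloadUtf8" in envelope:  # proposed field, not in the accepted spec
        return envelope["payloadUtf8"].encode("utf-8")
    return base64.b64decode(envelope["payload"])

# The same message carried in the existing and the proposed encoding.
old = {"payloadType": "http://example.com/HelloWorld",
       "payload": base64.b64encode(b"hello world").decode("ascii")}
new = {"payloadType": "http://example.com/HelloWorld",
       "payloadUtf8": "hello world"}

signed_old = pae(old["payloadType"], payload_bytes(old))
signed_new = pae(new["payloadType"], payload_bytes(new))
print(signed_old == signed_new)  # both envelopes yield identical bytes to verify
```

A consumer that only knows the `payload` field would find no payload in the new envelope at all, which is the backwards-compatibility question rather than a signature question.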

MarkLodato commented 10 months ago

Adam, do you want to put together a PR that implements this? I think you'd need to edit the proto, the envelope.md, and other references to the payload. (I don't think protocol needs to be updated.)

The other question here is backwards compatibility. Old consumers won't be able to read DSSEs with the new field. I don't see any security concern, but I don't know how best to roll this out.

(Edit: To clarify, I'm not saying this is accepted by DSSE, rather that it's helpful to have a concrete proposal that we can discuss for making a group decision.)

AdamZWu commented 10 months ago

Sounds good. I will put something up for review next week. :D

trishankatdatadog commented 10 months ago

Doing so would allow bundle level compression to more effectively discover data duplication across multiple envelopes (see my above post), which is one of the big sources of bloat in an attestation bundle.

Adam, are you arguing that Attestation-level compression will automatically help with Bundle-level compression? If so, then yes, I agree. However, the two levels of compression are distinct from each other.

AdamZWu commented 10 months ago

Doing so would allow bundle level compression to more effectively discover data duplication across multiple envelopes (see my above post), which is one of the big sources of bloat in an attestation bundle.

Adam, are you arguing that Attestation-level compression will automatically help with Bundle-level compression? If so, then yes, I agree. However, the two levels of compression are distinct from each other.

Not exactly.

Yes, what DSSE does to payload will definitely affect bundle-level compression performance. But it looks to me the more complex processing DSSE does, probably the worse the bundle-level compression will perform, because complex mutations will hide cross-envelope data duplication. So I think DSSE could offer a mode that does less, e.g. allowing UTF-8 encoding for payload, so that JSON serialized in-toto statement which is already in UTF-8 can be presented pretty much unchanged (except for JSON string escapes).

And yes, DSSE's payload encoding is completely orthogonal to in-toto attestation bundle compression. Just some encoding (e.g. UTF-8) is much friendlier to bundle-level compression than the other (e.g. base64).

trishankatdatadog commented 10 months ago

Yes, what DSSE does to payload will definitely affect bundle-level compression performance. But it looks to me the more complex processing DSSE does, probably the worse the bundle-level compression will perform, because complex mutations will hide cross-envelope data duplication. So I think DSSE could offer a mode that does less, e.g. allowing UTF-8 encoding for payload, so that JSON serialized in-toto statement which is already in UTF-8 can be presented pretty much unchanged (except for JSON string escapes).

Oh, I see, thanks for the clarification! Hmm, now I'm curious about how the Attestation-level choice of payload encoding would affect the efficacy of Bundle-level compression. As you suggest, UTF-8 should work better for this.