Consider padding the payload

alexmturner commented 1 year ago

See https://github.com/patcg-individual-drafts/private-aggregation-api#padding

alexmturner commented 1 year ago

Note that the payload being padded is encrypted and only processable by the aggregation service. However, if debug mode is enabled, the cleartext is available; so the main backwards compatibility concern comes from developers using that field. To make feature detection simpler, we could update the version field while making this change.

alexmturner commented 1 year ago

Exploring the options we have for padding the aggregatable payload. Note that the current encoding is specified here.

Append null contributions to a fixed number

One way to mitigate the issue above is to simply add null contributions (i.e. contributions with a value of zero) to the payload until the total number of contributions is fixed at the maximum.

Pros
- No changes needed on the server (although changes are still required if the payload encoding changes)
Cons
- Need to keep the payload encoding having a constant per-contribution size
- The operation field is unprotected
- Potentially fragile: CBOR is not designed to necessarily have a fixed length encoding
- Unclear: performance cost to parsing/filtering out the null entries?

[Preferred] Append bytes to a fixed length

We could instead calculate some maximum size of the payload and pad the CBOR encoding out to that maximum size. Then, when the aggregation service processes the payload, it would first strip out these extra bytes.

The calculation for the number of additional bytes needed for this padding depends on the encoding choices made, but we expect to need up to around 500 bytes of padding in the typical case. If we allow setting the maximum number of contributions up to 50, this could go up to around 1.2 kB. The addition of other operations could expand this significantly, but that is out of scope for now.

There are a few different options for this padding scheme:

ISO/IEC 7816-4

One byte of 0x80 is appended. After that, as many 0x00 (null) bytes are appended as needed to reach the desired length.

Pros
- Supports arbitrarily large amounts of padding
Cons
- May be less performant: linear search required to find the padding boundary

Extension to ANSI X9.23

While we can't use ANSI X9.23 (as it only supports up to 255 bytes of padding), we could extend the concept of ANSI X9.23 to larger padding values. For this, we would add as many 0x00 (null) bytes as needed to reach the desired length. Then, replace the last two bytes with a representation of the count of bytes added as a 16-bit big-endian unsigned integer. (Note that no fewer than two bytes can be used to pad in this case.)

This extension would support up to 64 KiB of padding. While this is more than we need currently, we could also consider using the last four bytes to represent a 32-byte integer; this approach would support up to 4 GiB of padding.

Pros
- May be more performant: only need to read the last two bytes to truncate to the appropriate length
- Supports sufficiently large amounts of padding
Cons
- Custom extension, not an existing standard (to my knowledge)

Some other encoding scheme?

csharrison commented 1 year ago

Are we proposing to make this change along with a change to the CBOR structure to optimize payload size for normal contributions as well? That might reveal a preferred encoding. Note that the linear scan for ISO/IEC 7816-4 is amortized away if you are just consuming contributions one by one until you see 0x80.

alexmturner commented 1 year ago

Yes, we are -- I can try and post some options for encoding optimizations here soon. Although I don't think that should affect the choice of padding (except for needing to keep the constant per-contribution size if go with appending null contributions).

Re: detecting 0x80, we'd have to be careful to distinguish 0x80 appearing in the CBOR serialization vs after it. (And, technically, I think we'd also want to scan the remaining bytes in case there's a later 0x80.) In theory, I think it should be possible to detect when the serialization ends as CBOR encodes how long each map/array should be. But we'd need to ensure that the CBOR parsers used do support this (e.g. Chromium's currently treats trailing bytes as errors).

alexmturner commented 1 year ago

Padding will cause the size of the payload to increase. To mitigate this increase, I've put together some options for improving the efficiency of the payload encoding. Making both changes at the same time would avoid multiple aggregation service version changes.

Analysis of the current encoding

The current encoding is specified here.

// CBOR
{
  "operation": "histogram",
  "data": [{
    "bucket": <16-byte big-endian bytestring>,
    "value": <4-byte big-endian bytestring> 
  }, ...]
}

The current CBOR encoding uses 27 bytes for an unencrypted payload with no contributions. Encryption adds 48 bytes of overhead (32 bytes for the encapsulated secret and 16 bytes for AEAD overhead), leading to 75 bytes for an encrypted empty payload. Then, each contribution adds an additional 36 bytes.

Summary:

75 bytes for encrypted empty payload
+36 bytes per contribution
For 20 contributions: 795 bytes
For 50 contributions: 1675 bytes

Removing repeated strings

There are a few options for removing the repeated “bucket” and “value” strings.

Replace with nested arrays

We can replace each contribution map with a two-member array:

// CBOR
{
  "operation": "histogram",
  "data": [
      [<16-byte big-endian bytestring>, <4-byte big-endian bytestring>],
      ...]
}

This reduces the per-contribution variable component to 23 bytes.

Summary:

75 bytes for encrypted empty payload
+23 bytes per contribution
For 20 contributions: 535 bytes
For 50 contributions: 1225 bytes

Replace with map

We can replace the array of contribution maps with a single bucket to value map. Note that this requires contributions to the same bucket to be consolidated.

// CBOR
{
  "operation": "histogram",
  "data": {
      <16-byte big-endian bytestring>: <4-byte big-endian bytestring>,
      ...}
}

This reduces the per-contribution variable component to 22 bytes.

Summary:

75 bytes for encrypted empty payload
+22 bytes per contribution
For 20 contributions: 515 bytes
For 50 contributions: 1175 bytes

Replace with flat array

We can replace the array of contribution maps with an even-length array. This is equivalent to the 'Replace with nested arrays' option, but flattening the nested two-item arrays.

// CBOR
{
  "operation": "histogram",
  "data": [
      <16-byte big-endian bytestring>, <4-byte big-endian bytestring>,
      ... (repeated unnested as necessary)]
}

This reduces the per-contribution variable component to 22 bytes. However, it worsens the semantic structure of the encoding.

Summary:

75 bytes for encrypted empty payload
+22 bytes per contribution
For 20 contributions: 515 bytes
For 50 contributions: 1175 bytes

Removing top-level strings

We could also consider reducing the fixed cost, i.e. the size of a payload with no contributions, by removing the “operation” and “data” strings. We could then replace them with an array, e.g.:

// CBOR
[
  "histogram",
  {<16-byte big-endian bytestring>: <4-byte big-endian bytestring>,
      ...} // or any other option above
]

This reduces the fixed size by 15 bytes, i.e. from 27 bytes unencrypted to 12 bytes unencrypted and from 75 bytes encrypted to 60 bytes encrypted. This is a limited enough benefit that it may be preferable to keep the current semantic structure.

Summary:

60 bytes for encrypted empty payload
Bytes per contribution depends on other choices above, likely +22 or +23 per contribution
For 20 contributions: 500 bytes or 520 bytes
For 50 contributions: 1160 bytes or 1210 bytes

Miscellaneous other changes

These changes don't affect the maximum size of the encoding, but we could consider them while making other changes.

Switch value encoding to an integer

If we no longer rely on the encoding of a contribution to be a fixed length, we can consider switching the encoding of the value from a 4-byte big-endian bytestring to a simple integer, e.g.:

// CBOR
{
  "operation": "histogram",
  "data": {
      <16-byte big-endian bytestring>: <regular integer>,
      ...}
}

This would not change the maximum size if we want our encoding to support the same 32-bit integer space, but it may be more ergonomic.

(If we decide the encoding should not support numbers beyond 2^16 − 1, we could reduce the per-contribution size by 2. However, this would also be possible for the current bytestring encoding by reducing it to 2 bytes.)

Removing null contributions

If we choose a fixed length padding scheme, there is no longer a need for null reports to contain a ‘null’ contribution (i.e. with value = 0). Instead, we could simply use an empty array/map of contributions. This would not change the maximum size, but it would mean the payload semantics would better match the contents.

alexmturner commented 9 months ago

While the payload is now padded with null contributions, leaving this open to track the proposal of explicitly padding to a certain number of bytes

alexmturner commented 6 months ago

Opened #116 to track improving the payload size efficiency separately.

patcg-individual-drafts / private-aggregation-api