Request for clarification: array length and mixed encoding

deeglaze commented 5 months ago

I'm noodling through the idea of a modular evidence collection daemon that just runs installed binaries in a certain location and assembles a cmw-collection together to send off, and I'm wondering how much interpretation of the output it would need, assuming all the binaries output a legal CMW of either CBOR or JSON encoding.

When JC<> is used, it's only when the CDDL pun breaks down and you need base64 encoding for binary. If you have your cmw-collection as a CBOR map, and we can use, say, the basename of the evidence collection binary to key the result, but the result is just bytes that are either CBOR or JSON, that doesn't seem to fit into the "cmw is either CBOR or JSON".

It seems that certainly a JSON cmw-collection that contains a CBOR-encoded cmw does not parse unless you're allowed to wantonly swap your interpretation of the CDDL as for JSON or CBOR as you see fit (say with the decapsulation algorithm), given that cmw doesn't have a base64 encoded string alternate to store a CBOR-encoded cmw.

Can we say that

cmw-collection = {
  + cmw-collection-entry-label => cmw
}

should be

cmw-collection = {
  + cmw-collection-entry-label => cmw-decap
}
cmw-decap = JC<jc-cmw, bytes .cbor cmw>
jc-cmw = cmw / base64-cmw
base64-cmw = base64-string .cbor cmw

I'm unsure whether the .cbor control operator may technically be applied to a base64-string: it is defined for byte strings, and "one can use CDDL with JSON by limiting oneself to what can be represented in JSON. Roughly speaking, this means leaving out byte strings".

I don't think we can use bytes .cbor cmw without inserting the major type byte 0x04 before the output of any of the evidence modules, so do we just define a .feature cmw-decap to apply the decap algorithm to arbitrary bytes instead of this JC stuff?
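One way to picture "apply the decap algorithm to arbitrary bytes" is a first-byte sniff. The sketch below is an illustration under assumptions, not the draft's algorithm: the function name `sniff_cmw_encoding` is hypothetical, and it assumes a top-level JSON CMW is always an array or object and a top-level CBOR CMW is always an array, map, or tag.

```python
def sniff_cmw_encoding(data: bytes) -> str:
    """Guess whether serialized CMW bytes are JSON or CBOR from the first byte.

    Hypothetical helper: assumes the input is a complete top-level CMW.
    """
    if not data:
        raise ValueError("empty input")
    first = data[0]
    # A JSON CMW is an array or object, possibly preceded by JSON whitespace.
    if first in b"[{ \t\r\n":
        return "json"
    # The CBOR major type lives in the top three bits of the initial byte:
    # 4 = array, 5 = map, 6 = tag.
    major_type = first >> 5
    if major_type in (4, 5, 6):
        return "cbor"
    raise ValueError(f"unrecognized leading byte 0x{first:02x}")
```

The two ranges do not overlap: as CBOR initial bytes, `[` and `{` would be byte-string and text-string heads, and the JSON whitespace characters would be integers, none of which can start a CMW under the assumptions above.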

deeglaze commented 5 months ago

I would similarly ask for cmw to itself contain cmw-collection, since something producing evidence could itself be amassing evidence, and you end up with a tree rather than having to flatten it and potentially have conflicting keys.

nedmsmith commented 5 months ago

Is this different from cmw-collection containing a cmw / cmw-collection?

nedmsmith commented 5 months ago

I'm wondering how much interpretation of the output it would need, assuming all the binaries output a legal CMW of either CBOR or JSON encoding.

The I-D describes an ASN.1 encoding that distinguishes between cbor and json.

CMWCollection ::= CHOICE {
    json UTF8String,
    cbor OCTET STRING,
}

Is the request to give the same attention to the CDDL description?

deeglaze commented 5 months ago

I'm wondering how much interpretation of the output it would need, assuming all the binaries output a legal CMW of either CBOR or JSON encoding.

The I-D describes an ASN.1 encoding that distinguishes between cbor and json.
CMWCollection ::= CHOICE {
    json UTF8String,
    cbor OCTET STRING,
}
Is the request to give the same attention to the CDDL description?

Not exactly, since I'm asking about mixed representation across values. If I'm accumulating evidence from A and B, where A produces a CBOR CMW, and B produces a JSON CMW, I would rather not do a full translation from one to the other for fear of corrupting the data and taking too much processing time, so I would like to produce a cbor-encoded map {"A" => cbor bytes, "B" => json string}. If A were to produce a CMW collection instead of CMW, since it's an aggregator itself, then I'd like to not have to open it up and re-encode the key => value mappings in the overall CMW collection, since I might have conflicting keys. I also am not particularly sure what .feature "cbor" means–"if I am being interpreted as CBOR" or "If I am to be interpreted as CBOR" or something else, since one depends on context, and the other is a kind of context switching device.

nedmsmith commented 5 months ago

not particularly sure what .feature "cbor" means

See https://www.ietf.org/archive/id/draft-ietf-cbor-cddl-control-02.html I believe int .feature "cbor" is saying the integer is encoded as a cbor integer.

thomas-fossati commented 5 months ago

[a]ssuming all the binaries output a legal CMW of either CBOR or JSON encoding.

[ challenging assumptions 😺 ]

Why don't you let the repackager do the repackager and the providers do the providers? I.e., the binaries return their evidence payload plus the associated media type, and the daemon puts together the CMWs and wraps them in a CBOR or JSON collection.

deeglaze commented 5 months ago

The "plus" there is the problem then, since now I need to specify an output format for the binaries that isn't the standard CMW / CMW-collection. What you're suggesting to me sounds like "binaries return their cmw-array value and cmw-array content type in some format for the repackager to wrap as a CMW." The third component of the cmw-array is just lost or somehow inferred? No, that "some format" to me is a CMW. I don't want plugin authors to have to implement both serialization formats as requested by the repackager to fit the final output type because that just becomes bloat.

To be parsimonious with resources due to static linking overhead, some evidence collectors may want to produce an output that is already a collection of multiple evidence formats. The cc-trusted-api and go-tpm-tools clients collect multiple evidence formats already, so changing the output format for the collection is a nicer thought than splitting each format into a different binary. We therefore have either a CMW or some sort of CMW collection as allowed response types. What format do we specify?

I could say it's a mixed format of JSON or CBOR CMWs given as a definite length array of definite length octet strings for the repackager to decode and repackage into a CMW-collection of its own, but the entry labels would need to be generated somehow.

4 bytes: num_outputs in system endianness
Variable: num_outputs-many output

where an output is
4 bytes: length in system endianness
length: payload bytes
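A minimal sketch of that length-prefixed layout, to make the framing concrete. The function names are hypothetical, and treating the 4-byte fields as unsigned is an assumption; the description above only says "4 bytes ... in system endianness".

```python
import struct


def pack_outputs(outputs: list[bytes]) -> bytes:
    """Serialize payloads as: count, then (length, payload) for each output."""
    # "=I" is a 4-byte unsigned integer in native byte order without padding.
    # Unsigned is an assumption; the layout only specifies the width.
    frame = struct.pack("=I", len(outputs))
    for payload in outputs:
        frame += struct.pack("=I", len(payload)) + payload
    return frame


def unpack_outputs(frame: bytes) -> list[bytes]:
    """Inverse of pack_outputs; rejects truncated or over-long frames."""
    (count,) = struct.unpack_from("=I", frame, 0)
    offset = 4
    outputs = []
    for _ in range(count):
        (length,) = struct.unpack_from("=I", frame, offset)
        offset += 4
        payload = frame[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated payload")
        outputs.append(payload)
        offset += length
    if offset != len(frame):
        raise ValueError("trailing bytes after last output")
    return outputs
```

Each payload stays opaque here, so a CBOR CMW and a JSON CMW can sit side by side without re-encoding; the repackager still has to invent a label for each position.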

To be fair, part of the assumption is that there will be a standard evidence request format and evidence response format. The response doesn't necessarily need to be a CMW or CMW collection, though it'd be nicer if it were. Collecting CMW / CMW-collection outputs in either CBOR or JSON and repackaging them as a single CMW collection leads to the relabeling problem I described in the OP. In JSON, the label could be the provider's name followed by the index of the position in the output; in CBOR, it could just be the index of the position in the output added to the current index in the output CMW. The labels are thus very different. The text describing the labels suggests that they aren't arbitrary but could hold semantic meaning, like a mnemonic. I don't know how that works for the integer format, but it's not explicitly specified to be irrelevant:

Instead, the labels identify a conceptual message that, in the case of a composite Attester, should typically correspond to a component of a system. Labels can be strings or integers that serve as a mnemonic for different conceptual messages in the collection.

A format meant to serve either an aggregate attester OR a CMW aggregator should account for the multiple allowed wire formats. I don't see how to do that without either

a. a tree format for CMW-collection with mixed representation, or
b. changing the current map format to an array to remove any implied meaning for labels.

thomas-fossati commented 5 months ago

[n]ow I need to specify an output format for the binaries that isn't the standard CMW / CMW-collection.

The daemon and its plugins all live under the same system roof, don't they?

Can't each binary return evidence in the native format using something equivalent to:

int collect_evidence(
    const uint8_t nonce[64],
    uint8_t **evidence, size_t *evidence_sz,
    struct media_type *mt
);

thomas-fossati commented 5 months ago

a. a tree format for CMW-collection

this one is trivial:

cmw-collection = {
  + cmw-collection-entry-label => cmw / cmw-collection
}

with mixed representation

I am afraid this one isn't.

nedmsmith commented 5 months ago


cmw-collection = {
  + cmw-collection-entry-label => cmw / cmw-collection
}

+1

thomas-fossati commented 5 months ago

An alternative (put forward by @nedmsmith) is to squash the productions:

cmw = cmw-array / cmw-tag / cmw-collection

deeglaze commented 5 months ago

I still don't see how nesting the trees is going to help if some output trees are JSON and others are CBOR and we need to aggregate into a single format. If we can register a tag that implies the bytes within constitute a JSON CMW-collection, then we can say

cmw = cmw-array / JC<base64-string, cmw-tag> / cmw-collection

A tag in CBOR can swap to JSON, and a base64-string in JSON can be interpreted as a serialized CBOR cmw-collection.

thomas-fossati commented 5 months ago

I still don't see how nesting the trees is going to help if some output trees are JSON and others are CBOR and we need to aggregate into a single format

You are right, it doesn't help with mixtures of formats.

If we can register a tag that implies the bytes within constitute a JSON CMW-collection, then we can say
cmw = cmw-array / JC<base64-string, cmw-tag> / cmw-collection
A tag in CBOR can swap to JSON, and a base64-string in JSON can be interpreted as a serialized CBOR cmw-collection

I am not sure that will work as written, but the gist is clear: we need tunnels :-)

And given the tunnelling is symmetrical we might use a common shape by registering a couple of new media types (since tags are not a JSON thing):

j2c-tunnel = [ "application/cmw-j2c-tunnel", text .b64u json-CMW ]
c2j-tunnel = [ "application/cmw-c2j-tunnel", text .b64u cbor-CMW ]

where:

json-CMW = array / collection
cbor-CMW = tag / array / collection

and the tunneling procedures are:

:mountain_railway: JSON-to-CBOR CMW :mountain_railway:

input: the UTF-8 string with the serialised JSON CMW
apply b64urlsafe encoding
stick it into the second slot of the j2c-tunnel tuple
output: the j2c-tunnel tuple as a CBOR array

:mountain_railway: CBOR-to-JSON CMW :mountain_railway:

input: the byte array with the serialised CBOR CMW
apply b64urlsafe encoding
stick it into the second slot of the c2j-tunnel
output: the c2j-tunnel tuple as a JSON array

deeglaze commented 5 months ago

Why does the JSON to CBOR tunnel need to be base64 encoded? The cmw-array value type is bytes, not text.

thomas-fossati commented 5 months ago

Why does the JSON to CBOR tunnel need to be base64 encoded?

I made it like that purely for aesthetic/symmetry reasons.

deeglaze commented 5 months ago

The symmetry breaks the current cmw-array schema and adds encoding bloat for aesthetics, so I'd recommend keeping it to bytes.

thomas-fossati commented 5 months ago

To be clear, I have no strong opinions. We could do:

j2c-tunnel = [ "application/cmw-j2c-tunnel", text .json json-CMW ]
c2j-tunnel = [ "application/cmw-c2j-tunnel", text .b64u cbor-CMW ]

and change the tunnelling procedure accordingly.

thomas-fossati commented 5 months ago

The symmetry breaks the current cmw-array schema

note that the j2c-tunnel array is not strictly a cmw-array; we can do whatever we want with that without fear of breaking anything.

so I'd recommend keeping it to bytes.

WFM

deeglaze commented 5 months ago

Is the idea for this to not match cmw-array and instead be a different alternate? That changes the parsing complexity from LL(1) to LL(2), no? I would think we have this be a media type for cmw-array and a new cm-type bit for "tunnel"?

thomas-fossati commented 5 months ago

Relevant to this discussion, one of @carl-wallace's WGLC comments in https://mailarchive.ietf.org/arch/msg/rats/xY2mwu790UOGnhFAUduGj5ddo3Y/

"I think this came up relative to the collections draft a while back but I forget how it was handled (and did not go looking just now). How would one encode artifacts that use different encoding types, i.e., a CBOR evidence and a JSON result? The collection concept is analogous to the submodules part of EAT, and that addresses the various nesting possibilities."

thomas-fossati commented 5 months ago

Is the idea for this to not match cmw-array and instead be a different alternate? That changes the parsing complexity from LL(1) to LL(2), no? I would think we have this be a media type for cmw-array and a new cm-type bit for "tunnel"?

The complete (loose) grammar I have in mind is this:

cmw = json-CMW / cbor-CMW

json-CMW = json-array / json-collection
cbor-CMW = cbor-array / cbor-collection / cbor-tag

c2j-tunnel = [ "#cmw-c2j-tunnel", text ]
j2c-tunnel = [ "#cmw-j2c-tunnel", bytes ]

json-array = [ text, text ]
json-collection = { + text => json-CMW / c2j-tunnel }

cbor-array = [ uint / text, bytes ]
cbor-collection = { + (int / text) => cbor-CMW / j2c-tunnel }
cbor-tag = #6.<0..18446744073709551615>(bytes)

Updated according to https://github.com/ietf-rats-wg/draft-ietf-rats-msg-wrap/issues/55#issuecomment-1934673175

deeglaze commented 5 months ago

I think we do have to specify that the media type in a cmw-array cannot be "application/cmw-j2c-tunnel" (conversely for JSON), OR we have to say that ind may not be present in a cmw-array if the media type is "application/cmw-j2c-tunnel". Right now the alternates parse ambiguously.

thomas-fossati commented 5 months ago

I think we do have to specify that the media type in a cmw-array cannot be "application/cmw-j2c-tunnel" (conversely for JSON), OR we have to say that ind may not be present in a cmw-array if the media type is "application/cmw-j2c-tunnel". Right now the alternates parse ambiguously.

Yes, absolutely. The two tunnel media types are effectively magic numbers.

thomas-fossati commented 5 months ago

I think we do have to specify that the media type in a cmw-array cannot be "application/cmw-j2c-tunnel" (conversely for JSON), OR we have to say that ind may not be present in a cmw-array if the media type is "application/cmw-j2c-tunnel". Right now the alternates parse ambiguously.

Yes, absolutely. The two tunnel media types are effectively magic numbers.

An alternative would be to use as a magic number something that doesn't parse as a media-type string (i.e., anything (!ALPHA && !DIGIT)), say "!" for CBOR and "#" for JSON, or, less concisely but a bit more self-descriptively, "#cmw-c2j-tunnel" and "#cmw-j2c-tunnel".

This would also spare us from registering two mostly useless new media types :-)
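To illustrate why a non-media-type magic string removes the ambiguity, here is a rough sketch of a dispatcher that decides on the first array element alone. The function name `classify_record` is hypothetical, and the input is assumed to be an already-decoded JSON or CBOR array.

```python
import string

# A media type name must start with ALPHA or DIGIT, so anything else in the
# first slot cannot be mistaken for a media type.
_MEDIA_TYPE_START = set(string.ascii_letters + string.digits)

C2J_MAGIC = "#cmw-c2j-tunnel"
J2C_MAGIC = "#cmw-j2c-tunnel"


def classify_record(record: list) -> str:
    """Classify a decoded array as a tunnel or a regular cmw-array."""
    if len(record) < 2:
        raise ValueError("record needs at least a type and a value")
    head = record[0]
    if isinstance(head, str) and head[:1] not in _MEDIA_TYPE_START:
        # Not a media type: the first element must be a known magic string.
        if head == C2J_MAGIC:
            return "c2j-tunnel"
        if head == J2C_MAGIC:
            return "j2c-tunnel"
        raise ValueError(f"unknown magic string {head!r}")
    # A media type string or an integer content format: regular cmw-array.
    return "cmw-array"
```

Because the decision never looks past the first element, the alternates no longer parse ambiguously, and no restriction on the media types allowed in a cmw-array is needed.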

ietf-rats-wg / draft-ietf-rats-msg-wrap

Request for clarification: array length and mixed encoding #55
