icdevs / ICEventsWG

WG for developing an Event System(pub/sub) on the Internet Computer

Adopt CBOR as a standard binary encoding format for message payloads #9

Open IcariaSystems opened 5 months ago

IcariaSystems commented 5 months ago

Proposal to adopt CBOR as a standard binary encoding & encapsulation format for message payloads. I know it is already used in a number of IC code libraries & standards.

Pros would be: (a) wide language support, including Rust, Motoko(?), JS and Python for canisters and just about any other language used for off-chain clients; many developer tools also support inspection; (b) JSON data is easy to encode into a compact, canonical form for transport (a deterministic data structure that removes any text-formatting variance between copies of the same content); (c) 1-4 byte Semantic Tags placed in front of encoded binary data structures, which can be used either to embed schema information completely or as a very compact reference to an external schema.
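Point (b) can be illustrated with a minimal sketch, assuming a hand-rolled encoder for a small subset of CBOR (integers, text, byte strings, lists, maps, booleans and null; no floats or tags). The names `_head` and `encode` are illustrative and do not come from any IC library. Map keys are sorted by their encoded bytes, following the deterministic encoding rules of RFC 8949, so two JSON texts that differ only in whitespace and key order should produce the same bytes.

```python
import json


def _head(major: int, value: int) -> bytes:
    """Encode a CBOR head (major type plus argument) in its shortest form."""
    mt = major << 5
    if value < 24:
        return bytes([mt | value])
    if value < 0x100:
        return bytes([mt | 24, value])
    if value < 0x10000:
        return bytes([mt | 25]) + value.to_bytes(2, "big")
    if value < 0x100000000:
        return bytes([mt | 26]) + value.to_bytes(4, "big")
    if value < 0x10000000000000000:
        return bytes([mt | 27]) + value.to_bytes(8, "big")
    raise ValueError("integer too large for this sketch")


def encode(obj) -> bytes:
    """Deterministically encode a JSON-like value (subset of CBOR only)."""
    # Booleans and None are checked first because bool is a subclass of int.
    if obj is None:
        return b"\xf6"
    if obj is True:
        return b"\xf5"
    if obj is False:
        return b"\xf4"
    if isinstance(obj, int):
        return _head(0, obj) if obj >= 0 else _head(1, -1 - obj)
    if isinstance(obj, (bytes, bytearray)):
        return _head(2, len(obj)) + bytes(obj)
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(obj, (list, tuple)):
        return _head(4, len(obj)) + b"".join(encode(x) for x in obj)
    if isinstance(obj, dict):
        # Sort entries by the bytes of the encoded key, not by insertion order.
        pairs = sorted((encode(k), encode(v)) for k, v in obj.items())
        return _head(5, len(pairs)) + b"".join(k + v for k, v in pairs)
    raise TypeError(f"unsupported type in this sketch: {type(obj).__name__}")


# Same content, different formatting and key order.
first = encode(json.loads('{"b": [1, 2], "a": "x"}'))
second = encode(json.loads('{ "a":"x",   "b":[1,2] }'))
print(first == second)  # True
print(first.hex())      # a2616161786162820102
```

A real canister or client would more likely use an existing CBOR library; the point here is only that the byte output depends on the data, not on how the JSON text was laid out.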

It may be possible to extend the CBOR standard semantic tag dictionary for identified IC payload data schemas, e.g. we could formally assign a 2-byte CBOR semantic tag to each ICRC-X defined data schema.

Also there is an assigned 1-byte tag for “CBOR data” which can be used as a ‘magic number’ for the first byte of a message payload to distinguish it trivially from a text payload or other binary message format.
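As a rough illustration of how a tag prefix works on the wire, here is a minimal sketch of CBOR tag-head encoding per RFC 8949 (major type 6). The helper names are hypothetical. Note that the size of the head depends on the tag number: the "self-described CBOR" tag 55799 that is commonly used as a magic number takes three bytes (0xd9 0xd9 0xf7), and tag 24 ("encoded CBOR data item") takes two.

```python
SELF_DESCRIBED_CBOR = 55799  # RFC 8949 "self-described CBOR" tag
ENCODED_CBOR_ITEM = 24       # RFC 8949 "encoded CBOR data item" tag


def tag_head(tag: int) -> bytes:
    """Encode a CBOR tag head (major type 6) in its shortest form."""
    if tag < 0:
        raise ValueError("tag numbers are non-negative")
    if tag < 24:
        return bytes([0xC0 | tag])
    if tag < 0x100:
        return bytes([0xD8, tag])
    if tag < 0x10000:
        return b"\xd9" + tag.to_bytes(2, "big")
    if tag < 0x100000000:
        return b"\xda" + tag.to_bytes(4, "big")
    return b"\xdb" + tag.to_bytes(8, "big")


def wrap_self_described(cbor_payload: bytes) -> bytes:
    """Prefix already-encoded CBOR with the self-described tag."""
    return tag_head(SELF_DESCRIBED_CBOR) + cbor_payload


def looks_like_self_described_cbor(data: bytes) -> bool:
    """Cheap prefix check; it does not validate the rest of the payload."""
    return data.startswith(tag_head(SELF_DESCRIBED_CBOR))


payload = wrap_self_described(b"\xa0")  # 0xa0 is an empty CBOR map
print(payload.hex())                            # d9d9f7a0
print(looks_like_self_described_cbor(payload))  # True
print(looks_like_self_described_cbor(b"{}"))    # False
```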

Gekctek commented 5 months ago

There is Motoko support https://mops.one/cbor so that shouldn't be an issue

I agree, I like CBOR as at least the default

lachlanw commented 5 months ago

I have read and thought about this more. I think that a recommended binary encoding is a good idea BUT it would be better to have a registry of well-known bytecode formats that are well-supported in IC CDKs. A revised proposal for "standard" byte-code formats for Blob message content for discussion:

a) identify a limited number (may be added to by agreement) of well-known, well-specified byte encoding formats for enhanced support; e.g. UTF-8, CANDID, CBOR, AVRO, ...

b) specify read-only interface function that identifies the binary message encoding type with a short (4-byte?) Blob type code;

c) the encoding type is read/derived from the leading bytes of the message binary content by way of a "magic-number" or other standard prefix used by that encoding (many standard byte encodings support some form of identifying prefix).

for example:

ByteCode: Blob_type_code => Content prefix

AVRO: AVRO => O b j 0x01 https://avro.apache.org/docs/1.11.1/specification/#object-container-files

CBOR: CBOR => 0xd9 0xd9 0xf7 0x?? https://www.rfc-editor.org/rfc/rfc8949.html#name-self-described-cbor

CANDID: DIDL => D I D L https://github.com/dfinity/candid/blob/master/spec/Candid.md#parameters-and-results
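Steps (b) and (c) could look something like the following sketch. The function name `blob_type_code` and the choice to return the 4-byte code as `bytes` are assumptions for illustration; only the three prefixes listed above are recognised.

```python
from typing import Optional

# Content prefix -> 4-byte Blob type code, as listed in the comment above.
MAGIC_PREFIXES = {
    b"Obj\x01": b"AVRO",       # Avro object container file header
    b"\xd9\xd9\xf7": b"CBOR",  # self-described CBOR tag 55799
    b"DIDL": b"DIDL",          # Candid wire format magic
}


def blob_type_code(data: bytes) -> Optional[bytes]:
    """Return the type code implied by the leading bytes, or None if unknown.

    This only inspects the prefix; it does not check that the rest of the
    Blob is well-formed for the detected encoding.
    """
    for prefix, code in MAGIC_PREFIXES.items():
        if data.startswith(prefix):
            return code
    return None


print(blob_type_code(b"DIDL\x00\x00"))      # b'DIDL'
print(blob_type_code(b"\xd9\xd9\xf7\xa0"))  # b'CBOR'
print(blob_type_code(b"Obj\x01\x04"))       # b'AVRO'
print(blob_type_code(b"plain text"))        # None
```

Because none of the listed prefixes is a prefix of another, the lookup order does not matter here; a larger registry would need to check longer prefixes first.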

In general CBOR can be used to prefix-tag many data formats that might be used as content for a message. There is an extensible registry of CBOR "semantic tags" (see https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml ) which can be used to further specify the binary (or text) content type of the message if they follow the standard 3-byte prefix used to identify a CBOR message as shown above. For example:

UTF8 TEXT: UTF8 => 0xd9 0xd9 0xf7 0xff

Any textual format for structured data that uses UTF-8 as a binary byte encoding can usually be identified from a header line. For example:

JSON TEXT: UTF8+JSON => { OR [ https://www.rfc-editor.org/rfc/rfc7493.html#section-4.1

XML TEXT: UTF8+XML => <?xml version="1.0" encoding="UTF-8"?>

YAML TEXT: UTF8+YAML => %YAML 1.2 https://yaml.org/
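A heuristic along these lines might be sketched as follows. The function name and the returned labels (including `UTF8+JSON` for JSON) are assumptions, and the check is deliberately shallow: a YAML document without a `%YAML` directive, or one written in flow style starting with `{` or `[`, would not be classified as YAML here.

```python
from typing import Optional


def text_type_code(data: bytes) -> Optional[str]:
    """Guess a text format label from the first characters of a UTF-8 payload.

    Returns None when the bytes are not valid UTF-8.
    """
    try:
        # "utf-8-sig" also strips a leading byte-order mark if present.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        return None
    head = text.lstrip()
    if head.startswith("<?xml"):
        return "UTF8+XML"
    if head.startswith("%YAML"):
        return "UTF8+YAML"
    if head.startswith(("{", "[")):
        return "UTF8+JSON"
    return "UTF8"


print(text_type_code(b'{"a": 1}'))                    # UTF8+JSON
print(text_type_code(b'<?xml version="1.0"?><a/>'))   # UTF8+XML
print(text_type_code(b"%YAML 1.2\n---\na: 1\n"))      # UTF8+YAML
print(text_type_code(b"hello"))                       # UTF8
print(text_type_code(b"\xd9\xd9\xf7"))                # None
```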

This idea clearly needs to be discussed, and a specification of how to get (and possibly set) the binary type code needs to be written, before it is voted upon. Comments and thoughts welcome.

Gekctek commented 5 months ago

Could the binary encoding be specified in the new config map? and if not specified we could have a default like CBOR?

lachlanw commented 5 months ago

> Could the binary encoding be specified in the new config map?

That's a good idea. If the publisher of the event sets a binary-encoding type specifier in the config map then any consumer can read that to decide how to decode or otherwise process the Blob. It would not be a guarantee that the Blob is in fact well-formed data for the specified type (because mistakes and errors can occur during publishing or relay).

> if not specified we could have a default like CBOR?

I was thinking the unspecified default would actually be "DIDL", to indicate that any message is by default encoded in the CANDID wire format used for inter-canister messaging. That may be interpreted as meaning the message data type is any value except Blob. Then, if Blob is the type used by the publisher, the byte-code type specifier must be set; CBOR should be used by default as it supports compact and semantically tagged structured binary data that is self-describing (no external schema is required to parse a lot of the content).
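From the consumer's side, the rule described here might be sketched like this. The config map is modelled as a plain Python dict and the key name `binary_encoding` is purely hypothetical; ICRC-72 does not define such a key at this point in the discussion.

```python
ENCODING_KEY = "binary_encoding"  # hypothetical config map key
UNSPECIFIED_DEFAULT = "DIDL"      # Candid wire format, per the comment above


def resolve_encoding(config: dict, payload_is_blob: bool) -> str:
    """Decide which encoding label a consumer should assume for a message.

    The declared label is a hint only: it does not guarantee that the Blob
    is well-formed data of that type.
    """
    declared = config.get(ENCODING_KEY)
    if declared is not None:
        return declared
    if not payload_is_blob:
        # No specifier and a non-Blob value: treat as ordinary Candid data.
        return UNSPECIFIED_DEFAULT
    # A Blob without a specifier violates the "must be set" rule.
    raise ValueError("Blob payload published without an encoding specifier")


print(resolve_encoding({}, payload_is_blob=False))                         # DIDL
print(resolve_encoding({"binary_encoding": "CBOR"}, payload_is_blob=True))  # CBOR
```

A more lenient variant could fall back to "CBOR" for an unlabelled Blob instead of raising; which behaviour is preferable is exactly the kind of detail the specification would need to settle.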

lachlanw commented 4 months ago

Before drafting a (small) addition to the ICRC-72 draft to address how the binary format used to encode the Blob message data type is indicated, can I confirm that this is the section in which we would describe a config entry: https://github.com/icdevs/ICEventsWG/blob/main/Meetings/20240529/icrc72draft.md#publications-configs