Multiformats Considered Harmful

selfissued commented 1 year ago

While I usually reserve my time and energy for advancing good ideas, I’m making an exception to publicly state the reasons why I believe “multiformats” should not be considered for standardization by the IETF.

Multiformats institutionalize the failure to make a choice, which is the opposite of what good standards do. Good standards make choices about representations of data structures resulting in interoperability, since every conforming implementation uses the same representation. In contrast, multiformats enable different implementations to use a multiplicity of different representations for the same data, harming interoperability. https://datatracker.ietf.org/doc/html/draft-multiformats-multibase-03#appendix-D.1 defines 23 equivalent and non-interoperable representations for the same data!
The stated purpose of “multibase” is “Unfortunately, it’s not always clear what base encoding is used; that’s where this specification comes in. It answers the question: Given data ‘d’ encoded into text ‘s’, what base is it encoded with?”, which is wholly unnecessary. Successful standards DEFINE what encoding is used where. For instance, https://www.rfc-editor.org/rfc/rfc7518.html#section-6.2.1.2 defines that “x” is base64url encoded. No guesswork or prefixing is necessary or useful.
Standardization of multiformats would result in unnecessary and unhelpful duplication of functionality – especially of key representations. The primary use of multiformats is for “publicKeyMultibase” – a representation of public keys that are byte arrays. For instance, the only use of multiformats by the W3C DID spec is for publicKeyMultibase. The IETF already has several perfectly good key representations, including X.509, JSON Web Key (JWK), and COSE_Key. There’s not a compelling case for another one.
publicKeyMultibase can only represent a subset of the key types used in practice. Representing many kinds of keys requires multiple values – for instance, RSA keys require both an exponent and a modulus. By comparison, the X.509, JWK, and COSE_Key formats are flexible enough to represent all kinds of keys. It makes little to no sense to standardize a key format that limits implementations to only certain kinds of keys.
The “multihash” specification relies on a non-standard representation of integers called “Dwarf”. Indeed, the referenced Dwarf document lists itself as being at http://dwarf.freestandards.org/ – a URL that no longer exists!
The “Multihash Identifier Registry” at https://www.ietf.org/archive/id/draft-multiformats-multihash-07.html#mh-registry duplicates the functionality of the IANA “Named Information Hash Algorithm Registry” at https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg, in that both assign (different) numeric identifiers for hash functions. If multihash goes forward, it should use the existing registry.
It’s concerning that the draft charter states that “Changing current Multiformat header assignments in a way that breaks backward compatibility with production deployments” is out of scope. Normally IETF working groups are given free rein to make improvements during the standardization process.
Finally, as a member of the W3C DID and W3C Verifiable Credentials working groups, I will state that it is misleading for the draft charter to say that “The outputs from this Working Group are currently being used by … the W3C Verifiable Credentials Working Group, W3C Decentralized Identifiers Working Group…”. The documents produced by these working groups intentionally contain no normative references to multiformats or any data structures derived from them. Where they are referenced, it is explicitly stated that the references are non-normative.

zamicol commented 1 year ago

I respect the first principles engineering work multiformats demonstrates.

I didn't realize that a public key specification was included. Where is publicKeyMultibase defined?

EDIT: msporny pointed me in the right direction:

publicKeyMultibase is an encoding of Multikey. The Multikey format is described for each primitive in its respective W3C specification, for example, eddsa and ecdsa.

BigBlueHat commented 1 year ago

@zamicol publicKeyMultibase is defined in the DID-CORE spec.

zamicol commented 1 year ago

In that document I see publicKeyMultibase referred to, but I don't see a definition. publicKeyJwk is referred to in the same way, but JWK is defined by its own specification, and the DID-CORE spec links to that specification. I don't see any such link for Multibase's "public key". Where is Multibase's public key defined?

AaronGoldman commented 1 year ago

1.

Multiformats institutionalize the failure to make a choice, which is the opposite of what good standards do. Good standards make choices about representations of data structures resulting in interoperability, since every conforming implementation uses the same representation. In contrast, Multiformats enable different implementations to use a multiplicity of different representations for the same data, harming interoperability. datatracker.ietf.org/doc/html/draft-Multiformats-Multibase-03#appendix-D.1 defines 23 equivalent and non-interoperable representations for the same data!

Multibase specifically and Multiformats more generally are standards for decoupling. A good example of a decoupling standard is IPv4/IPv6 and the IP protocol numbers. IPv4 has Protocol and IPv6 has the Next Header but they share the same IANA registry. We could call this a "failure to make a choice" as IP did not choose the format of the layers above and below IP, or we could view it as a deliberate decoupling of the layers of the network stack. Whether it was a good or bad design, it did enable innovation in what types of content IP is capable of encapsulating. There are 146 protocols in the registry and some routers don't implement them all, just preferring ICMP, UDP, and TCP but IPv4/IPv6 have still proved useful.

The Multibase standard solves the problem of representing bytes in text strings with restricted character sets, without needing to know in advance what the restrictions will be. This is independent and separate from all the other Multiformat standards.

The Multiformat standard solves the problem of providing a "tag" to specify what the next "value" is, same as IPv4's Protocol header or HTTP's Content-Type header.

2.

The stated purpose of "Multibase" is "Unfortunately, it's not always clear what base encoding is used; that's where this specification comes in. It answers the question: Given data ‘d' encoded into text ‘s', what base is it encoded with?", which is wholly unnecessary. Successful standards DEFINE what encoding is used where. For instance, rfc-editor.org/rfc/rfc7518.html#section-6.2.1.2 defines that "x" is base64url encoded. No guesswork or prefixing is necessary or useful.

Some standards do specify a specific encoding. Multibase will not prevent any past or future standard from specifying that a text field is Base64url, for example. It does enable future standards to specify that bytes are encoded as a Multibase string.

Multibase is a set of encodings that allows an array of bytes to be encoded as text under character-set restrictions that may not always be known in advance. If we had a protocol that had a 32-byte number, and we needed to represent those bytes as text, we could represent them as:

Base	Literal
b256(bytes)	(non-ascii bytes not representable here)
b85	<FLd+nEV_Rn)~#~nQyryC$2%{WSf&rq?MT)cv84k
b64	47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
b32	4OYMIQUY7QOBJGX36TEJS35ZEQT24QPEMSNZGTFESWMRW6CSXBKQ====
b16	E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
integer_literal	0xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
integer_literal	102987336249554097029535212322581322789799900648198034993379397001115665086549
integer_literal	0o16166061041230770160244657576462114557562220475344074431115623231222254621557024534125
integer_literal	0b1110001110110000110001000100001010011000111111000001110000010100100110101111101111110100110010001001100101101111101110010010010000100111101011100100000111100100011001001001101110010011010011001010010010010101100110010001101101111000010100101011100001010101

By using an integer literal, I can both describe the number and the base that the number is represented in. In this case, we represent hex in a text that only needs to be able to support 0123456789abcdefx, binary with just 01b, and so on. Multibase takes this further by requiring that the first byte (indicating the base) is one of the bytes from the alphabet of the encoding. This way we don't add a character requirement for no value.
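The prefix idea can be sketched in a few lines of Python. This is a minimal illustration, not the draft's reference implementation: it assumes the prefix characters `f` (base16 lowercase), `b` (base32 lowercase, unpadded), `m` (base64, unpadded) and `u` (base64url, unpadded) from the Multibase table, and the helper names `multibase_encode` and `multibase_decode` are hypothetical. The 32 bytes used are the SHA-256 digest of the empty string, which is the value shown in the table above.

```python
import base64
import hashlib

# Prefix characters assumed from the Multibase table:
# 'f' base16 lowercase, 'b' base32 lowercase (no padding),
# 'm' base64 (no padding), 'u' base64url (no padding).
_ENCODERS = {
    "f": lambda data: data.hex(),
    "b": lambda data: base64.b32encode(data).decode("ascii").lower().rstrip("="),
    "m": lambda data: base64.b64encode(data).decode("ascii").rstrip("="),
    "u": lambda data: base64.urlsafe_b64encode(data).decode("ascii").rstrip("="),
}


def _pad(text, block):
    """Restore the '=' padding that the unpadded variants strip."""
    return text + "=" * (-len(text) % block)


_DECODERS = {
    "f": lambda text: bytes.fromhex(text),
    "b": lambda text: base64.b32decode(_pad(text.upper(), 8)),
    "m": lambda text: base64.b64decode(_pad(text, 4)),
    "u": lambda text: base64.urlsafe_b64decode(_pad(text, 4)),
}


def multibase_encode(prefix, data):
    """Encode bytes and prepend the one-character base prefix."""
    return prefix + _ENCODERS[prefix](data)


def multibase_decode(text):
    """Read the prefix, then decode the rest with the matching base."""
    prefix, body = text[0], text[1:]
    if prefix not in _DECODERS:
        raise ValueError(f"unsupported multibase prefix: {prefix!r}")
    return _DECODERS[prefix](body)


digest = hashlib.sha256(b"").digest()
for prefix in "fbmu":
    encoded = multibase_encode(prefix, digest)
    # The decoder needs no out-of-band knowledge of the base.
    assert multibase_decode(encoded) == digest
    print(encoded)
```

Note that each prefix character is itself a member of the alphabet it announces, so the string needs no characters beyond those the encoding already requires.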

An example of this adding value is when Multibase was chosen for IPFS CIDs. The CIDs were traditionally in base58btc, which is case-sensitive. This worked well for representing bytes in the restricted text environment of file paths and URI paths. This could have easily been specified as a base58btc string, but fortunately they chose Multibase to decouple the bytes of the CID from the string representation. When the time came that they wanted to put CIDs into subdomains, the case-insensitive subdomains were a more restricted text environment that they had not anticipated. They switched to base32 which was not case-sensitive and thus able to represent the same bytes in a more restricted environment.
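A rough sketch of that kind of switch, assuming the Multibase prefixes `z` (base58btc) and `b` (base32 lowercase, unpadded): the same bytes are decoded from one text form and re-encoded in the other. The function names are hypothetical, and the payload is a stand-in byte string rather than a real CID.

```python
import base64
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58_encode(data):
    """Base58btc: treat the bytes as a big-endian integer, keep leading zeros as '1'."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * zeros + out


def b58_decode(text):
    """Inverse of b58_encode."""
    n = 0
    for ch in text:
        n = n * 58 + B58_ALPHABET.index(ch)
    zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * zeros + n.to_bytes((n.bit_length() + 7) // 8, "big")


def base58btc_to_base32(multibase_text):
    """Re-encode a 'z'-prefixed string as a 'b'-prefixed, case-insensitive one."""
    if not multibase_text.startswith("z"):
        raise ValueError("expected a base58btc ('z') multibase string")
    raw = b58_decode(multibase_text[1:])
    body = base64.b32encode(raw).decode("ascii").lower().rstrip("=")
    return "b" + body


# Stand-in payload (not a real CID): two header bytes plus a SHA-256 digest.
payload = bytes.fromhex("1220") + hashlib.sha256(b"example").digest()
case_sensitive = "z" + b58_encode(payload)
case_insensitive = base58btc_to_base32(case_sensitive)
print(case_sensitive)
print(case_insensitive)
```

Because the consumer reads the prefix before decoding, both strings identify the same bytes, and only the text environment's constraints decide which one to emit.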

Multibase is orthogonal to Multiformats and should be standardized as a way to represent bytes in a restricted text environment that is restricted in ways that are irrelevant to the bytes being represented. If we don't know whether our data will need to be represented as compact arbitrary bytes, 7-bit safe ascii, JSON non-escaped ascii, CSV non-escaped ascii, TSV non-escaped ascii, URL path-safe ascii, domain-name-safe ascii, decimal numbers only, some not yet known but soon to be important environment, etc. then encoding the bytes as Multibase has decoupling value.

3.

Standardization of Multiformats would result in unnecessary and unhelpful duplication of functionality – especially of key representations. The primary use of Multiformats is for "publicKeyMultibase" – a representation of public keys that are byte arrays. For instance, the only use of Multiformats by the W3C DID spec is for publicKeyMultibase. The IETF already has several perfectly good key representations, including X.509, JSON Web Key (JWK), and COSE_Key. There's not a compelling case for another one.

The standardization of Multiformats is independent of whether IETF chooses to standardize publicKeyMultibase.

For example, the IANA protocol-numbers registry used by the IPv4 Protocol header assigns number 70 to VISA ("VISA Protocol"). This does not imply that IETF needs to specify VISA Protocol. In fact, as far as we can tell, it is the IVI Foundation that maintains that standard. In the exact same way, the only interaction between Multiformats standardization and publicKeyMultibase is that publicKeyMultibase could use the Multiformats registry to map numbers to key representations. Any flaws in publicKeyMultibase are no better an argument against standardization of Multiformats than the flaws in VISA Protocol are against standardization of IPv4 and the IANA protocol-numbers registry.

If X.509, JSON Web Key (JWK), or COSE_Key become the standard way to represent keys for the web then publicKeyMultibase could just add a Multiformats registry entry for X.509 or JWK, and publicKeyMultibase would just be a wrapper around those representations. COSE is already present in the registry.

4.

publicKeyMultibase can only represent a subset of the key types used in practice. Representing many kinds of keys requires multiple values – for instance, RSA keys require both an exponent and a modulus. By comparison, the X.509, JWK, and COSE_Key formats are flexible enough to represent all kinds of keys. It makes little to no sense to standardize a key format that limits implementations to only certain kinds of keys.

Please see above. publicKeyMultibase is outside the scope of this working group, which is tasked with producing the following artifacts:

An RFC specifying multibase usage

An RFC defining an independent multibase registry and populating it with today's already-implemented stable and final values

An RFC defining a registry-group for all the multicodecs, empty at inception, with registration process and group-wide constraints on registration values

An RFC specifying multihash usage

An RFC defining a multihash registry within the multicodecs registry group and populating it with today's already-implemented stable and final values

The Multiformat-varint spec is also pulled in as it is needed to specify the length in Multihash and Multiformat with sized payloads.

5.

The "multihash" specification relies on a non-standard representation of integers called "Dwarf". Indeed, the referenced Dwarf document lists itself as being at http://dwarf.freestandards.org – a URL that no longer exists!

We agree here - the Multiformats-varint is close to but not exactly Dwarf, because the Multiformats-varint is limited to 9 bytes. It is a 1-to-9 byte representation of an unsigned 63-bit integer, from 0x00 (0) to 0x7FFFFFFF_FFFFFFFF (9223372036854775807), which means the decoded value will always fit in either a signed int64 or an unsigned int64. If the most-significant bit of a byte is 0, this is the last byte of the Multiformats-varint. If it is 1, there is at least one more byte present in the Multiformats-varint. The 7 remaining bits are the payload bits, with the least-significant group first. To decode, shift each byte's payload bits left by 7 * (zero-based byte index) and | (bitwise-OR) them together to get the decoded number.

| length in bytes | Encoded bits | Bits                                                                             |
|-----------------|--------------|----------------------------------------------------------------------------------|
| 1               | 7            | 0xxxxxxx                                                                         |
| 2               | 14           | 1xxxxxxx 0xxxxxxx                                                                |
| 3               | 21           | 1xxxxxxx 1xxxxxxx 0xxxxxxx                                                       |
| 4               | 28           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                                              |
| 5               | 35           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                                     |
| 6               | 42           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                            |
| 7               | 49           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                   |
| 8               | 56           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx          |
| 9               | 63           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx |
|                 |              |  7     0, 14    8, 21   15, 28   22, 35   29, 42   36, 49   43, 56   50, 63   57 |
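
The layout in the table can be sketched as follows. This is a minimal illustration under the rules stated above (7 payload bits per byte, continuation flag in the most-significant bit, at most 9 bytes). The function names are hypothetical, and the sketch does not reject non-minimal encodings.

```python
MAX_VARINT_BYTES = 9  # 9 bytes * 7 payload bits = 63 bits


def varint_encode(value):
    """Encode an unsigned integer below 2**63, least-significant 7-bit group first."""
    if not 0 <= value < 2**63:
        raise ValueError("value must fit in 63 bits")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the top bit
        else:
            out.append(low7)         # last byte: top bit stays 0
            return bytes(out)


def varint_decode(buf):
    """Return (value, bytes_consumed) for the varint at the start of buf."""
    value = 0
    for index, byte in enumerate(buf[:MAX_VARINT_BYTES]):
        value |= (byte & 0x7F) << (7 * index)
        if byte & 0x80 == 0:
            return value, index + 1
    raise ValueError("varint is truncated or longer than 9 bytes")


for sample in (0, 127, 128, 300, 2**63 - 1):
    encoded = varint_encode(sample)
    assert varint_decode(encoded) == (sample, len(encoded))
    print(sample, encoded.hex())
```

Returning the number of bytes consumed lets a caller keep parsing whatever follows the varint in the same buffer.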

Multiformats-varint is such a simple varint that there is no reason to point anywhere else. The Multiformats-varint should be specified by this working group alongside Multibase and Multihash. Any reference to Dwarf is simply unnecessary as it is clearer to specify Multiformats-varint rather than trying to describe it relative to a similar but non-identical varint.

6.

The "Multihash Identifier Registry" at ietf.org/archive/id/draft-Multiformats-multihash-07.html#mh-registry duplicates the functionality of the IANA "Named Information Hash Algorithm Registry" at iana.org/assignments/named-information/named-information.xhtml#hash-alg, in that both assign (different) numeric identifiers for hash functions. If multihash goes forward, it should use the existing registry.

"Not all uses of these names require use of the full hash output -- truncated hashes can be safely used in some environments. For this reason, we define a new IANA registry for hash functions to be used with this specification so as not to mix strong and weak (truncated) hash algorithms in other protocol registries." -- rfc6920: Naming Things with Hashes

The named-information registry exists to identify a hash function and prefix (truncation) length for the binary encoding of an ni:// or nih:// URI. That identifier is limited to a 6-bit field, but the Multiformats registry intends to support more than 64 algorithm/size pairs.

hash	sizes
identity	1
sha1	1
sha2	9
sha2a	1
sha3	4
keccak	5
blake3	1
md4	1
md5	1
blake2b	64
blake2s	32
skein256	32
skein512	64
skein1024	128

We can't fit hundreds of hash-function/length pairs in a 64-entry registry. Reusing the Named Information registry would also break backwards compatibility, because it changes which numbers map to which hash functions, and it would pollute the registry for rfc6920 implementors by including non-cryptographically secure hash functions. Lastly, the Multiformats registry already contains more than 64 hash functions and would not fit in the Named Information Hash Algorithm Registry.

It is better to have hash function and length as two different fields as in Multihash.
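
As a rough sketch of that two-field layout, a Multihash can be built as a function code, then a digest length, then the digest bytes. The example assumes code 0x12 for sha2-256 from the multicodec table and handles only codes and lengths below 0x80, where the varint is a single byte. The function names are hypothetical.

```python
import hashlib

SHA2_256_CODE = 0x12  # assumed multicodec table entry for sha2-256


def multihash_sha2_256(data, length=32):
    """Build <code><length><digest>, optionally truncating the digest."""
    if not 1 <= length <= 32:
        raise ValueError("sha2-256 digests are 1 to 32 bytes long")
    digest = hashlib.sha256(data).digest()[:length]
    # Both fields are below 0x80, so each is a one-byte varint.
    return bytes([SHA2_256_CODE, length]) + digest


def split_multihash(mh):
    """Split a multihash into (code, length, digest); single-byte varints only."""
    code, length = mh[0], mh[1]
    if code >= 0x80 or length >= 0x80:
        raise ValueError("multi-byte varints are not handled in this sketch")
    digest = mh[2:]
    if len(digest) != length:
        raise ValueError("digest length does not match the length field")
    return code, length, digest


full = multihash_sha2_256(b"")
short = multihash_sha2_256(b"", length=16)
print(full.hex())
print(short.hex())
```

The truncated value reuses the same function code and changes only the length field, so a registry of this shape needs one entry per hash function rather than one per function/length pair.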

7.

It's concerning that the draft charter states that "Changing current Multiformat header assignments in a way that breaks backward compatibility with production deployments" is out of scope. Normally IETF working groups are given free rein to make improvements during the standardization process.

This may be a distinction without a difference. We certainly could empower the working group to make backwards incompatible changes, but they will try not to have any unnecessary breaking changes.

8.

Finally, as a member of the W3C DID and W3C Verifiable Credentials working groups, I will state that it is misleading for the draft charter to say that "The outputs from this Working Group are currently being used by … the W3C Verifiable Credentials Working Group, W3C Decentralized Identifiers Working Group…". The documents produced by these working groups intentionally contain no normative references to Multiformats or any data structures derived from them. Where they are referenced, it is explicitly stated that the references are non-normative.

This is a good note. The draft charter should probably be clear that Multiformats are being used in Verifiable Credentials and Decentralized Identifiers in production. There are multiple existing independent implementations of this technology enabling Verifiable Credentials and Decentralized Identifiers to be useful. While these specs contain no normative references, this registry provides the ability to make Verifiable Credentials and Decentralized Identifiers that are better decoupled from the data structures that they contain, and will therefore be flexible in the face of future evolution.

msporny / charter-ietf-multiformats

Multiformats Considered Harmful #2