Open selfissued opened 1 year ago
I respect the first principles engineering work multiformats demonstrates.
I didn't realize that a public key specification was included. Where is publicKeyMultibase
defined?
EDIT: msporny pointed me in the right direction: `publicKeyMultibase` is an encoding of `Multikey`. The `Multikey` format is described for each primitive in its respective W3C specification, for example, eddsa and ecdsa.
@zamicol `publicKeyMultibase` is defined in the DID-CORE spec.
In that document I see `publicKeyMultibase` referred to, but I don't see a definition. `publicKeyJwk` is likewise referred to, but JWK is defined by its own specification. The DID-CORE spec links to the appropriate specification for JWK, but I don't see any such link for Multibase's "public key". Where is Multibase's public key defined?
Multiformats institutionalize the failure to make a choice, which is the opposite of what good standards do. Good standards make choices about representations of data structures resulting in interoperability, since every conforming implementation uses the same representation. In contrast, Multiformats enable different implementations to use a multiplicity of different representations for the same data, harming interoperability. datatracker.ietf.org/doc/html/draft-Multiformats-Multibase-03#appendix-D.1 defines 23 equivalent and non-interoperable representations for the same data!
Multibase specifically and Multiformats more generally are standards for decoupling. A good example of a decoupling
standard is IPv4/IPv6 and the IP protocol numbers. IPv4 has `Protocol` and IPv6 has the `Next Header`, but they share
the same IANA registry.
We could call this a "failure to make a choice" as IP did not choose the format of the layers above and below IP, or
we could view it as a deliberate decoupling of the layers of the network stack. Whether it was a good or bad design,
it did enable innovation in what types of content IP is capable of encapsulating. There are 146 protocols in the
registry and some routers don't implement them all, just preferring ICMP, UDP, and TCP but IPv4/IPv6 have still
proved useful.
The Multibase standard solves the problem of representing bytes in text strings with restricted character sets, without needing to know in advance what the restrictions will be. This is independent and separate from all the other Multiformat standards.
The Multiformat standard solves the problem of providing a "tag" to specify what the next "value" is, same as IPv4's
`Protocol` header or HTTP's `Content-Type` header.
The stated purpose of "Multibase" is "Unfortunately, it's not always clear what base encoding is used; that's where this specification comes in. It answers the question: Given data 'd' encoded into text 's', what base is it encoded with?", which is wholly unnecessary. Successful standards DEFINE what encoding is used where. For instance, rfc-editor.org/rfc/rfc7518.html#section-6.2.1.2 defines that "x" is base64url encoded. No guesswork or prefixing is necessary or useful.
Some standards do specify a specific encoding. Multibase will not prevent any past or future standard from specifying
that a text field is `Base64url`, for example. It does enable future standards to specify that bytes are encoded as a
Multibase string.
Multibase is a set of encodings that allow an array of bytes to be encoded as text under character-set restrictions that may not be known in advance. If we had a protocol with a 32-byte number and needed to represent those bytes as text, we could represent them as:
Base | Literal |
---|---|
b256(bytes) | (non-ascii bytes not representable here) |
b85 | <FLd+nEV_Rn)~#~nQyryC$2%{WSf&rq?MT)cv84k |
b64 | 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= |
b32 | 4OYMIQUY7QOBJGX36TEJS35ZEQT24QPEMSNZGTFESWMRW6CSXBKQ==== |
b16 | E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855 |
integer_literal | 0xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 |
integer_literal | 102987336249554097029535212322581322789799900648198034993379397001115665086549 |
integer_literal | 0o16166061041230770160244657576462114557562220475344074431115623231222254621557024534125 |
integer_literal | 0b1110001110110000110001000100001010011000111111000001110000010100100110101111101111110100110010001001100101101111101110010010010000100111101011100100000111100100011001001001101110010011010011001010010010010101100110010001101101111000010100101011100001010101 |
By using an integer literal, I can describe both the number and the base that the number is represented in. In this
case, we represent hex in a text that only needs to be able to support `0123456789abcdefx`, binary with just `01b`, and
so on. Multibase takes this further by requiring that the first byte (indicating the base) is one of the bytes from the
alphabet of the encoding. This way we don't add a character requirement for no value.
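As a rough illustration of the table above, the following Python sketch re-derives several of its rows from the same 32 bytes using only the standard library. The variable names are ours, and the b85 and b256 rows are left out; this is not taken from any Multiformats implementation.

```python
import base64

# The 32 bytes from the table above, taken from its b16 row.
raw = bytes.fromhex("e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")
n = int.from_bytes(raw, "big")

encodings = {
    "b64": base64.b64encode(raw).decode("ascii"),
    "b32": base64.b32encode(raw).decode("ascii"),
    "b16": base64.b16encode(raw).decode("ascii"),
    "hex literal": hex(n),
    "decimal literal": str(n),
    "octal literal": oct(n),
    "binary literal": bin(n),
}

for name, text in encodings.items():
    print(f"{name:16} {text}")

# Integer literals are self-describing: base 0 tells int() to read the prefix.
for key in ("hex literal", "octal literal", "binary literal"):
    assert int(encodings[key], 0) == n
```

The last loop is the point of the comparison: the `0x`, `0o`, and `0b` prefixes let a reader recover the number without being told the base out of band.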
An example of this adding value is when Multibase was chosen for IPFS CIDs. The CIDs were traditionally in `base58btc`,
which is case-sensitive. This worked well for representing bytes in the restricted text environment of file paths and
URI paths. This could have easily been specified as a `base58btc` string, but fortunately they chose Multibase to
decouple the bytes of the CID from the string representation. When the time came that they wanted to put CIDs into
subdomains, the case-insensitive subdomains were a more restricted text environment that they had not anticipated. They
switched to `base32`, which was not case-sensitive and thus able to represent the same bytes in a more restricted
environment.
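A minimal sketch of that decoupling, assuming only four of the registered Multibase prefixes (`f` for lowercase base16, `b` for lowercase unpadded base32, `m` for unpadded base64, `u` for unpadded base64url). The helper names `multibase_encode` and `multibase_decode` are hypothetical, and base58btc is omitted because the Python standard library does not provide it.

```python
import base64


def _unpad(text: str) -> str:
    return text.rstrip("=")


def _repad(text: str, block: int) -> str:
    return text + "=" * (-len(text) % block)


# prefix -> (encode, decode); only four of the registered bases, for illustration.
BASES = {
    "f": (lambda b: b.hex(), bytes.fromhex),  # base16, lowercase
    "b": (  # base32, lowercase, no padding
        lambda b: _unpad(base64.b32encode(b).decode("ascii")).lower(),
        lambda s: base64.b32decode(_repad(s.upper(), 8)),
    ),
    "m": (  # base64, no padding
        lambda b: _unpad(base64.b64encode(b).decode("ascii")),
        lambda s: base64.b64decode(_repad(s, 4)),
    ),
    "u": (  # base64url, no padding
        lambda b: _unpad(base64.urlsafe_b64encode(b).decode("ascii")),
        lambda s: base64.urlsafe_b64decode(_repad(s, 4)),
    ),
}


def multibase_encode(prefix: str, data: bytes) -> str:
    encode, _ = BASES[prefix]
    return prefix + encode(data)


def multibase_decode(text: str) -> bytes:
    # The first character names the base; the rest is the encoded payload.
    _, decode = BASES[text[0]]
    return decode(text[1:])


payload = bytes.fromhex("e3b0c442")
for prefix in BASES:
    text = multibase_encode(prefix, payload)
    assert multibase_decode(text) == payload
    print(text)
```

A consumer that only ever calls `multibase_decode` does not need to change when a producer moves from one base to another, which is the property the CID migration relied on.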
Multibase is orthogonal to Multiformats and should be standardized as a way to represent bytes in a restricted text environment that is restricted in ways that are irrelevant to the bytes being represented. If we don't know whether our data will need to be represented as compact arbitrary bytes, 7-bit safe ascii, JSON non-escaped ascii, CSV non-escaped ascii, TSV non-escaped ascii, URL path-safe ascii, domain-name-safe ascii, decimal numbers only, some not yet known but soon to be important environment, etc. then encoding the bytes as Multibase has decoupling value.
Standardization of Multiformats would result in unnecessary and unhelpful duplication of functionality – especially of key representations. The primary use of Multiformats is for "publicKeyMultibase" – a representation of public keys that are byte arrays. For instance, the only use of Multiformats by the W3C DID spec is for publicKeyMultibase. The IETF already has several perfectly good key representations, including X.509, JSON Web Key (JWK), and COSE_Key. There's not a compelling case for another one.
The standardization of Multiformats is independent of whether IETF chooses to standardize `publicKeyMultibase`.
For example, the IPv4 `Protocol` header registers number 70 as `VISA` (`VISA Protocol`). This does not imply that IETF
needs to specify VISA Protocol. In fact, as far as we can tell, it is the IVI Foundation that maintains that standard.
In the exact same way, the only interaction between Multiformats standardization and `publicKeyMultibase` is that
`publicKeyMultibase` could use the Multiformats registry to map numbers to key representations. Any flaws in
`publicKeyMultibase` are no better an argument against standardization of Multiformats than the flaws in VISA Protocol
are against standardization of IPv4 and the IANA protocol-numbers registry.
If X.509, JSON Web Key (JWK), or COSE_Key become the standard way to represent keys for the web, then
`publicKeyMultibase` could just add a Multiformats registry entry for X.509 or JWK, and `publicKeyMultibase` would just
be a wrapper around those representations. COSE is already present in the registry.
publicKeyMultibase can only represent a subset of the key types used in practice. Representing many kinds of keys requires multiple values – for instance, RSA keys require both an exponent and a modulus. By comparison, the X.509, JWK, and COSE_Key formats are flexible enough to represent all kinds of keys. It makes little to no sense to standardize a key format that limits implementations to only certain kinds of keys.
Please see above. `publicKeyMultibase` is outside the scope of this working group, which is tasked
with producing the following artifacts:
- An RFC specifying multibase usage
- An RFC defining an independent multibase registry and populating it with today's already-implemented stable and final values
- An RFC defining a registry-group for all the multicodecs, empty at inception, with registration process and group-wide constraints on registration values
- An RFC specifying multihash usage
- An RFC defining a multihash registry within the multicodecs registry group and populating it with today's already-implemented stable and final values
The Multiformat-varint spec is also pulled in as it is needed to specify the length in Multihash and Multiformat with sized payloads.
The "multihash" specification relies on a non-standard representation of integers called "Dwarf". Indeed, the referenced Dwarf document lists itself as being at http://dwarf.freestandards.org – a URL that no longer exists!
We agree here: the Multiformats-varint is close to, but not exactly, Dwarf. This is because the
Multiformats-varint is limited to 9 bytes. It is a 1-to-9 byte representation of an unsigned 63-bit integer, from
0x00 (0) to 0x7FFFFFFF_FFFFFFFF (9223372036854775807), which means the decoded value will always fit in either a signed
int64 or an unsigned int64. If the most significant bit of a byte is 0, this is the last byte of the
Multiformats-varint. If it is 1, there is at least one more byte present in the Multiformats-varint. The 7 remaining
bits are the payload bits. You can shift the payload bits left by `7 * (byte number)`, with the first byte numbered 0,
and `|` (bitwise-OR) them in to get the decoded number.
| Length in bytes | Payload bits | Byte layout                                                                      |
|-----------------|--------------|----------------------------------------------------------------------------------|
| 1 | 7 | 0xxxxxxx |
| 2 | 14 | 1xxxxxxx 0xxxxxxx |
| 3 | 21 | 1xxxxxxx 1xxxxxxx 0xxxxxxx |
| 4 | 28 | 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx |
| 5 | 35 | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx |
| 6 | 42 | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx |
| 7 | 49 | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx |
| 8 | 56 | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx |
| 9 | 63 | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx |
| | | 7 0, 14 8, 21 15, 28 22, 35 23, 42 36, 49 43, 56 50, 63 57 |
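The decoding rule described above can be sketched in a few lines of Python. The names `varint_encode` and `varint_decode` are ours, and this sketch only enforces the 9-byte and 63-bit limits from the text; any further constraints a final specification might impose (for example, rejecting non-minimal encodings) are not modeled.

```python
MAX_VARINT_BYTES = 9
MAX_VARINT_VALUE = (1 << 63) - 1  # 0x7FFFFFFF_FFFFFFFF


def varint_encode(value: int) -> bytes:
    if not 0 <= value <= MAX_VARINT_VALUE:
        raise ValueError("value does not fit in 63 bits")
    out = bytearray()
    while True:
        payload = value & 0x7F
        value >>= 7
        if value:
            out.append(payload | 0x80)  # MSB 1: more bytes follow
        else:
            out.append(payload)  # MSB 0: this is the last byte
            return bytes(out)


def varint_decode(data: bytes) -> tuple[int, int]:
    """Return (decoded value, number of bytes consumed)."""
    value = 0
    for index, byte in enumerate(data[:MAX_VARINT_BYTES]):
        # Shift the 7 payload bits left by 7 * (byte number) and OR them in.
        value |= (byte & 0x7F) << (7 * index)
        if byte & 0x80 == 0:
            return value, index + 1
    raise ValueError("varint is truncated or longer than 9 bytes")


for sample in (0, 127, 128, 300, MAX_VARINT_VALUE):
    encoded = varint_encode(sample)
    assert varint_decode(encoded) == (sample, len(encoded))
    print(sample, encoded.hex())
```

Returning the number of bytes consumed lets a caller continue parsing whatever follows the varint, which is how a length or code prefix would be used in practice.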
Multiformats-varint is such a simple varint that there is no reason to point anywhere else. The Multiformats-varint should be specified by this working group alongside Multibase and Multihash. Any reference to Dwarf is simply unnecessary as it is clearer to specify Multiformats-varint rather than trying to describe it relative to a similar but non-identical varint.
The "Multihash Identifier Registry" at ietf.org/archive/id/draft-Multiformats-multihash-07.html#mh-registry duplicates the functionality of the IANA "Named Information Hash Algorithm Registry" at iana.org/assignments/named-information/named-information.xhtml#hash-alg, in that both assign (different) numeric identifiers for hash functions. If multihash goes forward, it should use the existing registry.
"Not all uses of these names require use of the full hash output -- truncated hashes can be safely used in some environments. For this reason, we define a new IANA registry for hash functions to be used with this specification so as not to mix strong and weak (truncated) hash algorithms in other protocol registries." -- rfc6920: Naming Things with Hashes
The named-information registry exists to identify a hash function and prefix length in the binary encoding of an
`ni://` or `nih://` name. That encoding is limited to a 6-bit field, but the Multiformats registry intends to support
more than 64 algorithm/size pairs.
hash | sizes |
---|---|
identity | 1 |
sha1 | 1 |
sha2 | 9 |
sha2a | 1 |
sha3 | 4 |
keccak | 5 |
blake3 | 1 |
md4 | 1 |
md5 | 1 |
blake2b | 64 |
blake2s | 32 |
skein256 | 32 |
skein512 | 64 |
skein1024 | 128 |
We can't fit hundreds of hash function/length pairs in a 64-entry registry. Reusing that registry would also break backwards compatibility, because it changes which numbers match which hash functions, and it would pollute the registry for rfc6920 implementors by including hash functions that are not cryptographically secure. Lastly, the Multiformats registry already contains more than 64 hash functions and would not fit in the Named Information Hash Algorithm Registry.
It is better to have hash function and length as two different fields as in Multihash.
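To make the two-field layout concrete, here is a minimal sketch of a Multihash value for sha2-256, whose registered code is `0x12`. The function name is hypothetical, and the `length` parameter is only there to show that the length travels separately from the function code; whether a given consumer accepts truncated digests is outside the scope of this sketch.

```python
import hashlib

SHA2_256 = 0x12  # multihash code for sha2-256


def multihash_sha2_256(data: bytes, length: int = 32) -> bytes:
    """Hash function code, digest length, then the digest bytes."""
    if not 1 <= length <= 32:
        raise ValueError("length must be between 1 and 32 bytes")
    digest = hashlib.sha256(data).digest()[:length]
    # Both fields are varints; values below 128 encode as a single byte,
    # so no general varint encoder is needed for this particular case.
    return bytes([SHA2_256, length]) + digest


full = multihash_sha2_256(b"")
short = multihash_sha2_256(b"", length=16)
print(full.hex())
print(short.hex())
```

Because the function code and the length are independent fields, adding a new digest size does not consume a new function identifier.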
It's concerning that the draft charter states that "Changing current Multiformat header assignments in a way that breaks backward compatibility with production deployments" is out of scope. Normally IETF working groups are given free rein to make improvements during the standardization process.
This may be a distinction without a difference. We certainly could empower the working group to make backwards-incompatible changes, but it would still try to avoid any unnecessary breaking changes.
Finally, as a member of the W3C DID and W3C Verifiable Credentials working groups, I will state that it is misleading for the draft charter to say that "The outputs from this Working Group are currently being used by … the W3C Verifiable Credentials Working Group, W3C Decentralized Identifiers Working Group…". The documents produced by these working groups intentionally contain no normative references to Multiformats or any data structures derived from them. Where they are referenced, it is explicitly stated that the references are non-normative.
This is a good note. The draft charter should probably be clear that Multiformats are being used in Verifiable Credentials and Decentralized Identifiers in production. There are multiple existing independent implementations of this technology enabling Verifiable Credentials and Decentralized Identifiers to be useful. While these specs contain no normative references, this registry provides the ability to make Verifiable Credentials and Decentralized Identifiers that are better decoupled from the data structures that they contain, and will therefore be flexible in the face of future evolution.
While I usually reserve my time and energy for advancing good ideas, I’m making an exception to publicly state the reasons why I believe “multiformats” should not be considered for standardization by the IETF.