w3c-ccg / multibase

An IETF Internet Draft for the Multibase data format
https://w3c-ccg.github.io/multibase/

Not all identifiers fit in a single byte #4

Open jyasskin opened 1 year ago

jyasskin commented 1 year ago

https://datatracker.ietf.org/doc/html/draft-multiformats-multibase-07#name-the-multibase-format says

> The encoding algorithm is a single character value that is always the first byte of the data.

However, the Multibase Algorithms Registry includes two entries where that's either not true or not obviously true.

This should all be consistent and clearly identify the initial byte values that identify the various algorithms.

jyasskin commented 1 year ago

To make this clear, it could make sense to adopt the convention in https://infra.spec.whatwg.org/#bytes, and identify all bytes with their hexadecimal value even if they're printable ASCII characters. So:

| Algorithm | Identifier byte (character) | Status | Specification |
|-----------|-----------------------------|--------|---------------|
| identity  | 0x00 (NUL)                  | active | 8-bit binary (encoder and decoder keeps data unmodified) |
| base2     | 0x30 (0)                    | active | binary (01010101) |

etc.

If you intend to allow byte sequences, https://infra.spec.whatwg.org/#byte-sequences provides a convention for that.
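
For concreteness, here's a rough sketch of what that convention would buy a decoder. This is a hypothetical illustration only; the table is a small subset of the registry, and the function name is mine:

```python
# Hypothetical single-byte dispatch table, keyed on the identifier byte.
# Illustrative subset only; not the normative registry.
MULTIBASE_PREFIXES = {
    0x00: "identity",   # NUL
    0x30: "base2",      # "0"
    0x66: "base16",     # "f"
    0x7A: "base58btc",  # "z"
}

def detect_base(encoded: bytes) -> str:
    """Identify the base-encoding from the first byte of the encoded data."""
    if not encoded:
        raise ValueError("empty multibase value")
    try:
        return MULTIBASE_PREFIXES[encoded[0]]
    except KeyError:
        raise ValueError(f"unknown multibase prefix byte 0x{encoded[0]:02X}")
```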

bumblefudge commented 1 year ago

After some research and discussions with others in the space, I think the "always the first byte" claim is actually false; it's pointedly contradicted by the somewhat glibly stated "z is z" comment about "encoding-independence" below the table here in the pre-IETF spec and in the original FAQ.

After talking to some people, I realized that the namespace of registrations is not single [UTF-8] bytes, which one might expect since the rest of the Multiformats sub-registries are organized around the namespace of UTF-8 bytecodes. Instead, each multibase code is a single UNICODE CODE POINT in the post-encoding string expression, not in the encoded bytes. As such, I think the whatwg reference should probably be to the #code-points section, not the #bytes section! Furthermore, I think I should PR in a rewrite of that "always" to express something more precise, like: *in the dominant context of UTF-8 encoded text*, multibase codes that correspond to a single byte in UTF-8 can be detected by their first byte. That would accommodate the two exceptions you mentioned, and perhaps future ones, as keys and hashes get bigger in the coming years and struggle to fit in the 63 characters of HTTP segments...

The base256emoji base actually serves as a test case for high-compression bases where UTF-8 is NOT the target character set... and the prefix was specifically chosen because it would not parse as a single UTF-8 byte; see the initial PR comment from the registrant.
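
As a quick, non-normative illustration of the code-point-versus-byte distinction (this is just Python string semantics, nothing multibase-specific):

```python
# The base256emoji prefix U+1F680 is one code point but four bytes in UTF-8,
# so it can never be confused with any of the single-byte (ASCII) prefixes.
prefix = "\U0001F680"                                  # the rocket code point
assert len(prefix) == 1                                # one Unicode code point
assert prefix.encode("utf-8") == b"\xf0\x9f\x9a\x80"   # four UTF-8 bytes
assert prefix.encode("utf-8")[0] >= 0x80               # outside the ASCII range
```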

I'll update that PR on the multiformats community repo and see if I get any more useful feedback before iterating the IETF candidate spec to match.

jyasskin commented 1 year ago

That makes some sense: you get the interface that a "base-encoding" is a mapping between byte sequences and "text", i.e. sequences of Unicode code points. You should write that definition in the draft RFC. But if you define it that way, how does the identity "base-encoding" work?
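
To sketch what I mean (the names here are mine, not the draft's):

```python
from typing import Protocol

class BaseEncoding(Protocol):
    """A base-encoding as a pair of functions between byte sequences and text."""
    def encode(self, data: bytes) -> str: ...   # byte sequence -> Unicode text
    def decode(self, text: str) -> bytes: ...   # Unicode text -> byte sequence
```

Under that shape, identity is the odd one out: NUL followed by arbitrary bytes is not, in general, valid text.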

bumblefudge commented 1 year ago

Oh definitely, I'm trying to update the draft RFC before IETF SF, but first I'm test-ballooning the substantial changes in the broader community by PRing the community repo first! If you have the bandwidth between now and then, I'd definitely appreciate a glance at my intermediate drafts!

As for the identity entry, I am honestly still researching that one and asking around; mental models here seem to vary.

One hint is that the emoji encoding is presumed to only be used in non-UTF-8 contexts and vice versa, as whatever code handles binary from one would barf on the other; thus the choice to use the LOWEST, rather than the HIGHEST, codepoint to allow recursion/inlining in base256emoji, which the other codecs seemingly do not allow?

Another is that none of the implementations sniff or detect the base-encoding; my hunch is that the identity entry is the only entry using the lowest codepoint as prefix in UTF-8, and is really designed to only be detectable/useful in "UTF-8 contexts", i.e. only detectable relative to the other UTF-8 entries. Or to put it another way, the first byte being NUL makes it a byte string and NOT any of the other entries. I'm still trying to ascertain whether this maps cleanly to the prefix/codepoint model of strings in whatwg...
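
A toy sketch of that mental model (my own illustration, assuming the encoded value arrives as bytes):

```python
# If the first byte is NUL, treat the remainder as raw, unmodified bytes;
# otherwise decode as UTF-8 and dispatch on the first code point.
def sniff(encoded: bytes):
    if encoded[:1] == b"\x00":
        return ("identity", encoded[1:])       # raw byte string, kept as-is
    first_code_point = encoded.decode("utf-8")[0]
    return ("code-point-prefixed", first_code_point)
```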

Another hint is that the only "default" entries (i.e. the only ones implemented across languages) are the ones optimized for and exclusively designed to handle these UTF-8 byte strings; similarly, avoiding UTF-8 collisions with the single-byte multiformats also seems to have been a (quiet) design constraint, even if the codes are considered canonical in Unicode rather than UTF-8. Which is to say that, for implementability and maintenance of libraries, it is probably practical to think of all non-UTF-8 contexts as one kettle of fish/problem-set and the rest as future extensions, reserved but orthogonal to the current primary use cases...

jyasskin commented 1 year ago

I'm interpreting the discussion of UTF-8 as meaning that the encoded byte sequences for most of the encodings are expected to be valid UTF-8 byte sequences that one could then UTF-8-decode into text (sequences of Unicode code points).

For the specification, I think you'll need a single interface for the specifications that the registry refers to. That is, either they encode byte sequences to byte sequences, or they encode byte sequences to text. It doesn't really work to say that most of them encode byte sequences to byte sequences, while the emoji one encodes byte sequences to text, because then you don't need multibase to distinguish the emoji encoding from the others: you just check the type of the data you have, and if it's text, you have the emoji encoding.

I think it'll work fine to say that the encoding identifiers are byte sequences rather than individual bytes, as long as you're careful that you don't assign both a sequence and one of its prefixes. The emoji encoding can be identified by the fact that the string starts with 0xF0 0x9F 0x9A 0x80. You could even identify a UTF-16BE encoding by the fact that it starts with 0xD8 0x3D 0xDE 0x80. You could only identify a UTF-16LE encoding if you ensure you never assign = as an algorithm identifier. UTF-32BE conflicts with the identity registration.
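
Sketched out, the rule would be that the set of identifiers must be prefix-free, i.e. no identifier is allowed to be a prefix of another. A minimal illustration (entries are an illustrative subset, not the registry):

```python
# Byte-sequence identifiers; must be prefix-free for detection to be unambiguous.
PREFIXES = {
    b"\x00": "identity",
    b"\x7a": "base58btc",                   # "z"
    b"\xf0\x9f\x9a\x80": "base256emoji",    # UTF-8 bytes of U+1F680
}

def check_prefix_free(prefixes) -> None:
    """Reject a registry in which one identifier is a prefix of another."""
    for a in prefixes:
        for b in prefixes:
            if a != b and b.startswith(a):
                raise ValueError(f"{a!r} is a prefix of {b!r}")

def detect(encoded: bytes) -> str:
    for prefix, name in PREFIXES.items():
        if encoded.startswith(prefix):
            return name
    raise ValueError("unknown multibase prefix")
```

Registering a hypothetical UTF-32BE prefix b"\x00\x01\xf6\x80" would trip check_prefix_free, because the identity identifier b"\x00" is a prefix of it; that is exactly the conflict noted above.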

msporny commented 1 year ago

Hey folks, sorry that I'm late to the party; too many GitHub notifications, can't keep up.

The text that @jyasskin is referring to was written before base256emoji came onto the scene, so, it's wrong now and needs to be rewritten. I haven't taken a look at your PR yet @bumblefudge... will do that after IETF 117 is over.

Regarding base256emoji, I can't tell how serious the registrant is (possibly trolling)... or whether they're trying to make the point that @jyasskin is making: that we can't just assume ASCII anymore.

I also have yet to get a clear answer from the IPFS community regarding the use of the identity byte (0x00). I don't understand why that's in the multicodec table, or the multihash table, but it feels like something that we might want to exclude.

IOW, perhaps we should reject the identity and base256emoji entries because they make everything more complicated than they need to be? At a minimum, we need to understand what the usage of the identity byte is in the table (which I auto-generate by reading the latest multicodec table in the spec). @bumblefudge, would it be possible for you to track down who has that knowledge inside of Protocol Labs / IPFS community?

bumblefudge commented 1 year ago

Agreed: at the very least I'd put base256emoji as experimental, not draft, and identity as RESERVED, not FINAL (since its function is the detection of other codes). I've got some explanatory text coming. I'll PR this repo as well once an expert has reviewed the upstream PR to confirm my edits!

Honestly, I think the original intent of both entries would have been clearer if the original table had availed itself of the richer dialect of IANA registration statuses as laid out in RFC 8126.