EIP1577 - Multiaddr support for ENS

daviddias commented 5 years ago

Conversation happening at https://ethereum-magicians.org/t/eip1577-multiaddr-support-for-ens/1969

raulk commented 5 years ago

Was reviewing EIP-1577. My comments:

I sense that multicodec and multiaddr are being mixed in a way I don't quite follow.
I was confused by the separate ensdomains/multicodec repo.
The EIP says:

The encoding of the value depends on the content type specified by the protoCode; for instance, types in the range 0x00-0xf0 are encoded using multihash, meaning their value is formatted as follows [...]

I am assuming they refer to the ensdomains/multicodec codelist, but I don't find anything that points to specific ranges being reserved for specific purposes.

In the fallback section, it's not clear what "the multiaddr interface" refers to. The spec talks about multicodec and that's the first time that multiaddr is mentioned.

During my research process I noticed our prose concerning the interrelationship of multicodec, multiaddr and multihash could be a bit off. Or at least it feels imprecise (to this reader).

For example, the multicodec README says:

A chunk of data identified by multicodec will look like this:

<multicodec><encoded-data> (1)

It is worth noting that multicodec works very well in conjunction with multihash and multiaddr, as you can prefix those values with a multicodec to tell what they are.

Then, in the multihash README, the values are described as:

<varint hash function code><varint digest size in bytes><hash function output> (2)

However, based on my understanding:

<multicodec> (1) == <varint hash function code> (2)
<encoded-data> (1) == <varint digest size in bytes><hash function output> (2)

If my interpretation is correct, wouldn't it be more precise to state that multiaddr and multihash embed multicodec codepoints? The term "prefix" is misleading.
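
To make the mapping between (1) and (2) concrete, here is a minimal Python sketch that builds and splits a sha2-256 multihash. It assumes the varint is the unsigned LEB128 encoding and that 0x12 is the multicodec code for sha2-256; the helper names (`encode_varint`, `decode_varint`, `multihash_sha256`, `split_multihash`) are invented for this illustration and do not come from any multiformats library.

```python
import hashlib

SHA2_256 = 0x12  # assumed multicodec code for sha2-256


def encode_varint(n):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode a varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def multihash_sha256(data):
    """<varint hash function code><varint digest size><hash function output>"""
    digest = hashlib.sha256(data).digest()
    return encode_varint(SHA2_256) + encode_varint(len(digest)) + digest


def split_multihash(mh):
    """Split a multihash into (code, size, digest)."""
    code, pos = decode_varint(mh)
    size, pos = decode_varint(mh, pos)
    return code, size, mh[pos:pos + size]
```

In this reading, the first varint returned by `split_multihash` plays the role of `<multicodec>` in (1), and the size plus digest play the role of `<encoded-data>`.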

Stebalien commented 5 years ago

I sense that multicodec and multiaddr are being mixed in a way I don't quite follow.

I believe the confusion stems from the fact that multiaddrs are paths and /ipfs/Qm..., /ipns/Qm... are paths (/ipfs/Qm... is even a valid multiaddr). Worse, multiaddr has the word "address" in it but we don't use them to address content.

If my interpretation is correct, wouldn't it be more precise to state that multiaddr and multihash embed multicodec codepoints? The term "prefix" is misleading.

I'm not sure I follow. We say "prefix" because a multihash is <varint codec><length><data> and a multiaddr is <varint multicodec><stuff>. In both cases, we prefix some data with a multicodec to create either a multihash or a multiaddr.

Note: this also came up here: https://github.com/multiformats/multiaddr/issues/73

raulk commented 5 years ago

In both cases, we prefix some data with a multicodec to create either a multihash or a multiaddr.

@Stebalien I'm reading the README as a spec; maybe I shouldn't. But read through that lens, if multihash is defined as <varint codec><length><data> and multiaddr as <varint multicodec><stuff>, then multicodec does not prefix multihash and multiaddr; multicodec is the prefix. Personally, I tripped over that, thinking that multicodec somehow encapsulates the other two.

In the case of multiaddr, given that the <varint multicodec><stuff> groups repeat, saying that the codec is the prefix falls short. Yes, all composed multiaddrs will start with a multicodec, but there will be more instances in a single composed multiaddr, right?

Stebalien commented 5 years ago

Ah, I see. You're right, we don't prefix the multihash, we prefix the length-delimited hash digest.

Yes, all composed multiaddrs will start with a multicodec, but there will be more instances in a single composed multiaddr, right?

So... this is one of the issues with multiaddrs. Without understanding what each protocol code means, I can't break a multiaddr into a sequence of (codec, value) pairs. Really, it's better to think about multiaddrs as being recursively defined. That is, to parse a binary multiaddr, you:

Read off the multicodec.
Look up the protocol definition for that codec.
Pass everything else to that codec.
The codec should consume as much as it wants.
The codec should then recursively parse everything it doesn't want as a multiaddr.
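
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the reference implementation: the `PROTOCOLS` table and the helpers `fixed` and `length_prefixed` are hypothetical, the protocol codes and value sizes are taken from my reading of the multiaddr protocol table, and here the driver function (rather than the codec itself) recurses on whatever the codec leaves unconsumed.

```python
def decode_varint(buf, pos=0):
    """Decode an unsigned LEB128 varint; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def fixed(n):
    """Build a consumer that takes exactly n bytes as the value."""
    def consume(rest):
        if len(rest) < n:
            raise ValueError("truncated value")
        return rest[:n], rest[n:]
    return consume


def length_prefixed(rest):
    """Consumer for values stored as <varint size><bytes>."""
    size, pos = decode_varint(rest)
    if len(rest) < pos + size:
        raise ValueError("truncated value")
    return rest[pos:pos + size], rest[pos + size:]


# Hypothetical protocol table: code -> (name, consumer).
PROTOCOLS = {
    4: ("ip4", fixed(4)),
    6: ("tcp", fixed(2)),
    273: ("udp", fixed(2)),
    460: ("quic", fixed(0)),       # no argument
    400: ("unix", length_prefixed),
}


def parse_multiaddr(buf):
    """Return a list of (protocol name, raw value bytes) pairs."""
    if not buf:
        return []
    code, pos = decode_varint(buf)           # 1. read off the multicodec
    if code not in PROTOCOLS:                # 2. look up its definition
        raise ValueError("unknown protocol code %d" % code)
    name, consume = PROTOCOLS[code]
    value, rest = consume(buf[pos:])         # 3-4. codec consumes what it wants
    return [(name, value)] + parse_multiaddr(rest)  # 5. recurse on the remainder
```

Note that an unknown code stops the parse entirely: without the protocol definition there is no way to tell where that component's value ends.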

Really, this also applies to strings as a single multiaddr "component" could either be:

/ip4/1.2.3.4
/quic (no argument)
/unix/a/b/c/d.... (everything after /unix is an argument).
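
The same dependence on protocol definitions shows up when splitting the string form. A minimal sketch, assuming a hypothetical `STRING_PROTOCOLS` table that records whether a protocol takes one path segment, no argument, or everything that follows:

```python
# Hypothetical table: protocol name -> how much of the string it consumes.
STRING_PROTOCOLS = {
    "ip4": "one",    # one segment, e.g. /ip4/1.2.3.4
    "tcp": "one",
    "udp": "one",
    "quic": "none",  # no argument
    "unix": "rest",  # everything after /unix is the argument
}


def parse_multiaddr_string(text):
    """Split a textual multiaddr into (name, value) pairs."""
    if not text.startswith("/"):
        raise ValueError("multiaddr must start with '/'")
    segments = text[1:].split("/")
    components = []
    i = 0
    while i < len(segments):
        name = segments[i]
        kind = STRING_PROTOCOLS.get(name)
        if kind is None:
            raise ValueError("unknown protocol %r" % name)
        i += 1
        if kind == "none":
            components.append((name, None))
        elif kind == "one":
            if i >= len(segments):
                raise ValueError("missing value for %r" % name)
            components.append((name, segments[i]))
            i += 1
        else:  # "rest": swallow all remaining segments as a path
            components.append((name, "/" + "/".join(segments[i:])))
            i = len(segments)
    return components
```

Without the table, `/unix/a/b/c/d` is indistinguishable from a chain of protocols named `a`, `b`, `c` and `d`.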

raulk commented 5 years ago

@Stebalien have we considered introducing boundary markers for values? This could solve both the nesting problem and the value-identification problem at once. Using square brackets as the textual representation:

/unix[/a/b/c/d]/grpc
/ip4[1.2.3.4]/udp[8888]/quic
/ip4[1.2.3.4]/tcp[8888]/p2p-circuit[/ip4[2.3.4.5]/udp[9999]] (nesting example)
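
To show why the markers help, here is a rough sketch of a parser for this bracketed notation. The notation is only the proposal above, not part of any multiaddr spec, and `parse_bracketed` is a name made up for this example. The point is that it needs no protocol table: the brackets alone delimit each value.

```python
def parse_bracketed(text):
    """Split '/name[value]/name...' into (name, value) pairs.

    Nested brackets are kept inside the value, so a nested address
    can be parsed by calling parse_bracketed on the value again.
    """
    components = []
    pos = 0
    while pos < len(text):
        if text[pos] != "/":
            raise ValueError("expected '/' at offset %d" % pos)
        pos += 1
        start = pos
        while pos < len(text) and text[pos] not in "/[":
            pos += 1
        name = text[start:pos]
        if not name:
            raise ValueError("empty protocol name")
        value = None
        if pos < len(text) and text[pos] == "[":
            depth = 1
            pos += 1
            value_start = pos
            # Scan to the bracket that closes this value, tracking nesting.
            while pos < len(text) and depth:
                if text[pos] == "[":
                    depth += 1
                elif text[pos] == "]":
                    depth -= 1
                pos += 1
            if depth:
                raise ValueError("unbalanced brackets")
            value = text[value_start:pos - 1]
        components.append((name, value))
    return components
```

For the nesting example, the last component's value comes back as the string `/ip4[2.3.4.5]/udp[9999]`, which can be fed to `parse_bracketed` again to recover the inner address.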

Stebalien commented 5 years ago

We've discussed it for embedding path values (/unix/[/...]/...) but not in other cases.

The real issue with brackets is that they kind of break the path abstraction (coming from the plan9 "everything lives in the filesystem namespace" camp).

If you're interested in some ramblings on this subject... https://gist.github.com/4764975c3b5ea33202324d8e9ec0985d (not posting a PR/issue as I don't want to add too much confusion to the discussion unless we decide to go with it).

(note: defining these recursively is really just my opinion; by spec, multiaddrs are still defined as a list).