multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.
MIT License

Codecs for ENS Contenthash: URI [0xF2] and Data URL [0xF3] #353

Open adraffy opened 3 months ago

adraffy commented 3 months ago

ENS (Ethereum Name Service) encodes contenthash() using multicodec. The purpose of a contenthash() is to describe the web contents for a corresponding ENS name.

Currently, ENS supports IPFS, IPNS, Swarm, Arweave, Onion, etc.
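
As a rough illustration of how a client might tell these namespaces apart, here is a minimal sketch that reads the leading unsigned varint of a contenthash and looks it up. The table `NAMESPACES` and the helpers `readVarint`, `hexToBytes` and `contenthashNamespace` are hypothetical names, and only three codes from the multicodec table (ipfs-ns 0xe3, swarm-ns 0xe4, ipns-ns 0xe5) are listed; it is not taken from any ENS library.

```javascript
// Hypothetical lookup table: only three namespace codes are listed here.
const NAMESPACES = new Map([
  [0xe3, "ipfs-ns"],
  [0xe4, "swarm-ns"],
  [0xe5, "ipns-ns"],
]);

// Read one unsigned varint (7 bits per byte, high bit = "more bytes follow").
function readVarint(bytes, offset = 0) {
  let value = 0;
  let shift = 0;
  for (let i = offset; i < bytes.length; i++) {
    const byte = bytes[i];
    value += (byte & 0x7f) * 2 ** shift; // multiplication avoids 32-bit overflow of <<
    shift += 7;
    if ((byte & 0x80) === 0) {
      return { value, length: i - offset + 1 };
    }
  }
  throw new Error("truncated varint");
}

// Convert a hex string (with or without a leading "0x") to bytes.
function hexToBytes(hex) {
  const clean = hex.startsWith("0x") ? hex.slice(2) : hex;
  return Uint8Array.from(clean.match(/.{2}/g) ?? [], (h) => parseInt(h, 16));
}

// Split a contenthash into its namespace code and the remaining payload.
function contenthashNamespace(hex) {
  const bytes = hexToBytes(hex);
  const { value, length } = readVarint(bytes);
  return {
    code: value,
    name: NAMESPACES.get(value) ?? "unknown",
    payload: bytes.slice(length),
  };
}
```

For example, `contenthashNamespace("0xe3010170")` decodes the two bytes `e3 01` as code 0xe3 and leaves `01 70` as the payload.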

Example using IPFS:

We would like to support the following (2) new codecs:

rvagg commented 3 months ago

I think this seems reasonable, though novel. I'm not so sure about introducing a new tag, data, for this though. Would namespace be OK for that as well? Even that doesn't map super cleanly onto what you're doing here.

Do you think you'll want more of these in the future? I wonder, if we can't figure out a better tag, whether this should just be an entirely new classification.

@vmx, what do you think?

vmx commented 3 months ago

I wonder if URI could use a Multiaddress instead. Would that be an option? (I know too little about the Eth/ENS ecosystem.)

adraffy commented 3 months ago

namespace works. I'd be happy to change it to whatever you suggest.

IMO, the closest codec is json which oddly uses tag:ipld.

I picked tag:data as unlike most codecs, data-uri is both a codec and the data itself.


I think tag:multiaddr for uri suggests too much internal encoding, as we want something maximally general (a literal UTF-8 string) where the content is ultimately validated by the client (since URL standards are ever-evolving)

vmx commented 3 months ago

I think tag:multiaddr for uri suggests too much internal encoding, as we want something maximally general (a literal UTF-8 string) where the content is ultimately validated by the client (since URL standards are ever-evolving)

Keeping it simple makes sense.
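
To make the "literal UTF-8 string" idea concrete, here is a minimal sketch of what such an encoding could look like. It assumes the code 0xF2 from the issue title (a proposal, not an assigned value) and assumes the payload is the bare UTF-8 string with no length prefix; `encodeUriContenthash` and `decodeUriContenthash` are hypothetical names.

```javascript
// Assumption: 0xF2 is the code proposed in this issue, not a final assignment.
const URI_CODE = 0xf2;

// Encode a non-negative integer as an unsigned varint.
function encodeVarint(n) {
  const out = [];
  while (n >= 0x80) {
    out.push((n & 0x7f) | 0x80); // low 7 bits, continuation bit set
    n = Math.floor(n / 128);
  }
  out.push(n);
  return out;
}

// Assumed layout: varint(code) followed directly by the UTF-8 bytes of the URI.
function encodeUriContenthash(uri) {
  const payload = new TextEncoder().encode(uri);
  return Uint8Array.from([...encodeVarint(URI_CODE), ...payload]);
}

// Check the prefix and return the string; validating the URI is left to the client.
function decodeUriContenthash(bytes) {
  const prefix = encodeVarint(URI_CODE);
  if (!prefix.every((b, i) => bytes[i] === b)) {
    throw new Error("not a URI contenthash");
  }
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes.slice(prefix.length));
}
```

Because 0xF2 is above 0x7F, its varint form takes two bytes (`f2 01`), so the overhead over the plain string is two bytes under these assumptions.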

aschmahmann commented 3 months ago

IIUC this is related to https://discuss.ens.domains/t/draft-ensip-17-datauri-format-in-contenthash/18048/28 and https://github.com/ensdomains/docs/pull/165.

Apologies for the long text, I'm going to be OOO for a couple days and wanted to make sure to leave some context. cc @lidel who has been involved in the ENS work and interop here since long before me 😅.

TLDR:

Some thoughts:

URI

I wonder if URI could use a Multiaddress instead

Probably not multiaddress itself, but harmonization with something like multipath https://github.com/multiformats/multiformats/pull/55 would likely make this work and be pretty sensible. It would likely also let us use the 0x2f as an escape hatch for people generally wanting to use/experiment with strings rather than code numbers which is what this roughly does (otherwise, the codes like for http could potentially be used instead).

FWIW libp2p has recently proposed going the other way as well (i.e. representing multiaddrs as URIs https://github.com/multiformats/multiaddr/pull/171).

I don't in principle have an objection to a URI based namespace, the two byte range is probably fine although URIs could probably tolerate even three due to the size of the data.

Perhaps more of an ENS-related comment, but want to call out:

  1. There is some redundancy here because for any namespace (IPFS, Swarm, etc.) you could encode under the URI namespace or under their individual namespace. Not necessarily a big issue here, but certainly a change implementations will need to take care of
  2. Related to ^ it seems like this could have always been the case, I'm not sure the historical context here but probably worth validating with folks who did this in earlier rounds that this makes sense. Totally fair + reasonable to say we want to save some bytes with known namespaces and then have the utf-8 URI escape hatch (although I don't know if "contenthash" is a reasonable name for this kind of thing 🙃).

Data URL

Seems fine, although maybe the three byte range (along with arweave, skynet, etc.) makes more sense here given these will likely be larger anyhow.

A few comments / thoughts:

  1. Given the above, technically this already works as a Data URI (https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs), right? If so, I assume the idea is to save space by not needing to do base64 encoding.
  2. While saving some bytes here seems fine, this seems non-optimal in that it isn't as compact as it could be (e.g. mime types are still expressed in text), isn't flexible enough to include any other metadata, and we couldn't work around it within the existing namespaces (the URI namespace adds a sort of escape hatch here as long as you assume names won't collide).
    • In my bias as someone who works on the IPFS project IMO this could've/should've been resolved by having the tooling for this either in IPFS (either in UnixFS, CID, or another IPLD format), and this seems like as good a time as any to resolve it independently of what happens in this PR (although it may justify bumping to 3 bytes)
      • A CID with the identity multihash and raw codec (or sometimes codecs like JSON or CBOR) would've been sufficient except for the need of a mimetype
      • Technically this could be resolved in a few different ways, one is https://github.com/ipfs/specs/issues/257, note: the latest request here came from the ENS community as well so definitely seems like a good opportunity to chat anyhow
      • Given the very large number of ENS contenthash records that are IPFS-based this seems like something we could/should fix or the hack within ENS (whether in ENS or the "contenthash" namespace could fix either)
      • I understand this isn't really the place for an ENSIP comment and with my "multiformats hat" I don't have objection, but if you want to chat would definitely be happy to
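
The base64 point in item 1 above can be quantified with a small sketch. It assumes Node.js (for the global `Buffer`) and compares the size of a binary body with the size of the same body written as a base64 data URL; `dataUrlSizes` is a hypothetical helper, not part of any proposal.

```javascript
// Compare raw body size against the size of the equivalent base64 data URL.
function dataUrlSizes(mimeType, body) {
  const raw = Buffer.from(body);
  const base64Url = `data:${mimeType};base64,${raw.toString("base64")}`;
  return {
    rawBytes: raw.length,
    base64UrlBytes: Buffer.byteLength(base64Url),
  };
}
```

Base64 emits 4 characters for every 3 input bytes, so a 300-byte body becomes 400 characters before the `data:...;base64,` prefix is counted.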

0xc0de4c0ffee commented 3 months ago

IMO, the closest codec is json which oddly uses tag:ipld.

everything is IPLD 😄

🙏 everyone, I'm one of the authors of that data:uri ENSIP draft proposal, https://discuss.ens.domains/t/draft-ensip-17-datauri-format-in-contenthash/18048 using simple namespace hex("data:") format.

We did our homework before sending draft over ENS forum to make an exception for hex("data:") prefix for reasons below..

a) mime/content type support in cidv1 is pending for loong time (?wen cidv2?)

https://github.com/multiformats/multicodec/pull/159 https://github.com/multiformats/multicodec/issues/4

b) ENS already supports string(data:uri) format in avatar records, so contenthash with plaintext bytes(data:uri) as hex("data:") namespace is full RFC2397 & it won't collide with cidv1 namespaces. https://datatracker.ietf.org/doc/html/rfc2397

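// hex() is a helper that hex-encodes a string, e.g. hex("data:") === "646174613a"
const hex = (s) => Buffer.from(s).toString("hex");
if(contenthash.startsWith("e301")){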
    //ipfs
} else if(contenthash.startsWith("e501")){
    //ipns
}
// else... other contenthash namespaces...
else if(contenthash.startsWith(hex("data:"))){
    //datauri
}

ENS is not ready for such changes with new ENSIP specs, all contenthash MUST follow namespace+CIDv1 format. && we're back to square one, using raw data in cidv1 with IPFS namespace.

our current working specs for on-chain raw IPFS+CIDv1 generator without content/mime types..

import { encode, decode } from "@ensdomains/content-hash";
import { CID } from 'multiformats/cid'
import { identity } from 'multiformats/hashes/identity'
import * as cbor from '@ipld/dag-cbor' // used by the dag-cbor link example
import * as json from 'multiformats/codecs/json'
import * as raw from 'multiformats/codecs/raw'
const utf8 = new TextEncoder()

const json_data = {"hello":"world"}
const json_cid = CID.create(1, json.code, identity.digest(json.encode(json_data)))
const html_data = "<h1>Hello World</h1>";
const html_cid = CID.create(1, raw.code, identity.digest(utf8.encode(html_data)))

This all works OK using json/raw data. The only downside is that there's no content/type in CIDv1, so we have to parse/guess magic bytes in the raw data on the client side OR request IPFS gateways to resolve that.
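
For illustration, client-side guessing of this kind might look like the sketch below. `sniffContentType` is a hypothetical helper with a deliberately tiny signature list; it is a heuristic, not a replacement for a real content type (for instance, it would label SVG or XML as HTML).

```javascript
// Guess a content type from the first bytes of a raw payload (heuristic only).
function sniffContentType(bytes) {
  const startsWith = (sig) => sig.every((b, i) => bytes[i] === b);
  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return "image/png"; // "\x89PNG"
  if (startsWith([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return "image/gif"; // "GIF8"
  if (startsWith([0x25, 0x50, 0x44, 0x46])) return "application/pdf"; // "%PDF"

  // Fall back to looking at the first non-whitespace character of the text.
  const text = new TextDecoder().decode(bytes.slice(0, 64)).trimStart();
  if (text.startsWith("<")) return "text/html";
  if (text.startsWith("{") || text.startsWith("[")) return "application/json";
  return "application/octet-stream";
}
```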

We can even use dag-cbor to link multiple files/IPFS CIDs, but on public IPFS gateways there's no index file or IPFS _redirects support, so we have to happily decode that on our "smart" clients for now.

const blog = CID.parse("bafybeidnycldkehcy6xixzqg72vad6pitav4lk5np3ev6tr6titlkvfpvi")
let link = { json: json_cid, "/": html_cid, "index.html": html_cid, blog: blog }
let cbor_link = CID.create(1, cbor.code, identity.digest(cbor.encode(link)))

Back to @adraffy's f3 namespace, I'd suggest this format..

const data_uri = "data:text/html,<html>hello</html>";
const data_cid = CID.create(1, raw.code, identity.digest(utf8.encode(data_uri)))
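
As a dependency-free sketch of what this identity CID looks like on the wire, the bytes can be assembled by hand: CIDv1 version (0x01), raw codec (0x55), identity multihash code (0x00), a varint length, then the data itself. The helper names are hypothetical, and prefixing the result with the ipfs-ns bytes `e3 01` to form a contenthash is shown as an assumption about how it would be stored.

```javascript
const CID_V1 = 0x01;        // CID version 1
const RAW_CODEC = 0x55;     // multicodec "raw"
const IDENTITY_HASH = 0x00; // multihash "identity": the digest is the data itself
const IPFS_NS = [0xe3, 0x01]; // varint of the ipfs-ns code 0xe3

// Encode a non-negative integer as an unsigned varint.
function encodeVarint(n) {
  const out = [];
  while (n >= 0x80) {
    out.push((n & 0x7f) | 0x80);
    n = Math.floor(n / 128);
  }
  out.push(n);
  return out;
}

// CIDv1 + raw codec + identity multihash (code, length, data).
function identityRawCidBytes(data) {
  return Uint8Array.from([
    CID_V1,
    RAW_CODEC,
    IDENTITY_HASH,
    ...encodeVarint(data.length),
    ...data,
  ]);
}

function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}

const dataUri = "data:text/html,<html>hello</html>";
const cidBytes = identityRawCidBytes(new TextEncoder().encode(dataUri));
// Assumption: the contenthash is the ipfs-ns prefix followed by the CID bytes.
const contenthash = "0x" + toHex([...IPFS_NS, ...cidBytes]);
```

The data URI above is 33 bytes long, so the CID is 4 header bytes plus 33 data bytes; the whole string is carried inline rather than hashed.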

adraffy commented 3 months ago

@aschmahmann and @0xc0de4c0ffee thanks for the feedback.

As for codec numbers, I'd be happy with any assignment. Initially picked lower numbers since these two codecs seem useful beyond ENS.

Yes, you could put both ipfs://... and data:... into uri; however, there is a difference w/r/t how they are handled and interpreted. These details were not included as they are ENS application-specific, but possibly the codec names should reflect that, e.g. Redirect URI.

From the ENS + web content perspective:

You are correct about the base64 overhead concern, but there are also URL length limits (vs body).

Coffee, I put your response on ENS forum