ipfs / specs

Technical specifications for the IPFS protocol stack
https://specs.ipfs.tech

Experimental Proposal: CIDv1 -- IPLD, Multicodec-packed, and more #130

Closed jbenet closed 6 years ago

jbenet commented 8 years ago

READ THIS PARAGRAPH FIRST

Hey everyone, the below is a proposal for some changes to IPFS, IPLD, and how we link to data structures. It would address a bunch of open problems that have been identified, and improve the use, tooling, and model of IPLD to allow lots of what people have been requesting for months. Please review and leave comments. We feel pretty strongly about this being a good solution, but we're not sure if we're just drinking the koolaid and going to make things worse. Sanity check before we move further pls? Also my apologies, i would spend more time writing up a better version but i just dont have enough time right now and time is of the essence on this.


[EXPERIMENTAL PROPOSAL] CIDv1 -- Important Updates to IPFS, IPLD, Multicodec, and more.

IPFS migration path to IPLD (CBOR) from MerkleDAG (ProtoBuf)

Multicodec Packed Representation

It is useful to have a compact version of multicodec, for use in small identifiers. This compact identifier will just be a single varint, looked up in a table. Different applications can use different tables. We should probably have one common table for well-known formats.

We will establish a table for common authenticated data structure formats, for example: IPFS v0 Merkledag, CBOR IPLD, Git, Bitcoin, and more. The table is a simple varint lookup.
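A minimal sketch of the varint lookup described above, assuming an unsigned LEB128-style varint (7 bits per byte, high bit as continuation flag). The table contents and numeric codes below are illustrative assumptions, not values assigned by this proposal.

```python
from typing import Tuple

# Illustrative table only: these codes and names are assumptions for the example.
PACKED_TABLE = {
    0x55: "raw",
    0x70: "protobuf-merkledag",
    0x71: "cbor-ipld",
    0x78: "git",
}


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, offset of the first byte after the varint)."""
    value, shift, pos = 0, 0, offset
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def lookup_codec(buf: bytes) -> Tuple[str, bytes]:
    """Read the packed codec prefix and return (codec name, remaining bytes)."""
    code, pos = decode_varint(buf)
    return PACKED_TABLE[code], buf[pos:]
```

Any code below 128 fits in a single byte, which is where the compactness comes from.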

IPLD Links Updates (new format)

Open Problems (Motivation)

IPLD allows content to be stored in multiple different formats, and thus we need a way to understand what kind of content is being loaded in when traversing a link. A problematic issue is that old ipfs content (protobuf merkledag) does not use multicodec. It makes it difficult to distinguish between the new CBOR IPLD objects and the old Protobuf objects.

It has been proposed earlier that we wrap protobuf objects with a multicodec. But this is a problem, because the protobuf multicodec would not be authenticated. This is further complicated because many people have been requesting the ability to address raw leaf objects directly (that is, a hash linking to raw content, without ipld nor protobuf wrapping). This is a nice thing to have, but introduces difficulty in distinguishing between a protobuf or a raw encoded object, particularly when neither has a multicodec header which is authenticated by the object's hash. This lack of authentication is an attack vector: adversaries may provide protobuf objects with a raw multicodec, and depending on how implementations handle the multicodec, may poison an implementation's object repo.

Another important performance constraint is that multicodec headers are quite large: /ipld/cbor/v0, for example, is 13 bytes, which is way too large for many applications of small data. Instead, we would like to be able to use a compact multicodec representation ("multicodec packed", a single varint) to distinguish the formats, so that encoded objects are wrapped with minimal overhead. Note that this still does not affect protobuf or raw objects because these do not include headers.
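To make the size argument concrete, here is a small sketch comparing the share of a stored object taken up by the textual header versus a one-byte packed code. The packed value 0x71 and the 32-byte payload size are assumptions chosen for illustration.

```python
TEXT_HEADER = b"/ipld/cbor/v0"   # the 13-byte path-style header from the text
PACKED_HEADER = bytes([0x71])    # hypothetical one-byte packed code


def overhead_ratio(header: bytes, payload_len: int) -> float:
    """Fraction of the stored bytes that is header rather than payload."""
    return len(header) / (len(header) + payload_len)


small_payload = 32  # assumed size of a small object, in bytes
text_ratio = overhead_ratio(TEXT_HEADER, small_payload)
packed_ratio = overhead_ratio(PACKED_HEADER, small_payload)
```

For small objects the textual header is a large fraction of the total, while the packed form stays at a single byte regardless of the format name.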

Additional complications include how bitswap sends or identifies blocks, how a DagStore can pull out the object for a multihash and know what format encoding to use for it (eg raw vs protobuf), whether to allow linking from one object type to another, support for multiple base encodings for links, among others.

In discussions we (@jbenet, @diasdavid, and @whyrusleeping) reviewed many different possibilities. We considered how each affected linking data, wrapping the data with multicodec, storing it that way under the many layers of abstraction (dag store, blockstore, datastore, file systems), fetching and retrieving objects, knowing what format to use when, ensuring values are authenticated and not opening up vectors for attackers to poison repos, and more.

In the end, we came up with a few small changes to how we represent IPLD links that solve all our problems (tm) \o/. These are:

It is worth crediting many people here that have tirelessly pushed hard to get a bunch of these ideas out: @davidar, @mildred, and @nicola, to name a few, but many others too. They haven't looked at this yet, though. This post is the first they'll hear of this construction, and they may very well hate this particular combination of ideas :) Please be direct with feedback, the sooner the better.

IPLD Links learn about Base Encoding

We propose adding a multibase prefix to representations of IPLD links. This is particularly important where the encoding is not binary.

At this time, we recommend not including it in direct storage, where it should be binary. However, it may be found during the course of review that it is better to always retain the multibase prefix, even when storing in binary.

This change is a much-requested option to support multiple encodings for the hashes. Current links use base58 by default, which works well in URLs because it contains no unsupported characters and is easy to copy and paste. For performance reasons, however, it is not always the best format. Some users already encode IPFS multihashes in other bases, and therefore it would be ideal to have all IPFS and IPLD tooling support these encodings through multibase, avoiding confusing failures.
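A rough sketch of what a multibase-prefixed string could look like: one leading character names the base, and the rest is the payload in that base. The prefix characters used here ('f' for lowercase hex, 'b' for lowercase unpadded base32, 'z' for base58 with the Bitcoin alphabet) are assumptions for the example, and the helper names are hypothetical.

```python
import base64

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    out = ""
    while n:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    pad = len(data) - len(data.lstrip(b"\x00"))  # each leading zero byte becomes '1'
    return "1" * pad + out


def b58decode(text: str) -> bytes:
    n = 0
    for ch in text:
        n = n * 58 + B58_ALPHABET.index(ch)
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    pad = len(text) - len(text.lstrip("1"))
    return b"\x00" * pad + body


# Assumed prefix characters; the real assignments belong to the multibase table.
ENCODERS = {
    "f": lambda b: b.hex(),
    "b": lambda b: base64.b32encode(b).decode("ascii").lower().rstrip("="),
    "z": b58encode,
}
DECODERS = {
    "f": bytes.fromhex,
    "b": lambda s: base64.b32decode(s.upper() + "=" * (-len(s) % 8)),
    "z": b58decode,
}


def multibase_encode(prefix: str, data: bytes) -> str:
    return prefix + ENCODERS[prefix](data)


def multibase_decode(text: str) -> bytes:
    # The first character alone decides how the remainder is interpreted.
    return DECODERS[text[0]](text[1:])
```

Because the base travels with the string, tooling can accept any supported encoding without guessing.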

IPLD Links acquire a version

The fact that we propose changes to the basic link structure here reminds us of the basic multiformats principle:

"Never going to change" considered harmful.

Therefore we deem it wise to ensure that henceforth we include a version, so that evolution can be simple and not complex. The changes below suggest a way to distinguish between old and new links, but we should avoid such situations in the future, as this approach leverages knowledge about multihash distributions in the wild. This will be less feasible in the future.

IPLD Links learn about Codecs

The most important component of these changes introduces a multicodec-packed varint prefix to the link, to signal the encoding of the linked-to object. This enables the link to carry information about the data it points to, and ensure it is interpreted correctly. This ensures that the multicodec prefix is NOT necessary for interpretation of an IPLD object, as the link to the object carries information for its interpretation.

All proper IPLD formats (cbor and on) should carry the multicodec header at the beginning of their serialized representation, which authenticates the header and ensures clients can interpret the object without even having a link. But, this is not possible with objects of formats created before the IPLD spec, such as the first merkledag protobuf object codec in IPFS (go-ipfs 0.4.x and below). This includes also objects from other authenticated data structure distributed systems, such as Git, Bitcoin, Ethereum, and more. Finally, raw data -- which many hope to be able to address directly in IPLD -- cannot carry an authenticated prefix either.

The approach of adding the multicodec to the link entirely side-steps the problem of not being able to authenticate multicodec headers for protobufs, git, bitcoin, or raw data objects. And this avoids a nasty repo poisoning attack, possible in other proposed suggestions that rely on an unauthenticated multicodec header (carried along with the object) to determine the type of an object.

This also ensures that IPLD objects can still be content-addressed nicely, without needing to also store codec metadata alongside.

This change has been long-proposed in other forms. These other forms usually suggested attaching a @multicodec key to IPLD link objects (as a property on or next to the link), which was cumbersome and introduced complexity in other ways. In particular, it was not easy to carry this info over to a URL or copy-pasted identifier.

This multicodec-packed prefix will be sampled from a special table, maintained along with the IPLD spec. This table is expandable over time. A global multicodec table could grow from this one, or start separately.

Content IDs

This document will use the words Content IDs or CIDs. This abstraction is useful here but may not be useful beyond it. Another word -- albeit much less precise -- may be IPLD Link.

Other options are:

Let the old base58 multihash links to protobuf data be called CID Version 0.

CIDs Version 1 (new)

Putting together the IPLD Link update statements above, we can term the new handle for IPLD data CID Version 1, with a multibase prefix, a version, a packed multicodec, and a multihash.

<mbase><version><mcodec><mhash>

Where:

Note that all CIDs v1 and on should always begin with <mbase><version>, which lets the format evolve nicely.
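The <mbase><version><mcodec><mhash> layout can be sketched as follows. This is only an illustration under stated assumptions: the version and codec are written as unsigned varints, the string form uses lowercase hex behind an assumed 'f' multibase prefix, and the codec value 0x71 is a placeholder rather than an assigned code. The function names are hypothetical.

```python
import hashlib
from typing import Tuple

SHA2_256_CODE = 0x12  # multihash function code for sha2-256


def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    value, shift, pos = 0, 0, offset
    while True:
        byte = buf[pos]  # raises IndexError on truncated input
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def multihash_sha256(data: bytes) -> bytes:
    """<hash function code><digest length><digest>."""
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256_CODE, len(digest)]) + digest


def cid_v1_bytes(codec: int, mhash: bytes) -> bytes:
    """Binary form: <version><mcodec><mhash>, with no multibase prefix."""
    return encode_varint(1) + encode_varint(codec) + mhash


def cid_v1_string(codec: int, mhash: bytes) -> str:
    """String form: <mbase> followed by the binary form in that base (hex here)."""
    return "f" + cid_v1_bytes(codec, mhash).hex()


def parse_cid_v1(text: str) -> Tuple[int, int, bytes]:
    if not text.startswith("f"):
        raise ValueError("this sketch only understands the hex multibase prefix")
    raw = bytes.fromhex(text[1:])
    version, pos = decode_varint(raw)
    codec, pos = decode_varint(raw, pos)
    return version, codec, raw[pos:]
```

Parsing reads the fields strictly left to right, so a reader that does not recognize the version can stop before touching the rest.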

Distinguishing v0 and v1 CIDs (old and new)

It is a HARD CONSTRAINT that all IPFS links continue to work. This means we need to continue to support v0 CIDs. This means IPFS APIs must accept both v0 and v1 CIDs. This section defines how to distinguish v0 from v1 CIDs.

Old v0 CIDs are strictly sha2-256 multihashes encoded in base58 -- this is because IPFS tooling only shipped with support for sha2-256. This means the binary versions are 34 bytes long (sha2-256 256 bit multihash), and that the string versions are 46 characters long (base58 encoded). This means we can recognize a v0 CID by checking that it is a sha2-256 multihash with a 256-bit digest, and that it is base58 encoded (when a string). Basically:

We can re-write old v0 CIDs into v1 CIDs, by making the elements explicit. This should be done henceforth to avoid creating more v0 CIDs. But note that many references exist in the wild, and thus we must continue supporting v0 links. In the distant future, we may remove this support after sha2 breaks.

Note we can cleanly distinguish the values, which makes it easy to support both. The code for this check is here: https://gist.github.com/jbenet/bf402718a7955bf636fb47d214bcef8a
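A hedged sketch of the v0 check and the v0-to-v1 rewrite (this is not the code from the linked gist). It relies on two facts: a sha2-256 multihash starts with the bytes 0x12 0x20, and the base58 encoding of such a 34-byte value is 46 characters beginning with "Qm". The codec value for the old protobuf format (0x70 here) is an assumption.

```python
PROTOBUF_CODEC = 0x70  # assumed packed code for the old merkledag protobuf format


def is_cid_v0_bytes(raw: bytes) -> bool:
    """34 bytes: sha2-256 code (0x12), digest length (0x20), 32-byte digest."""
    return len(raw) == 34 and raw[0] == 0x12 and raw[1] == 0x20


def is_cid_v0_string(text: str) -> bool:
    """Base58 of a sha2-256 multihash: 46 characters starting with 'Qm'."""
    return len(text) == 46 and text.startswith("Qm")


def upgrade_v0_to_v1(raw_v0: bytes) -> bytes:
    """Make the implicit fields explicit: <version=1><codec><original multihash>."""
    if not is_cid_v0_bytes(raw_v0):
        raise ValueError("not a v0 CID")
    # Both values are below 0x80, so each varint is a single byte.
    return bytes([1, PROTOBUF_CODEC]) + raw_v0
```

A binary v1 CID begins with the version byte 0x01 instead of 0x12, so the two forms cannot be confused by this check.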

IPLD supports non-CID hash links as implicit CIDv1s

Note that raw hash links stored in various data structures (e.g. Protobuf, Git, Bitcoin, Ethereum, etc.) already exist. These links -- when loaded directly as one of these data structures -- can be seen as "linking within a network" whereas proper CIDv1 IPLD links can be seen as linking "across networks" (internet of data! internet of data structures!). Supporting these existing (or even new) raw hash links as a CIDv1 can be done by noting that when one data structure links with just a raw binary link, the rest of the CIDv1 fields are implicit:

Basically, we construct the corresponding CIDv1 out of the raw hash link because all the other information is in the context of the data structure. This is very useful because it allows:

Given the above addressing changes, it is now possible to directly address and implement native support for Git, Bitcoin, Ethereum, and other authenticated data structure formats. Such native support would allow resolving through such objects, and treat them as true IPLD objects, instead of needing to wrap them in CBOR or another format. This is the proper merkle-forest. \o/
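The idea of building a CIDv1 from a raw hash link plus its surrounding context can be sketched like this. The `LinkContext` type, the git codec value 0x78, and the function name are hypothetical; the sha1 multihash code (0x11) and its 20-byte digest length are standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkContext:
    """What the enclosing data structure implies about its raw hash links."""
    codec: int       # packed multicodec of the linked-to objects (assumed value)
    hash_code: int   # multihash function code
    digest_len: int  # digest length in bytes


# Git objects link to other git objects by raw sha1 digests.
GIT_CONTEXT = LinkContext(codec=0x78, hash_code=0x11, digest_len=20)


def implicit_cid_v1(raw_hash: bytes, ctx: LinkContext) -> bytes:
    """Binary CIDv1 for a bare digest found inside a known data structure."""
    if len(raw_hash) != ctx.digest_len:
        raise ValueError("digest length does not match the context")
    mhash = bytes([ctx.hash_code, ctx.digest_len]) + raw_hash
    # Version and codec are both below 0x80 here, so each varint is one byte.
    return bytes([1, ctx.codec]) + mhash
```

Nothing in the stored object changes; the explicit fields are supplied by whoever knows which data structure the link was found in.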

IPLD addresses raw data

Given the above addressing changes, it is now possible to address raw data directly, as an IPLD node. This node is of course taken to be just a byte buffer, and devoid of links (i.e. a leaf node).

The utility of this is the ability to directly address any object via hashing external to IPLD datastructures, which is a much-requested feature.

Support for multiple binary packed formats

Contrary to existing Merkle objects (e.g. IPFS protobuf legacy, git, bitcoin, dat and others), new IPLD objects are authenticated AND self-described data blobs: each IPLD object is serialized and prefixed by a multicodec identifying its format.

Some candidate formats:

There is one strong requirement for these formats to work: a format MUST have a 1:1 mapping to the canonical IPLD serialization format. Today (July 29, 2016), that format is CBOR.

Changes to Interfaces / Specs

Need changes to:

It is a HARD CONSTRAINT that all IPFS links continue to work. In order to support both CID v0 paths (/ipfs/<mhash>) and the new CID v1 paths (/ipfs/<mbase><version><mcodec><mhash>), IPFS and other IPLD tooling will detect the version of the CID through a matching function. (See "Distinguishing v0 and v1 CIDs (old and new)" above).

The following interfaces must support both types:

jbenet commented 8 years ago

cc @davidar @mildred @nicola for quick review. the more concise you can make the review the better. A simple "thumbs up" "thumbs down" or short statement of support or disfavor will be ideal. We can start with that and see how we feel instead of lots of bikeshedding. (we already had so many hours of bikeshedding on a whiteboard...). We presume you will like it as it addresses a lot of what you've pushed for along the way.

nicola commented 8 years ago

The following https://github.com/ipld/specs/issues/7 and https://github.com/ipld/specs/issues/6 and https://github.com/ipld/specs/issues/8 are relevant for the conversation of IPLD on CBOR & the future of IPLD. Will give you my 2 cents overnight

nicola commented 8 years ago

@jbenet

It all sounds really great! Thanks for taking a big stab at this. I agree, with excitement, with most of it. The following are my concerns:

whyrusleeping commented 8 years ago

@nicola in response to your last point, Adding the multicodec to the keys allows us to address raw data. Since the reference to the object (the key or a link) tells us that the data is raw, we don't have to worry about parsing the data in any way.

nicola commented 8 years ago

@whyrusleeping so IPLD can address raw data? This means that in those cases there is only one uri mapping hash -> data and no possible traversing (say hash/part1).

Then we must find a way to differentiate objects from raw data (this difference can be implicit or discovered lazily). This is some sort of typing when linking to a new IPLD so that we can know in advance whether this is an object or a byte array

whyrusleeping commented 8 years ago

@nicola the key tells you what type of object it's pointing to.

for example:

axyQmABC

So when we request axyQmABC we get back data, and know how to handle it already.

axyQmABC and axzQmABC could point to the same exact underlying stream of bytes, but the different multicodecs inform the system how we want it to be handled.
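A toy sketch of that point: the block store is keyed by hash only, and the codec carried in the key picks the decoder. JSON stands in for a structured format here because it is in the Python standard library; the codec names and helper functions are hypothetical.

```python
import hashlib
import json

BLOCKS = {}  # digest -> bytes; the store knows nothing about formats

# Decoders selected by the codec carried in the key, not by the stored bytes.
DECODERS = {
    "raw": lambda data: data,
    "json": lambda data: json.loads(data.decode("utf-8")),
}


def put(data: bytes) -> bytes:
    digest = hashlib.sha256(data).digest()
    BLOCKS[digest] = data
    return digest


def load(codec: str, digest: bytes):
    data = BLOCKS[digest]
    # The hash authenticates the bytes; the codec came from the (trusted) link.
    if hashlib.sha256(data).digest() != digest:
        raise ValueError("block does not match its hash")
    return DECODERS[codec](data)


digest = put(b'{"a": 1}')
as_bytes = load("raw", digest)    # same bytes, returned untouched
as_object = load("json", digest)  # same bytes, parsed as a structure
```

The two keys differ only in their codec field, yet they resolve to the same stored block.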

whyrusleeping commented 8 years ago

Any updates here?

jbenet commented 8 years ago

Nobody has said "this sucks!" so i think this is good to go? we should implement go-cid and see how it goes.

well for that we need go-multibase first, and go-multicodec/packed


whyrusleeping commented 8 years ago

@jbenet do you have a spec for both of those?

robcat commented 8 years ago

Sorry if I'm late in the discussion, I am trying to understand how this proposal will affect the IPFS implementation with regard to the leaf nodes of the graph.

Under this proposal, if <mcodec>=="raw data" I can be assured that the link points to a leaf of the graph. But the reverse will not be necessarily true, because the same data can also be wrapped in an equivalent data-only IPLD object.

But it seems to me that data-only IPLD objects will not provide more value than a plain "raw data" blob. Will IPFS default to build leaf nodes using the "raw data" codec?

whyrusleeping commented 8 years ago

@robcat

Will IPFS default to build leaf nodes using the "raw data" codec?

That will be the goal, yes. When ipfs add generates graphs, it will use raw blocks as the leaf nodes.

jbenet commented 8 years ago

Worth noting that it also depends on whether people want to be able to use those raw blocks as "files" of their own. If yes, then either unixfs must be taught how to do that from a raw block, or the file structure should still exist around the raw block. if not, we then have to generate two objects per block, just the raw data, and a very small object pointing to the raw data that makes it a file. (the important thing on this is being able to aggregate the indexing data structures of files wherever). It ultimately is a discussion trading off the advantages of raw blocks, vs the expressive advantages of "everything is a file", and how to tune those two.

(btw, this is separate from the "raw blocks should be smaller than ( + overhead)", which was an issue we discussed previously. (not sure if still is a problem, @whyrusleeping ?)

kevina commented 8 years ago

If we do decide that leaf nodes are raw blocks and this information is encoded in the link, it will greatly simplify a huge part of my filestore code (ipfs/go-ipfs#875 and ipfs/go-ipfs#2634).

As likely already noted, it will also greatly speed up graph traversals since we would no longer need to read (or fetch) a large block only to discover it is a leaf node.
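A small sketch of why traversal gets cheaper: when the link itself says "raw", the walker already knows the target is a leaf and never has to fetch it to find out. The link shape `(codec, key)`, the node shape, and the `fetch` callback are all assumptions made for the example.

```python
RAW = "raw"    # hypothetical codec names for the example
NODE = "node"


def walk(link, fetch, leaves=None):
    """Collect leaf keys, fetching only blocks that can contain links."""
    leaves = [] if leaves is None else leaves
    codec, key = link
    if codec == RAW:
        leaves.append(key)  # known leaf: no fetch needed to learn it has no children
        return leaves
    for child in fetch(key)["links"]:
        walk(child, fetch, leaves)
    return leaves


# Tiny in-memory graph: root -> (leaf a, mid), mid -> (leaf b)
store = {
    "root": {"links": [(RAW, "a"), (NODE, "mid")]},
    "mid": {"links": [(RAW, "b")]},
}
fetched = []


def fetch(key):
    fetched.append(key)
    return store[key]


result = walk((NODE, "root"), fetch)
```

Only the two interior nodes are fetched; the leaf blocks are listed without being read.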

kevina commented 8 years ago

Has there been any discussion on how <mbase> will be handled? It defines how the rest of the string is to be interpreted so it can't be binary representation encoded somehow. As I see it, having the first character be a number might cause problems in some contexts (domain names? as an identifier in a programming language), and it is probably not a good idea for it to be case sensitive.

It is also worth noting that just specifying the base is insufficient as the alphabet used must also be included, and with base32 and base64 there is the annoyance of pad characters.

jbenet commented 8 years ago

@kevina check out the brand new https://github.com/multiformats/multibase -- WIP. Pls comment there.

Common alphabets will be different codes. Starting with numbers is unavoidable. These identifiers should not be variable names anyway, they should be strings or buffers.

Ericson2314 commented 8 years ago

Hmm, overall I like where this is going a lot! I think maybe a little bit more clarity on data model vs representations would be helpful. I'd also be a fan of CID being distanced from IPLD, but this is more a matter of terminology than design.

I think the most important thing to do is separate the "codecs" in links from the "codecs" in the data itself. The first encodes data model. By default, every object fits just one data model. Exceptions to this norm should be because those that define the subsuming model adopt one---for example, if and only if the Ethereum developers decide every git object is also an Ethereum object should IPFS conflate Ethereum and git objects.

The second however encodes representation/wire formats/etc. Given a data model, there may be many ways to represent it, and the node codec shows which one is used. Unlike with data models, objects aren't very deeply associated with any representation. An object may be transmitted in one format, stored in another, etc, etc.

So I'm a bit confused by the use of multicodec for both roles. Conceptually a data model is an abstract type with methods to give the hash of a value, and CIDs of a value's children. A representation however is just a pair of an encoder and a decoder between the abstract data model and some concrete format. Things like "Git", "Ethereum", "IPLD", and "BitTorrent" are candidates because they are data models using content addressing. Representations would be "Git store entry", "JSON text", "CBOR", etc. Crucially, the general case is that representations are data-model-specific; even if multiple models admit a JSON representation, an IPFS node need not care that something called "JSON" can understand either data model with the right representation. Moreover, the representations in aggregate don't overlap with the data models, unless the models named are punned for canonical representations. Finally, other things IIUC multicodec supports like PNG aren't terribly useful as a data model or representation.

Regarding terminology, I think it's best to think of IPLD as just another format referable with a CID. As it happens, an IPLD link is very similar to a CID (but optionally suffixed with a relative path), but that need not mean CID needs to special case IPLD. I hope it is indeed intentional that CIDs don't contain relative paths---not all merkle-dag-like formats associate string keys with nodes' children. IPLD should also be updated to make clear that relative paths only make sense for certain kinds of CIDs.

[It's getting late so apologies in advance for typos.]

jbenet commented 8 years ago

Status Update

CID is marching along. we've got to finish multibase, multicodec-packed, and CID specs, then implementations. So far:

KrzysiekJ commented 8 years ago

To somewhat support what @Ericson2314 wrote:

CIDs encapsulate information about codecs and do it in a compact way, which requires an additional table mapping integers to particular codecs. This allows adding Git, Bitcoin and Ethereum as special cases of codecs, which looks promising. The problem is: there are currently 728 cryptocurrencies listed on Coinmarketcap and new ones are constantly flourishing. (Yes, most of them are redundant and worthless, but still this is a highly differentiated environment). Furthermore, there are many software applications that can yield new Merkle DAG file formats; some of them may be so specific that putting their formats in a global table will be pointless (leaving aside that it may be infeasible). Having a common table of multicodec ids would also mean that to build a new Merkle application on top of IPLD one would need to register a multicodec id (if he is not willing to conform to the IPLD data format or codecs), which would add a layer of centralization.

On the other hand, in ipfs/ipfs#90 there seems to be an agreement that there is a need to support MIME types “either as an intermediate object, or in the link to them”. Information about a particular codec used seems to be easily addable to MIME type. We could say that an object is of type, for example, application/ipld+cbor, ipld/cbor or merkle/ipld;codec=cbor. An IPLD client would either know how to parse a particular MIME type and how to turn it into a Merkle DAG object, or would treat it as opaque from the forest’s point of view. This mechanism could be pluggable, so that anyone will be able to add support for his own tree format. It would:

  1. Make IPLD much more interoperable with external applications and their custom data structures.
  2. Solve the problem of lack of MIME types for raw data.
  3. Remove a dependency on a centralized multicodec table.

It may seem to be a problem that MIME types are much less compact than multicodec ids, but in many cases of tree traversing they can be specified just once: if we know that a link is representing, for example, ipld/cbor, then we can implicitly assume that links inside this object will be also in ipld/cbor (unless specified otherwise).

Of course, adding MIME types for IPLD container objects will not solve the problem of MIME types of raw data stored in those objects. This information could be perhaps encapsulated into the objects themselves. Therefore to store, for example, a small PNG image in IPLD, one could do one of the below:

  1. Store raw PNG data in IPLD and encapsulate MIME type image/png in a link to that data.
  2. Store raw PNG data in IPLD and encapsulate MIME type image/png in a link stored in an intermediate merkle/ipld object.
  3. Encapsulate raw PNG data in a merkle/ipld object which contains also the information about image/png MIME type.

daviddias commented 6 years ago

Closing this discussion on this repo. Let's use it as reference and continue any future discussion on https://github.com/ipld/cid.