ipld / specs

Content-addressed, authenticated, immutable data structures

Links v2 #83

Open mikeal opened 5 years ago

mikeal commented 5 years ago

We've been throwing around this idea for an improved Link for a little while now. In fact, the early specifications for links stated they had more features than we currently have with just CID.

Basically, we want a Link to represent a CID + Path. A few open questions:

Stebalien commented 5 years ago

Is this a new standard or is this CIDv2?

I'd say it's a new thing. Really, it's just a binary format for /ipld/Qm.../a/b/c.
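
As a rough illustration of "CID + Path", the sketch below splits an /ipld/... string into the two parts such a link would carry. The function names and the `exampleCid` placeholder are hypothetical and do not come from any spec; a real link would hold a binary CID rather than a string.

```javascript
// Hypothetical helpers, for illustration only: none of these names come
// from the IPLD specs. 'exampleCid' is a placeholder string, not a real CID.
function parseIpldPath (str) {
  const parts = str.split('/').filter(s => s.length > 0)
  if (parts.length < 2 || parts[0] !== 'ipld') {
    throw new Error('expected a path of the form /ipld/<cid>/...')
  }
  // A Link v2 would carry exactly these two pieces: the CID and the path.
  return { cid: parts[1], path: parts.slice(2) }
}

// Inverse of parseIpldPath: rebuild the string form from the two pieces.
function formatIpldPath (link) {
  return '/' + ['ipld', link.cid, ...link.path].join('/')
}

const link = parseIpldPath('/ipld/exampleCid/a/b/c')
console.log(link.cid, link.path)
console.log(formatIpldPath(link))
```

A binary encoding would serialize the same two fields; only the string form is shown here because the thread does not define the byte layout.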

What do we do about collections (like HAMT) where the string path elements are different from the block property implementation?

Really, we can:

  1. Layer it: For now, we can define a new path namespace for IPLD+ (IPLD data model). This namespace would transparently handle things like HAMTs. In the future, we'd introduce this at the type-system level.
  2. Build in sharding: That is, define some codecs that span multiple blocks. We'd have to add a new codec for HAMTs.

Given prior discussion on sharding and the decision (I think?) to punt sharding out of IPLD proper (https://github.com/ipfs/notes/issues/76), I'd rather go with 1.


If we are waiting on that for the next version of the Path spec should we wait to specify this new Link spec until the new Path spec lands?

My primary motivation here is being able to point to an object inside a block. Given that motivation, we don't really need support for higher-level paths at the link level. That is, we just need to resolve the higher-level path to an IPLD path.

IMO, we should keep these as simple as possible.

vmx commented 5 years ago

1. Layer it: For now, we can define a new path namespace for IPLD+ (IPLD data model). This namespace would transparently handle things like HAMTs. In the future, we'd introduce this at the type-system level.

Do you mean something like resolve('/hamt/<cid>/deep/path') and then the resolver will take care of it? If that's a yes, I'm in favour of that.

My primary motivation here is being able to point to an object inside a block.

What would a valid resolve for such a new link look like? Given these blocks:

<CID:main>
{
  deeplink: <CID:subcontent>/a/deep
}

<CID:subcontent>
{
  a: {
    deep: {
      thing: 'something'
    }
  }
}

In order to get 'something', would you do

resolve('/<CID:main>/deeplink/a/deep/thing')

or

resolve('/<CID:main>/deeplink/thing')

?

I forgot to /cc @achingbrain who might also have ideas around that.

achingbrain commented 5 years ago

I think having it at the resolver level would be better (e.g. a dedicated HAMT codec) so it is transparent to the application.

When the application (primarily thinking IPFS here) gets given a path, it doesn't know if it's going to traverse any HAMT shards or not, so has to inspect the nodes IPLD returns segment-by-segment to see what type they are and take action accordingly.

If the IPLD resolver is traversing the graph to resolve a path anyway, to me it makes more sense to resolve any HAMT shards or other esoteric data structures encountered along the way instead of forcing the application to do lots of small traversals and to support HAMT/whatever other data structures we decide to introduce.

We do something like this in js-ipfs-unixfs-exporter at the moment, but the HAMT traversal stuff is not really UnixFS-specific so could be done further down the stack.

The application can then treat IPLD and IPFS paths as synonymous. We came across this problem recently in ipfs/js-ipfs-unixfs-exporter#1 (comment).

Stebalien commented 5 years ago

Do you mean something like resolve('/hamt/<cid>/deep/path') and then the resolver will take care of it? If that's a yes, I'm in favour of that.

Yes (although we might want to think up a better name than "hamt").

What would a valid resolve for such a new link look like? Given these blocks:

The latter. That is, <CID:subcontent>/a/deep points to {thing: 'something'}.
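
A minimal sketch of that behaviour, assuming a toy in-memory block store keyed by placeholder CID strings and a hypothetical `'@link'` marker for a CID + path link (neither is a real IPLD encoding). The link's own path is consumed when the link is followed, so the caller's path continues from the object the link points to.

```javascript
// Toy block store. 'main' and 'subcontent' stand in for <CID:main> and
// <CID:subcontent>; the '@link' marker is an assumption made for this sketch.
const blocks = {
  main: { deeplink: { '@link': { cid: 'subcontent', path: ['a', 'deep'] } } },
  subcontent: { a: { deep: { thing: 'something' } } }
}

function isLink (v) {
  return v !== null && typeof v === 'object' && '@link' in v
}

function walk (node, segments) {
  // Follow a link first, resolving the path stored inside the link itself.
  while (isLink(node)) {
    const { cid, path } = node['@link']
    node = walk(blocks[cid], path)
  }
  if (segments.length === 0) return node
  const [head, ...rest] = segments
  if (node === null || typeof node !== 'object' || !(head in node)) {
    throw new Error('cannot resolve segment: ' + head)
  }
  return walk(node[head], rest)
}

function resolve (path) {
  const [cid, ...segments] = path.split('/').filter(Boolean)
  return walk(blocks[cid], segments)
}

console.log(resolve('/main/deeplink/thing'))
```

With this store, resolve('/main/deeplink/a/deep/thing') throws, because the link already points at the object under a/deep.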


I think having it at the resolver level would be better (e.g. a dedicated HAMT codec) so it is transparent to the application.

I agree that making this all transparent to the application is important (unless the application wants to see through the abstraction). The question is: should we do this in two layers or one. The two layer approach wouldn't try to force applications to handle this; instead, applications would use this new layer instead of using IPLD directly.

There are really two (not insurmountable) issues with a dedicated HAMT codec:

  1. Introspection and deserialization to JSON becomes difficult to impossible unless we completely decouple the IPLD path system from the data's structure. Currently, each block contains an atomic "patch" of the IPLD graph. If we introduce HAMTs as a first-class thing, a single HAMT block won't stand alone.
  2. It uses codecs for "types". The hope was to make DagCBOR "the format to rule them all". That is, the general-purpose codec to encode any data structure we'd like.

For me, 1 makes it clear that there really are two layers. There's a layer where each block can stand alone (with some internal structure) and a layer that can seamlessly shard a large object across multiple blocks. If we try to merge these into a single layer, we lose the ability to understand individual blocks.
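
The two layers can be sketched roughly as follows. This is a toy, not a real HAMT: keys are assigned to shards by a character-code sum, and the `$type` marker used to recognise a sharded map is a hypothetical stand-in for whatever type signal layer 2 would end up using.

```javascript
// Toy store: every block stands alone at layer 1. All names are illustrative.
const store = {
  root: { files: { $type: 'toy-shard-map', shards: ['shard0', 'shard1'] } },
  shard0: { b: { size: 2 } }, // 'b' has char code 98, and 98 % 2 === 0
  shard1: { a: { size: 1 } }  // 'a' has char code 97, and 97 % 2 === 1
}

// One property step inside a single node.
function step (node, seg) {
  if (node === null || typeof node !== 'object' || !(seg in node)) {
    throw new Error('no such property: ' + seg)
  }
  return node[seg]
}

// Layer 1: paths are strictly node properties within one block.
function layer1 (cid, segments) {
  if (!(cid in store)) throw new Error('unknown block: ' + cid)
  return segments.reduce(step, store[cid])
}

function shardIndex (key, count) {
  let sum = 0
  for (const ch of key) sum += ch.charCodeAt(0)
  return sum % count
}

// Layer 2: same walk, but a sharded map is treated as one logical map.
function layer2 (cid, segments) {
  let node = layer1(cid, [])
  for (const seg of segments) {
    if (node !== null && typeof node === 'object' && node.$type === 'toy-shard-map') {
      const shardCid = node.shards[shardIndex(seg, node.shards.length)]
      node = layer1(shardCid, [seg])
    } else {
      node = step(node, seg)
    }
  }
  return node
}

console.log(layer2('root', ['files', 'a', 'size']))
console.log(layer1('root', ['files', 'shards']))
```

Layer 1 still exposes the raw block structure (the `shards` list), while layer 2 hides it; layer1('root', ['files', 'a']) throws because no single block has that property.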

However, this brings us to 2 and the need for types. If we do use a HAMT codec, we have a clear way to see if something is a HAMT or something else. Without that, we need some other way to tell if something's a HAMT (in layer 2).

vmx commented 5 years ago

The question is: should we do this in two layers or one.

I would do two layers. One reason for that is graph replication (a.k.a. Graphsync). I want it to live within the IPLD Resolver implementation. So you would have the Bitswap IPLD Resolver we have today, which only understands blocks. We would then have some next-generation IPLD Resolver which can also retrieve several blocks at the same time.

So you swap out the IPLD Resolvers depending on your needs/new development. Hence I'd like to keep that part as simple as possible. Things like HAMT should then be built on top, so that they can leverage new kinds of IPLD Resolvers without changing their code.

mikeal commented 5 years ago

We haven't spent a lot of time on the Path spec, but breaking it into layers sounds like a good plan: Layer 1 would be strictly node properties for paths, and Layer 2 would include some collection implementations, like HAMT. Can we please stop calling them codecs, though? We have a thing called a codec already :)

Links v2 can point to Path Layer 2 for the path resolution of these specific collections.

Stebalien commented 5 years ago

So you swap out the IPLD Resolvers depending on your needs/new development.

I just want to make it clear that, no matter what we do, /ipld/Qm../a/b/c must mean the same thing every time. Really, I'd rather call these "Path Resolvers" instead of "IPLD Resolvers".
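
One way to picture this: the path resolver takes the block getter as a parameter, so swapping the retrieval backend cannot change what a path means. The sketch below is an assumption-laden toy; the `link` field, the placeholder CIDs, and both getters are hypothetical, and the batched getter only simulates fetching several blocks at once (it is not the Graphsync protocol).

```javascript
// Toy "network" of blocks; { link: '<cid>' } is a hypothetical link marker.
const network = {
  cidA: { a: { link: 'cidB' } },
  cidB: { b: { link: 'cidC' } },
  cidC: { c: 'leaf' }
}

const isLink = v => v !== null && typeof v === 'object' && typeof v.link === 'string'

// Backend 1: one round trip per block.
function makeOneByOneGetter () {
  const stats = { roundTrips: 0 }
  const get = cid => { stats.roundTrips++; return network[cid] }
  return { get, stats }
}

function collectLinks (node, out) {
  if (node === null || typeof node !== 'object') return
  if (isLink(node)) { out.push(node.link); return }
  for (const v of Object.values(node)) collectLinks(v, out)
}

// Backend 2: on a cache miss, pull the block and everything reachable from it.
function makeBatchedGetter () {
  const stats = { roundTrips: 0 }
  const cache = new Map()
  const get = cid => {
    if (!cache.has(cid)) {
      stats.roundTrips++
      const queue = [cid]
      while (queue.length > 0) {
        const next = queue.pop()
        if (cache.has(next) || !(next in network)) continue
        cache.set(next, network[next])
        collectLinks(network[next], queue)
      }
    }
    return cache.get(cid)
  }
  return { get, stats }
}

// The path resolver is the same no matter which getter it is given.
function resolvePath (getBlock, path) {
  const [ns, cid, ...segments] = path.split('/').filter(Boolean)
  if (ns !== 'ipld') throw new Error('expected an /ipld/ path')
  let node = getBlock(cid)
  for (const seg of segments) {
    if (node === null || typeof node !== 'object' || !(seg in node)) {
      throw new Error('cannot resolve segment: ' + seg)
    }
    node = node[seg]
    if (isLink(node)) node = getBlock(node.link)
  }
  return node
}

const oneByOne = makeOneByOneGetter()
const batched = makeBatchedGetter()
console.log(resolvePath(oneByOne.get, '/ipld/cidA/a/b/c'), oneByOne.stats)
console.log(resolvePath(batched.get, '/ipld/cidA/a/b/c'), batched.stats)
```

Both calls return the same value; only the number of simulated round trips differs.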

Can we please stop calling them codecs though, we have a thing called a codec already

Good point. We should probably start calling the multicodec code a "code", not a "codec", as well.