ipld / interface-ipld-format

A interface you can follow to implement a valid IPLD format, resolvable through the IPLD Resolver (available in IPFS)
Other
29 stars 5 forks source link

New API proposal #53

Open Gozala opened 5 years ago

Gozala commented 5 years ago

I've being asked to provide feedback on API changes / proposals & sadly that usually means I go and challenge peoples efforts. I think it might be better / only fare to propose API myself and let others challenge it instead. So here is what IMO would be a best API:

Using flow syntax as it removes ambiguity

I would also like to use this opportunity to challenge naming conventions as I have following issues with it:

We'll use opaque type alias BinaryEncoded<a> to refer to binary encoded data of type a.

export opaque type BinaryEncoded<data>:Uint8Array = Uint8Array

We'll also use CID<a> interface to refer to CIDs for data of type a.

export interface CID<a> {
  toString(): string;
  toBaseEncodedString(base?: string): string;
  buffer: BinaryEncoded<a>;
  equals(mixed): boolean;
}

With that out of the way I would propose that valid IPLD Codec be an implementation that complies with a IPLDCodec<a> interface as defined below:

interface IPLDCodec<a> {
  // https://github.com/multiformats/multicodec/blob/master/table.csv
  format:Multihash;
  defaultHashAlgorithm:HashAlgorithm;

  encode(a):BinaryEncoded<a>|Promise<BinaryEncoded<a>>;
  decode(BinaryEncoded<a>):a|Promise<a>;

   links <b>(BinaryEncoded<a>):AsyncIterable<CID<b>>;
   paths <b>(BinaryEncoded<a>):AsyncIterable<string>;
   get <b>(BinaryEncoded<a>, path):IPLDEntry<b>|Promise<IPLDEntry<b>>;
}

type IPLDEntry<b> =
  | IPLDInlineEntry<b>
  | IPLDLinkEntry<b>

type IPLDLinkEntry<b> = { type:"link", target:CID<b>, path:string }
type IPLDInlineEntry<b> = { type:"inline", data:b }

Additional notes:

  1. I have removed cid function, I am convinced that it's a historical artifact. All the implementations I've seen use provided hashAlgoritm (or default if not provided) to create multihash of the encoded data, and then use passed version & multicodec format to instantiate CID.

    In other words I do not see how codec could do anything else, maybe use different serialization method ? Seems unlikely, but if so maybe instead implementation should be required to provide that instead. Another thing I can imagine is a different way to hash ? If so, again should probably codec should just be required to provide optional method for that otherwise cid is method requirement is nothing but a way to introduce bugs.

  2. Anything returning Promise could also return value that it would resolve to. Consumer still can await without any impact & is also free to avoid await when relevant. From implementer perspective there is no need to wrap anything in promise.

  3. This incorporates links API proposed #52

  4. tree is replaced by paths and is scoped to immediate paths, meaning it's not codec implement's job to follow links etc... Generalized tree traversal can be implemented outside and isn't codec implementer concern.

  5. get(blob, path) returns (eventual) "entry" that is either some inline data that was encoded into the blob, or a link with-in the other IPLD node. Note that link has path that corresponds to path to the data with-in that node link is targeting.

  6. Different value for entry type is used ("link", "inline") that enables type checkers (flow, typescript) to refine type with in the union and in general makes it obvious how one figures what kind of entry is back (what if data is null ? what if for some reason returned entry also has being passed a target feild ?)

  7. @mikeal was suggesting in #52 to embrace IPLD Block API which is I'm all up for, however I don't think it should be made codec implementer concern. API proposed in #52 can trivially be layered on top of proposed API and be agnostic of codec implementation.

Concerns

I have brought up in the other thread https://github.com/ipld/ipld/issues/64#issuecomment-473442520 the fact that support for passing parameters should be a fundamental design concern to ensure that things like IPLD Explorer can be just as useful with parametric IPLD Nodes. That however (at least as proposed) conflicts with links(a) which assumes links can be revealed without any parameters. For that reason I would prefer to omit links and replace paths with something like list (Edit: After thinking bit more I don't think list as proposed below is even going help much, I'll get back on this once I'll have more time to think through):

--- a/Users/gozala/Desktop/before.js
+++ b/Users/gozala/Desktop/after.js
@@ -6,8 +6,7 @@ interface IPLDCodec<a> {
   encode(a): BinaryEncoded<a> | Promise<BinaryEncoded<a>>;
   decode(BinaryEncoded<a>): a | Promise<a>;

-  links<b>(BinaryEncoded<a>): AsyncIterable<CID<b>>;
-  paths<b>(BinaryEncoded<a>): AsyncIterable<string>;
+  list<b>(BinaryEncoded<a>): AsyncIterable<{ isLink: boolean, path: string }>;
   get<b>(BinaryEncoded<a>, path): IPLDEntry<b> | Promise<IPLDEntry<b>>;
 }
rvagg commented 5 years ago

This seems to be roughly the same as @mikeal's proposal except for naming changes and having everything as a static method exposed by the codec/format.

I'd like to see some use-cases from @mikeal but I'm suspecting that the ergonimics of passing around a Reader instance might be much nicer than having to have both the opaque BinaryEncoded<a> as well as the format/codec itself. Maybe a better name than Reader could be found though because it's essentially making a BinaryEncodedWrapper.

function doSomethingWithBlock(blockReader) vs function doSomethingWithBlock(codec, block) - the two items are (presumably) inseparable, one doesn't make sense without the other, so tying them together might be both safer and make it less clunky to use.

Regarding naming: I agree on codec (although the distinction seems less obvious than you suggest, they're all referring to formats still) and on encode / decode but mainly because I'm a non-American English speaker and the z's always irk me. I think @mikeal's remaining makes more sense than path in your IPLDLinkEntry<b>, however. It needs little or no clarification.

Gozala commented 5 years ago

This seems to be roughly the same as @mikeal's proposal except for naming changes and having everything as a static method exposed by the codec/format.

Yes, except this does not operates on IPLDBlock which was one of the major changes in there

Gozala commented 5 years ago

I'd like to see some use-cases from @mikeal but I'm suspecting that the ergonimics of passing around a Reader instance might be much nicer than having to have both the opaque BinaryEncoded as well as the format/codec itself. Maybe a better name than Reader could be found though because it's essentially making a BinaryEncodedWrapper.

I am sure there are cases where passing something like Reader is both convenient and necessary, that however doesn’t mean it should be part of core API. In fact making that separate layer would allow generelaize Reader freeing codec implemetera from dealing with it and potential bugs

Gozala commented 5 years ago

I think @mikeal's remaining makes more sense than path in your IPLDLinkEntry<b>, however. It needs little or no clarification.

Depends on the context IMO. If you think about link target node + path makes a lot of sense. Just like in URL host + path makes sense.

If you think about trying to resolve something but not quite making it and there for having remaining piece left probably remaining makes more sense. I don’t like that analogy however, I think it’s awkward way to think about it

I should have made that more clear though

vmx commented 5 years ago

I prefer "remaining", as "path" to me is to ambiguous it's not clear to me if it's the path to the CID or the path to the value after following the CID ("path to the CID" doesn't make sense when you think about it, I would find it non-intuitive though).

I'm fine with "encoding/decoding".

On my TODO list is still a write-up about the different IPLD layers. I need to put more thought into this, but here's the rough idea. Currently we have IPLD Formats and JS-IPLD, but we need more. I like this proposal, this would be about the lowest level where there's only binary data and the IPLD Data Model. One layer on top could then be more about what @mikeal's proposal (https://github.com/ipld/interface-ipld-format/issues/52) is about, where it is about Blocks, hence also about CIDs. The next layer would then be about IPLD Format agnostic APIs, you could e.g. call decode() on a Block and it will automatically use the right format etc. The top layer would then also be about storing and retrieving things.

Gozala commented 5 years ago

I prefer "remaining", as "path" to me is to ambiguous it's not clear to me if it's the path to the CID or the path to the value after following the CID ("path to the CID" doesn't make sense when you think about it, I would find it non-intuitive though).

But it’s not the path to CID it’s path to whatever link is so if you have a link with shape {target, path} it’s pointer to a values at the given target and path.

I also would like to extend that further with inline blocks (if we get there) where link target could also be an inline block rather thatn CID. It also would be very composable with general access. Right now resolver needs to manually traverse decoded object, with all the proposed pieces you could instead return link with in the data structure and let other resolver take over {target: new Block(decodedData, "dag-json"), path: remainingPath) }

Gozala commented 5 years ago

I think “remaining” makes sence if you’re thinking about in terms of reaching a destination. But if you think of link to something, or a poiner then “remaining” starts to sound little confusing.

Maybe {target, at} would be a more clear than either of two regardless of the way you think about it ?

mikeal commented 5 years ago

serialize is awfully long and easy to misspell, encode is shorter and lot harder to get wrong. deserialize is even longer, decode isn't.

+1, I’ll probably go and update my proposal with this as well.

format - Is the way in which something is arranged or set out, in other words it corresponds to how and not what. codec on the other hands is what that knows how to encode / decode to a specific format. There for I would argue that codec is much better term to use as it represents implementation that can encode data into binary of specific format and decode binary in that format back to the data. It also make more sense for codec to provide operations named encode / decode than any other term.

Not too long ago, we had a variety of terms being used to describe the codec, format being one of them. We also had format and codec being used to describe other things which only piled onto the confusion.

I ended up writing up our terminology in the spec repo to try and standardize the language we use.

I went with codec for consistency, as the multicodec project already had their naming locked in and I wasn’t going to be able to change it, and our “codecs” are in-fact references to registered codecs in multicodec.

I then specified the term “format” in our terminology to refer to the existing format we’re leveraging, as that was the thing we were most often getting confused in the terminology.

dag-json is a “codec” that uses the json “format.”

I’m sympathetic to your rational for wanting to see this somewhat reversed but I think it would end up causing confusion given the dependency we have on multicodec 😢

mikeal commented 5 years ago

One last thing to note, the encode interface should be optional. We have codecs that we need to read but will probably never need to write with this interface (git, eth, and bitcoin blocks for instance). In fact, there’s a whole world of content addressed data that we want to be able to address but never modify or author. This is why we wrote the “Data Model” spec, to distinguish between the codecs we’re using for building all of these higher level data-structures and those that we only want to be able to read and link into.

Gozala commented 5 years ago

One last thing to note, the encode interface should be optional. We have codecs that we need to read but will probably never need to write with this interface (git, eth, and bitcoin blocks for instance). In fact, there’s a whole world of content addressed data that we want to be able to address but never modify or author. This is why we wrote the “Data Model” spec, to distinguish between the codecs we’re using for building all of these higher level data-structures and those that we only want to be able to read and link into.

That's fare, but should not such codec just throw exception from encode ?

Gozala commented 5 years ago

dag-json is a “codec” that uses the json “format.”

Wait is this a desired terminology or undesired ? I feel like this exactly the right terminology which I see as aligned with proposed naming no ?

Gozala commented 5 years ago

I went with codec for consistency, as the multicodec project already had their naming locked in and I wasn’t going to be able to change it, and our “codecs” are in-fact references to registered codecs in multicodec.

I'm failing to see conflict here.

Codec encodes data into specific format. Multicodec tags encoded data with corresponding codec identifier to describe it's format.

I think you register codecs with corresponding codec identifier.

Could just terminology be refined to use "codec identifier" where codec is being used now and "codec" used in place where format is today ?

mikeal commented 5 years ago

Wait is this a desired terminology or undesired ?

It’s the current terminology :)

vmx commented 5 years ago

For me a Format implements a Codec.

mikeal commented 5 years ago

Several of these ideas ended up in https://github.com/ipld/js-ipld-stack.