multiformats / js-cid

CID implementation in JavaScript
MIT License
97 stars 39 forks source link

Provide non-generic constructor methods #83

Open Gozala opened 5 years ago

Gozala commented 5 years ago

I would very much like to make construction of CID's less generic as it really reduces cognitive overhead (saves you from side-trips) when reading code that does new CID(x) because from the context x can be many different things.

Which that goal in mind I'd like to propose adding distinct distinct static methods:

Ideally CID() then could be made non generic CID(Version, Codec, Multihash, name:MultibaseName='base58btc') I would even go and make last param mandatory. If really desired generic version could also be added: CID.from(CID|BaseEncodedString|Buffer|SerializedCID):CID<a> (although I'd prefer not to)

I do understand that maybe changing constructor API might be too big of a breaking change so I propose to gradually fade it out. Maybe initially CID.create(Version, Codec, Multihash, MultibaseName) could be added instead and slowly migrate users to new ways until it's safe to make a switch.

mikeal commented 5 years ago

CID.fromCID(cid) - Does that even makes sense ? Should there maybe a clone() method instead ?

+1 on a .clone(cid) method.

CID.fromJSON(SerializedCID)

We don’t have a canonical JSON representation of a CID anymore, so I’m not quite sure what this means. CID’s are serialized differently in different codecs.

CID.fromBaseEncodedString(BaseEncodedString) or maybe CID.parse(BaseEncodedString) CID.fromBuffer(Buffer) or better yet CID.decode(Buffer)

These seem to be more motivated by type system tooling preference than JS ergonomics. I’ve used this library quite a bit and being able to think about a serialized CID as either a string or buffer and not really caring which has been quite useful.

—-

Somewhat related, I’d like a method that I can pass a string/buffer and it will tell me if it’s a valid CID or not.

Gozala commented 5 years ago

We don’t have a canonical JSON representation of a CID anymore, so I’m not quite sure what this means. CID’s are serialized differently in different codecs.

I'm lead to believe otherwise based on the source

https://github.com/multiformats/js-cid/blob/63cd5f3dd87d8aa1549ef7b40093f07ab5e854f6/src/index.js#L10-L15 https://github.com/multiformats/js-cid/blob/63cd5f3dd87d8aa1549ef7b40093f07ab5e854f6/src/index.js#L250-L261

Gozala commented 5 years ago

These seem to be more motivated by type system tooling preference than JS ergonomics. I’ve used this library quite a bit and being able to think about a serialized CID as either a string or buffer and not really caring which has been quite useful.

Type checker can actually figure this out much better than when I do read the code, as I mentioned when I'm reading a code and see e.g. new CID(encodedCID) there is not enough context for me to tell if encodedCID is a string here or a Buffer and if I need to know that to make changes I'm forced to do a "cognitive inference" by figuring out what has being passed from where to figure that out.

That is why I'm biased toward monophonic functions, in fact if in my editor with type-checker I usually can hover the variable to know the type. However I really wish just reading source on github could provide enough context.

mikeal commented 5 years ago

I'm lead to believe otherwise based on the source

Hrm... I wonder what this is there for, cause this isn’t even the old canonical serialization :(

Gozala commented 5 years ago

Just noticed there's open issue on toJSON #57 method. Anyway my intent in regards toCID.fromJSON was to make following true: cid.equals(CID.fromJSON(cid.toJSON()))

vmx commented 5 years ago

CID.fromBaseEncodedString(BaseEncodedString) or maybe CID.parse(BaseEncodedString) CID.fromBuffer(Buffer) or better yet CID.decode(Buffer)

These seem to be more motivated by type system tooling preference than JS ergonomics. I’ve used this library quite a bit and being able to think about a serialized CID as either a string or buffer and not really caring which has been quite useful.

I'm also unsure. I still think types have huge benefits, but I also don't want to scare the JS crowd away.

Though the usage of CID throught the IPLD/IPFS code base is IMHO a special case. In the current code base you can pass around CIDs in almost any type you like (Buffers, string, CID) and so many methods need to check that. Instead I think we should change our APIs to mostly work with CID instances only. Only the high level public interfaces should take other types and covert them into a CID as soon as possible.

Anyway. The current constructor code is indeed quite strange. I currently lean towards having those typed methods, but also the CID.from(CID|BaseEncodedString|Buffer|SerializedCID):CID<a>, which would be similar to Node.js' Buffer API.

As CID is used all over the place, I'd like to get feedback from the whole @multiformats/javascript-team.

hugomrdias commented 5 years ago

i dont have strong feeling regarding typed methods or a buffer like CID.from but i think we should only have one or the other not both.

Regarding type checkers im in favour of typescript with jsdoc (maybe some type definitions), full ts or flow (flow doesn't feel right after seing FB react and jest teams drop flow for ts) are too much of an overhead.

olizilla commented 5 years ago

The js-cid constructor is gross, but it is useful when used against an existing codebase that deals with the string form, buffer form, and CID object form interchangeably. I'm with @vmx here that we should convert input to a CID instance at the boundaries, and then deal only with CID instances after that. That being the case, a CID becomes how we normalise the users input... so then it'd be rad if we can make that nice for both vanilla JS and the type fans.

achingbrain commented 5 years ago

I'm very much a fan of the Robustness Principal here - we should continue to provide an API that accepts strings, buffers, etc (either the constructor or the proposed CID.from(thing) method) and not push dealing with that complexity onto the user - who may not be using or Typescript or even know what that is (think Protoschool).

Re: adding loads of typed methods: @mikeal is right - expanding the API surface area just to make type-related tooling happier seems like the tail wagging the dog.

vmx commented 5 years ago

Regarding type checkers im in favour of typescript with jsdoc (maybe some type definitions).

I'm not a fan of JSDocs (I thought I would, but I'm not). flow is a more powerful type-checker than TypeSCript is (as this is what flow was specicially built for). Hence I prefer flow. I'm after proper types, not IDE/auto-completion support. For supporting IDEs TypeScript JSDocs would make more sense.

terichadbourne commented 5 years ago

Although I know little about the internals of IPFS/IPLD/CID/Multiformats, I'll just second @achingbrain on making things as simple as possible for the user. (Also, keeping conventions consistent between these various projects would be helpful, if that's a consideration.) The ProtoSchool lessons got much easier when we stopped having to use .toBaseEncodedString in them, and as a newcomer I'd hope that I could just pass any acceptable kind of data into the same method and get a predictable result, rather than having to pick a different method to use on each kind of data format.

Gozala commented 5 years ago

I think there are lot of assumptions here made that this proposal is to make types easier or that it would make people’s life harder - Truth is, it’s quite the opposite when I have type-checker at hand it can infer types and tell me so I don’t have to guess what is x when I see new CID(x), but if I don’t I have to do that inference by digging through caller passing x and if it’s not clear than go further in the call-chain & maybe forget what I was doing in first place.

This is not a made-up story, that is my personal experience and a reason I end up doing first iteration on typedefs so I did not had to do these side-trips to make sense of the lot of new code.

So while I agree for the writer it’s easier to write CID.from(x) and not think about it, for every reader including the author it’s a lot helpful to read CID.fromBuffer(x) and not have to go around code to figure what x is.

I also agree with @vmx ideally CID instances should be passed around & user should rarely if ever have to construct them this way, more like block.cid(), which is to suggest it’s odd to be optimizing API for writers while most of people will be readers

achingbrain commented 5 years ago

Not sure I agree - if CID.from(x) does the work figuring out what x is, then the reader doesn't have to care what x is either, only that a CID instance is returned from CID.from(x).

Gozala commented 5 years ago

@achingbrain you may disagree, but as I said it has being my struggle with IPFS - my brain might be wired differently from yours, but I often need to understand what are the things being passed around to make changes to them.

olizilla commented 5 years ago

We should re-focus what @Gozala is specifically proposing. I've nothing against the propsal to add extra static factory methods for each of the ways you can construct a CID. I was going to decompose the constructor that way in #77, but decided against it to keep the size ot the PR down.

The idea of removing the generic constructor that takes anything is proposed as a "ideally" which i read as "nice to have", and that it "may be too much of a breaking change.".

mikeal commented 5 years ago

Can we just call it fromString()? We don’t accept strings that aren’t multibase encoded so...

Gozala commented 5 years ago

Can we just call it fromString()? We don’t accept strings that aren’t multibase encoded so...

I just wanted to make method names correspond to their opposites in this case toBaseEncodedString(). That's not argument against fromString() I'm all in favor of it, in fact I would just make toBaseEncodedString be a toString (which in fact just calls prior method).

rvagg commented 5 years ago

Yes please! toBaseEncodedString() doesn't feel at home in an ecosystem where toString() regularly takes modifier arguments (Buffer is the main one that comes to mind).

mikeal commented 5 years ago

We actually alias toString() to toBaseEncodedString() https://github.com/multiformats/js-cid/blob/master/src/index.js#L246

I didn’t know this until I looked through the code, we should probably document this better and make it the recommended API.

Gozala commented 4 years ago

I would like to reboot this conversation! In the light of CID as ArrayBufferView currently overloaded CID constructor is even more problematic. I wrote a pull request against js-multiformats to illustrate CID as ArrayBufferView concept and in the process end up decomposing constructor into several methods, which I think not only fits better but also makes API cleaner, reduces margin for error and reduces complexity. Based on the I would like to reboot this proposal by incorporating feedback in this thread.

Proposed CID interface

declare class CID {
   // Just like all other ArrayBuffer views it can take underlying buffer and optional range within it
   // to construct a CID view.
   constructor(buffer:ArrayBuffer, byteOffset:number=0, byteLength:number=buffer.byteLength)
   // Just like other typed arrays you can create CID view from bytes in the typed array.
   constructor(ArrayBufferView)

   // Just like with other typed arrays (and some other JS built-ins) you can pass CID in any other representation
   // to construct an instance.
   static from(string|ArrayBuffer|ArrayBufferView|CID):CID

   // You can also create CID out of underlying data. Think of it as `Object.create` providing lower level API instead of overloading `Object` constructor.
   static create(version:number, code:number, hash:Uint8Array):CID
}

Proposed API tries to disambiguate between:

  1. Constructing a CID view of the binary representation.
  2. Constructing a CID from other representations. Note that unlike constructor it will copy bytes just like node Buffer or other typed arrays do. Constructor on the other hand will just create a CID view of the bytes.
  3. Construct a CID from known pieces. I suspect this might be controversial, so I want to call out that it has nothing to do with types. It's just
    1. Constructors concern here is to creates a view (and not underlying data)
    2. Overloading from to take multiple arguments seems wrong & inconsistent with patter used by built-ins.
      • CID.from({ version, code, multihash }) might be a reasonable option though.
    3. It seems to be consistent with Object.create.

It may make more sense to have CID.of(version, code, multihash) instead of CID.create, but it feels like a gray are in terms of consistency with built-ins (at least in my view).

rvagg commented 4 years ago

I don't buy the arguments you have for retaining constructors, making this thing feel like an ArrayBuffer would probably mean we've failed to make it a nice abstraction. Just because it might be built as an abstraction over ArrayBuffer doesn't mean we need to expose that to users, there's not a good reason to leak that upward. So if you put that aside, you run out of reasons to have constructors and static factory functions. So I'd just ditch constructors entirely, or make a single-use constructor of some kind that's justifiable. I'd be more than happy with static factory methods that were explicit about what they do.

I'm a big +1 wherever we can reduce argument overloading. It's one of the biggest curses that jQuery gave JavaScript culture that we still haven't shaken and it keeps on leading to APIs that cause far more harm than the convenience they supposedly offer.

Would this work?

declare class CID {
   static from(string|ArrayBuffer|ArrayBufferView|CID):CID
   static from(buffer:ArrayBuffer, byteOffset:number=0, byteLength:number=buffer.byteLength):CID
   static create(version:number, code:number, hash:Uint8Array):CID
}

It's a big jump from js-cid, a simple one for js-multiformats in its early stage though.

CID.from(cid) could presumably serve as the formerly proposed CID.asCID(cid) wouldn't it? Just a return cid where it's deemed that the passed argument is what it needs to be, otherwise instantiate a new instance using what can be derived from it. Or would we need a form that lets us perform a full clone?

Gozala commented 4 years ago

I don't buy the arguments you have for retaining constructors, making this thing feel like an ArrayBuffer would probably mean we've failed to make it a nice abstraction.

That is not the argument I was making. I was advocating to have a single purpose for a constructor. And I imagine that to be mostly used internally or something very low level. I imagine mostly uses through factory functions.

Just because it might be built as an abstraction over ArrayBuffer doesn't mean we need to expose that to users, there's not a good reason to leak that upward.

I think where we might be disagreeing is that I would like to portrait CID as a view over some byte range (it just happens to be the case with all the typed arrays). I find to be a better model than presenting CID as something you decode from bytes. With such a viewpoint it's not a leaky abstraction, but a deliberate emphasis.

If you have an argument against that view point, I'd like to consider that.

So if you put that aside, you run out of reasons to have constructors and static factory functions. So I'd just ditch constructors entirely, or make a single-use constructor of some kind that's justifiable. I'd be more than happy with static factory methods that were explicit about what they do.

I am not sure what exactly you are suggesting. I don't think you can ditch constructors entirely. You could probably throw and use Object.create(...) in the factory but that:

  1. JS engines aren't able to optimize that nearly as well as constructors.
  2. If all that constructor does is maps args to the instance fields, I'm not sure what would be the argument against it.

I'm a big +1 wherever we can reduce argument overloading. It's one of the biggest curses that jQuery gave JavaScript culture that we still haven't shaken and it keeps on leading to APIs that cause far more harm than the convenience they supposedly offer.

I could not agree more!

Would this work?


 declare class CID {
   static from(string|ArrayBuffer|ArrayBufferView|CID):CID
   static from(buffer:ArrayBuffer, byteOffset:number=0, byteLength:number=buffer.byteLength):CID
   static create(version:number, code:number, hash:Uint8Array):CID
}

My primary concern with from overload is that is inconsistent with JS built-in's. It also ends up overloading. I don't mind having another factory function for that if constructor seems controversial but I would much rather not overload from. Logic is from turns given input into (ArrayBuffer, byteOffset, byteLength) and delegates to constructor (or that other factory) create does the same thing.

CID.from(cid) could presumably serve as the formerly proposed CID.asCID(cid) wouldn't it? Just a return cid where it's deemed that the passed argument is what it needs to be, otherwise instantiate a new instance using what can be derived from it. Or would we need a form that lets us perform a full clone?

I thing that would be a bad idea because:

  1. CID.asCID(v) just pattern matches, it does not decode bytes or string or any of that, that is job for a CID.from(a)
  2. CID.asCID never throws, while CID.from throws if input is invalid.
    • Argument could be made to change that, but that would not be aligned with what .from does everywhere else.
  3. Bult-in .from functions always copy, not doing it may lead to unexpected behavior.
  4. CID.asCID and CID.from are meant to have very different performance profiles. Putting them under same API endpoint makes performance profile less clear (I think this is a Rust influence speaking)

More broadly, having similar API signatures (blame TS for overlooking exceptions) alone is not a good reason to combine things that play different roles.

mikeal commented 4 years ago

This has me thinking a bit. I’d like to unwind this a little and just consider what I’d want the ideal interface to be. This may not work here but may be what we want to move to in js-multiformats and might effect what we want here in order to facilitate a transition.

When you have multiple ways of instantiating the CID I think it is best to just not use the constructor and instead use class methods designed for each case. We do this already in @ipld/block and it’s pretty nice. But just as we do in the block interface, you need to design a relatively flexible constructor so that the class methods can instantiate without doing any unnecessary work.

In that case, is something more like this the best way to design CID?

const encode = (version, code, multihash) => concat(varint.encode(version), varint.encode(code), multihash)

class CID {
  constructor({code, version, bytes, multihash}) {
    if (typeof code !== ‘undefined’) this._code = code
    // and so on for version, bytes, multihash
  }
  get bytes {
    if (!this._bytes) {
      this._bytes = encode(this.version, this.code, this.multihash))
    }
    else return this._bytes
  }
  get version {
    if (!this._version) {
      if (this._bytes) this._version = varint.decode(this._bytes)
      else this._version = 1
    } 
    return this._version
  }
}

Basically, you want the constructor to just take whatever the from and create methods can parse out of the input so that you can avoid unnecessary work for certain properties until they are accessed, especially when those involve memcopies. There’s some work to do in the constructor to make sure you have enough information and present an error early if you don’t, but there’s no reason to create the entire byte representation until it’s asked for, and we could facilitate that with a slightly different design. There’s also no reason to parse out the code or the version until they are asked for, so you should be able to instantiate with just a buffer and defer any parsing until the properties are needed.

If you look at how these get used, there’s a lot of cases where you either never ask for the full byte representation or only ask for the fully byte representation, so we could avoid a lot of extra work by just deferring that processing until it is requested.

Gozala commented 4 years ago

Basically, you want the constructor to just take whatever the from and create methods can parse out of the input so that you can avoid unnecessary work for certain properties until they are accessed, especially when those involve memcopies. There’s some work to do in the constructor to make sure you have enough information and present an error early if you don’t, but there’s no reason to create the entire byte representation until it’s asked for, and we could facilitate that with a slightly different design. There’s also no reason to parse out the code or the version until they are asked for, so you should be able to instantiate with just a buffer and defer any parsing until the properties are needed.

That is exactly how I think about constructors so 👍 That being said, there is a caveat.

  1. It conflicts with the notion of "CID as ArrayBufferView" specifically because that implies that CID is bytes with just a different prototype chain. Therefor it has to have pointer to ArrayBuffer and byteOffset, byteLength.

    • Argument could be made to compute those lazily, but that has it's own problems see no 2.
    • It also makes some combination of the constructor properties problematic e.g. { bytes, multihash } can point to different ArrayBuffers rather than different offsets in the same buffer. If you pass just multihash that can't possibly point to the same ArrayBuffer as bytes or CID because it has not been created yet.
  2. Lazy computation of some fields would break the nice property we gained through CID.asCID, that is you can send, transfer & receive CIDs over the message channel without serialization.

  3. Some validation is required, because it appears that there are combinations of the properties that are invalid.

I think above trade-offs aren't worth the benefit you'll get from some lazy computation.

Not to mention that lazy computation has it's own drawbacks (as many haskellers would testify) by making performance characteristics less deterministic.

For all the above reasons I settled on something that I think is in the same spirit as you've described in https://github.com/multiformats/js-multiformats/pull/29 where factory methods pass (ArrayBuffer, byteOffset, byteLength) to constructor that it just assigns to fields

Both factories are a also smart in that they will cache thing, version, code, to avoid recomputing those later.

And if you do have CID byte range already you don't need to decode or encode, you can create a view via constructor.

I think this is a reasonable compromise that provides deterministic performance profile for each construction operation.

If my assumption is incorrect and lazy encode / decode (from / create) does provide significant value that outweighs tradeoffs, I would argue that they should not be represented via the same CID class, but via different class that CID.asCID and CID.from can recognize. That again would make them more deterministic.

Gozala commented 4 years ago

We have discussed this on the call today and tentatively reached an agreement on the following plan: @achingbrain you were not present at that part of the call, so please make sure to take moment to review this

  1. Implement from factory function.
  2. Implement create factory function.
  3. Retain constructor as is, but update all the documentation everywhere to use one of the above methods instead.