multiformats / js-cid

CID implementation in JavaScript
MIT License
98 stars 39 forks source link

Proposal: Let's introduce `asCID` static method #111

Open Gozala opened 4 years ago

Gozala commented 4 years ago

This idea first was proposed in #109 and later came up in #110. I thought it would be best to factor it out into separate thread and discuss it here.

Problem

At the moment CID.isCID(v) is used all over the place as a glorified v instanceof CID check. However it does not check if v is instanceof CID instead it checks presence of special symbol. That addresses problem with having multiple different versions of CID implementations, however it does overlook the fact that two different version may in fact be incompatible e.g. new version may change behavior of specific method e.g. toString I think that is already happening (per https://github.com/multiformats/js-cid/issues/109#issuecomment-641640712)

In the new CID you can use .toString() with no arguments and it will give you a base32 multibase encoded string for CIDv1 and the bare base58btc encoding for CIDv0.

Even if above is backwards compatible change, incompatible change may come sooner or later.

This is a symptom of more general issue with validation pattern. If you're inclined I would invite to read excellent parse don't validate primer that explains hazzards of that problem (in the context of typed language, but it applies to this just as well)

Proposal

Changing behavior of CID.isCID(v) to do component validation (if I interperent @hugomrdias suggestion https://github.com/moxystudio/js-class-is/issues/25#issuecomment-642803835) is going to be too disruptive to be worth it. Furthermore it would not address the problem stated above. Instead I would like to propose an alternative in form of asCID static method to which existing code base could be gradually migrated and future code base should default.

Implementation could be something along these lines:

class CID {
  /**
   * If passed value is a CID or compatible JSON representation, CID instance for it is returned
   * otherwise `null` is returned.
   * returns `null`.
   * @param {any} value
   * @returns {CID|null}
   */
  asCID(value) {
    // If is CID nothing to do here
     if (value instanceof CID) {
       return value
     }
     // If CID components validate it's CID from other realm or another version of CID
     // In that case CID of this class is assembled from the components.
     else if (CIDUtil.checkCIDComponents(other) == null) {
       return new CID(value.version, value.codec, value.multihash, value.multibaseName)
     }
     // Otherwise it's not valid / compatible CID so return `null`
     else {
       return null
     }
  }
}

With such method in place existing pattern of

if (CID.isCID(cid)) {
  // use cid
}

Could gradually be transitioned to following pattern instead:

const cid = CID.asCID(value)
if (cid) {
   // use cid
}

It is a bit more verbose, but it would avoid issues that may arise from version incompatibilities. It will also preserve CID-ness when things are structured cloned.

mikeal commented 4 years ago

I’m not sure what this accomplishes.

Is this meant to convert between the new and old CID implementations?

If so, this isn’t going to work as stated because what is causing the breaking change is the fact that the new implementation doesn’t have the full codec table, so you won’t have the .codec property on the new instances and you won’t be able to instantiate with a string codec name, you have to use the integer instead.

You should be able to convert from the new CID to the old CID instance though.

However, I don’t think it’s worth writing this particular compatibility layer because there’s other changes across the new stuff as well. This is all covered by the legacy() method provided by js-multiformats, it will take a new style codec and give you back an old style codec complete with conversions to the old CID class and Uint8Array conversions back to Buffer which the old stuff also relies on.

Having code that sits in-between this migration is probably going to be more pain than it’s worth, there’s a lot of small differences covered by legacy you’ll want to deal with, and you should also consider how dependent existing consumers are on Buffer.

This isn’t just a breaking change across our code bases, it’s also a breaking change across all of our consumers. We have to consider what happens to third party consumers when we swap these type implementations out and ensure that we either maintain compatibility or cause an exception because there are many cases where you could not cause an exception but still cause incorrect behavior that consumers won’t even notice and instead be writing bad data. This is why we decided to make the new CID instance pass isCID but throw on property access of .codec.

Gozala commented 4 years ago

I’m not sure what this accomplishes.

Is this meant to convert between the new and old CID implementations?

If so, this isn’t going to work as stated because what is causing the breaking change is the fact that the new implementation doesn’t have the full codec table, so you won’t have the .codec property on the new instances and you won’t be able to instantiate with a string codec name, you have to use the integer instead.

It may not cover this specific change because as you said it may not be possible to convert without codec table. However it can cover other changes that may occur. Here is the contrived example:

V1

class Bla {
  static get typeSymbol() {
    return Symbol.for('Bla')
  }
  static isBla(v) {
    return v && v[Bla.typeSymbol]
  }
  static asBla(v) {
    if (v instanceof v) {
      return v
    } else if (v && Buffer.isBuffer(v.buffer)) {
      return new Bla(v.buffer)
    } else {
       return null
    }
  }
  constructor(buffer) {
    if (!Buffer.isBuffer(buffer)) {
      throw new Error("Argument should be a buffer")
    }
    this.buffer = buffer
  }
  toHexString() {
     return this.buffer.toString('hex')
  }
}

V2

class Bla {
  static get typeSymbol() {
    return Symbol.for('Bla')
  }
  static isBla(v) {
    return v && v[Bla.typeSymbol]
  }
  static asBla(v) {
    if (v instanceof v) {
      return v
    } else if (v && Buffer.isBuffer(v.buffer)) {
      return new Bla(v.buffer)
    } else {
       return null
    }
  }
  constructor(buffer) {
    if (!Buffer.isBuffer(buffer)) {
      throw new Error("Argument should be a buffer")
    }
    this.buffer = buffer
  }
  toHexString() {
    return this.buffer.toString('hex')
  }
  toBase64String() {
     return return this.buffer.toString('base64')
   }
}

User assuming V2

if (Bla.isBla(v)) {
  v.toBase64String()
}

If v is V1 it will fail on the other hand following will work as expected

const bla = Bla.asBla(v)
if (bla) {
  bla.toBase64String()
}

Edit: Added constructors to the examples for the completeness

Gozala commented 4 years ago

@mikeal I think you were reading this in context of new CIDs work happening in IPLD, which is likely due to me mentioning comment about toString method changes.

I think reading with that context is misleading. It is unrelated. There is problem with isCID pattern that asCID pattern would handle. Proposed pattern would also work across realms which class-is aspired to do, but fails as I'm finding in my work.

rvagg commented 4 years ago

I like this pattern and I think it elegantly solves most of the challenges here and the instanceof shortcut is really nice for speeding this up in the common case. It seems to me that you just need to come up with a satisfactory solution to the faux-CID checkCIDComponents() problem that's being discussed in #109 (a circular property would probably work but that's quite a hack!).

I think this method might be able to solve the old-CID new-CID conversion problem too, both ways: for new to old it just has to do a lookup of code to codec in the base table it already has, but for old to new, the CID constructor happens (for some reason?) to have the base table attached as codecs, so it's just a matter of doing code = cid.constructor.codecs[cid.codec] (where cid is an instantiated CID).

We do have cases in IPLD that rely on an isCID() to properly encode data and I can imagine those becoming a performance concern. e.g. in dag-cbor we have to do a deep walk of every object that gets encoded and replace CIDs with tagged CBOR types so that borc properly encodes them as TAG=42 elements. Having to use an asCID() would mean we're creating a ton of null assignments for most object properties and doing some amount of internal property inspection for a checkCIDComponents(). For isCID() we're only doing a quick boolean check on a Symbol attached to each property. These may be costs that a smart VM could deal with, but it might be useful to figure that out ahead of time.

https://github.com/ipld/js-ipld-dag-cbor/blob/master/src/util.js#L46-L48

Similar in DAG-JSON:

https://github.com/ipld/js-dag-json/blob/master/index.js#L8-L10

achingbrain commented 4 years ago

As I see it, the problem is you get a CID back from another module and you're not sure if it supports the API you expect. Historically a test would pick this up and we would solve the problem by PRing the other module to upgrade its CID dependency to the latest version, and so the new version would propagate through the ecosystem.

If this is not an option then a code change may be appropriate.

As proposed it looks like asCID is doing a bunch of the checks that the CID class does in its constructor.

Why not just add any additional checks there instead of having to add conditional logic to the calling code every time we touch a CID from another module?

Gozala commented 4 years ago

As proposed it looks like asCID is doing a bunch of the checks that the CID class does in its constructor.

It does same exact check as isCID as it’s primary goal is to replace it. There’s some overlap with what CID constructor does, but that’s just coincidence.

Why not just add any additional checks there instead of having to add conditional logic to the calling code every time we touch a CID from another module?

It really is poor mans pattern matching rather then constructor. Meaning it may return CID if value passed represents CID otherwise it returns null. Which typically constructors can’t do & I’m not sure overloading constructor in this was would be a good idea.

Argument could be made that constructor could throw instead of returning null, but that would have negative impact on usability and performance.

Gozala commented 4 years ago

Having to use an asCID() would mean we're creating a ton of null assignments for most object properties and doing some amount of internal property inspection for a checkCIDComponents(). For isCID() we're only doing a quick boolean check on a Symbol attached to each property.

Not to object properties, to local variables that and if not null only then assign. This really meant to do exact same check except null check instead of boolean.

I do mention checkCIDComponents, but it really should do what isCID does (some property chech)

achingbrain commented 4 years ago

Argument could be made that constructor could throw instead of returning null

Throwing when it encounters invalid data is exactly what the constructor does right now.

It does same exact check as isCID as it’s primary goal is to replace it.

If that's so, then just replace it. Don't give the user multiple ways of doing the same thing.

However I'd rather not add another method that does 50% of isCID's job and 50% of the constructors job.

If anything, I'd like to see v1 of this module released so we can properly communicate breaking changes through semver (e.g. not lose instanceof compatibility because a new method was added and we had to bump the 0.x.0 minor and we have a bunch of old modules in our deps that haven't had PRs submitted to them to upgrade the cids dep), class-is removed and use of instanceof instead of isCID.

Gozala commented 4 years ago

Throwing when it encounters invalid data is exactly what the constructor does right now.

Ok but here is the prominent pattern I see today within PL code base and outside of it:

if (CID.isCID(value)) {
  useCID(value)
}

// ...etc

Which following this proposal would turn into:

const cid = CID.asCID(value)
if (cid) {
  useCID(cid)
}

// ...etc

Embracing constructor here would change that into (what I'd say an awkward) usage:

try {
  const cid = new CID(value)
  useCID(cid)
} catch(_) {}
// ...etc

Which also introduces a surface for bugs where exception thrown by useCID can get swallowed unintentionally.

If that's so, then just replace it. Don't give the user multiple ways of doing the same thing.

However I'd rather not add another method that does 50% of isCID's job and 50% of the constructors job.

I agree I think long term goal is to replace isCID with asCID pattern, however I think there is a benefit to doing a gradual transition instead of a sweep requiring tight coordination. If replacing seems like a better option I'm down with that as well.

Gozala commented 4 years ago

Had a conversation with @mikeal today about this & I would like to provide a sketch as what I think we agreed was a worthy avenue to explore.

class CID {
  /**
   * @param {any} value
   * @returns {CID|null}
   */
  static asCID(value) {
     // If it is already a CID instance no allocation is necessary
     // just return it back
    if (value instanceof CID) {
      return value
    }
    // If it is not instance of CID it might be instance from different version of CID
    // library or it could be CID instance that was structure cloned. Instead of using
    // symbol we recognize `value.asCID` circular that renders `value` invalid for
    // JSON & DAG representation. In other words it's equivalent of current symbol
    // check.
    else if (value != null && value.asCID === value) {
      const {version, codec, multihash, multibaseName} = value
      return new CID(version, codec, multihash, multibaseName)
    } else {
      return null
    }
  }
  constructor(version, codec, multihash, multibaseName) {
    this.version = version
    this.codec = codec
    this.multihash = multihash
    this.multibaseName = multibaseName
    this.asCID = this
  }
  toJSON() {
    const { version, codec, multihash,  multibaseName } = this
   return { version, codec, multihash,  multibaseName }
  }
}

What this gives us is following:

Works well with structure cloning algorithm, here is the sample illustrating it:

const main = (multihash) => {
  var cid = new CID(0, 'dag-pb', multihash, 'base32')
  var {port1:sender, port2:receiver} = new MessageChannel()
  receiver.onmessage = ({data}) => {
     data instanceof CID // false
     const cid2 = CID.asCID(data)
     cid2 instanceof CID // true
  }
  sender.postMessage(cid)
}

Turns CID into non Can be serialized with JSON.serializable

Given our conversation I've gathered this was nice to have.

var cid = new CID(0, 'dag-pb', multihash, 'base32')
JSON.stringify(cid) // => {"version":0,"codec":"dag-pb", "multibaseName":"base32","multihash":{"0":190,"1":203,....}}

Existing CID.isCID can be kept for gradual transition

This assumes that CID.isCID is unchanged (uses the symbol check as it does today)

const main = (multihash) => {
  var cid = new CID(0, 'dag-pb', multihash, 'base32')
  var {port1:sender, port2:receiver} = new MessageChannel()
  receiver.onmessage = ({data}) => {
      CID.isCID(data) // => false
     const cid2 = CID.asCID(data)
     cid2 instanceof CID // => true
  }
  sender.postMessage(cid)
}

Meaning that CID.isCID will continue to work just as it does today with-in the single thread boundary. However CID instance passed across thread boundaries will fail CID.isCID(v) check (just as today). On the other hand if CID.asCID(v) is used on the receiver thread v will be recognized / casted to CID.

This would enable:

  1. Us to upgrade dag-cbor / dag-pb etc... to use CID.asCID(v) pattern without breaking any code or disrupting users.
  2. Upgrade different libraries to CID.asCID(v) without having to coordinate all changes at once.
  3. Once our libraries are updated worker thread will no longer need to de-serialize incoming data as CIDs will be recognized.

I have not performed any benchmarks, but I'm pretty confident that performance of CID.asCID(v) should be on par with CID.isCID(v).

This code was for illustration purposes so constructor was simplified from what is in the tree.

Gozala commented 4 years ago

Me and @mikeal also have talked about the idea of representing CID as data view over the underlying buffer similar to how Uint8Array. If we were to go about it it would probably look something like code below, with all the properties just being lazy getters:

class CID {
  /**
   * 
   * @param {ArrayBuffer} buffer 
   * @param {number} byteOffset 
   * @param {number} byteLength 
   */
  constructor(buffer, byteOffset, byteLength) {
    this.buffer = buffer
    this.byteOffset = byteOffset
    this.byteLength = byteLength
  }
  get version() {
    const { version } = materialize(this)
    return version
  }
  get codec() {
    const { codec } = materialize(this)
    return codec
  }
  get multihash() {
    const { multihash } = materialize(this)
    return multihash
  }
}

const materialize = (cid) => {
  const bytes = new Uint8Array(cid.buffer, cid.byteOffset, cid.byteLength)
  const firstByte = bytes[0]
  if (firstByte === 1) {
    const bytes = new Uint8Array(cid.buffer, cid.byteOffset + 1, cid.byteLength - 1)
    Object.defineProperties(cid, {
        version: { value: 1 },
        codec: { value: multicodec.getCodec(bytes) },
        multihash: { value: multicodec.rmPrefix(bytes) }
      })
  } else {
    Object.defineProperties(cid, {
      version: { value: 0 },
      codec: { value: 'dag-pb' },
      multihash: { value: bytes }
    })
  }

  return cid
}

I do think that might be a good idea, however it is orthogonal to the overall CID.asCID proposal as they are not mutually exclusive nor they depend on each other.

vmx commented 4 years ago

Turns CID into non JSON.serializable

Given our conversation I've gathered this was nice to have.

Why is it nice to have to break JSON.stringify() if your data contains CIDs?

Gozala commented 4 years ago

Why is it nice to have to break JSON.stringify() if your data contains CIDs?

Please correct me @mikeal if I got it wrong, but I think it was something to the effect of preventing error due to naive serialization.

If I did get that wrong, and if want to get JSON.stringify(cid) to work, we can add toJSON() method that omits asCID.

Gozala commented 4 years ago

I see now that JSON.stringify was also discussed at https://github.com/multiformats/js-multiformats/pull/18#issuecomment-645068878 and was decided in favor of not breaking JSON.stringify, I've updated sections to reflect that decision.

Gozala commented 4 years ago

Me @achingbrain and @hugomrdias had a discussion about this on call other day, and I will try to summarize it here.

mikeal commented 4 years ago

Now that you’re considering a breaking change (deprecating isCID) I should note that there’s a future upgrade to multiformats which is also a breaking change to CID, among other things. If you do go down that path you should probably combine it with a migration to the new interface so that you don’t suffer 2 big refactors.

Gozala commented 4 years ago

Now that you’re considering a breaking change (deprecating isCID) I should note that there’s a future upgrade to multiformats which is also a breaking change to CID, among other things. If you do go down that path you should probably combine it with a migration to the new interface so that you don’t suffer 2 big refactors.

@mikeal I have couple of questions:

  1. Is there path to changing CID such that it would be future compatible with new multiformats, but does not require switching everything to multiformats ?
  2. Is multiformats (or some pieces of it) ready to be integrated into js-ipfs ?
  3. How does ArrayBufferView approach described in the notes above fit into the multiformats ?
mikeal commented 4 years ago

@Gozala these answers are a bit complicated, I’m going to join the Core Implementations meeting on Monday and can discuss it in more detail there.

Gozala commented 3 years ago

We had a higher bandwidth exchange today and have settled on the following course of action:

Phase 1: Implement

Phase 2: Evaluate