multiformats / js-cid

CID implementation in JavaScript
MIT License
97 stars 39 forks source link

CIDs across the message channels #109

Open Gozala opened 4 years ago

Gozala commented 4 years ago

In the context https://github.com/ipfs/js-ipfs/issues/3022 I'm running into problem with CIDs that is not too dissimilar from ones that come from Buffers. I would like to bring those up here for a discussion.

Problem

When you post things across browsing contexts data browsers perform structured cloning which implies that only handful of things will come out with the same prototype chain on the other end. Including table from MDN below:

Supported types

Object type Notes
All primitive types However, not symbols.
Boolean objects
String objects
Date
RegExp lastIndex is not preserved.
Blob
File
FileList
ArrayBuffer
ArrayBufferView Including other typed arrays.
ImageBitmap
ImageData
Array
Object Only plain objects (e.g. from object literals)
Map
Set

This implies that node Buffers come out as Uint8Arrays (as they are sub-classes) and CIDs come out as plain objects, that is following most of the time:

{
  version: 0|1,
  codec: string,
  multihash: Uint8Array,
  multibaseName: string
}

Which is problematic because e.g. dag-cbor encoder will not recognize such structures as CID and would encode it as JSON. In other words only way for things to work across message-channel we need either to:

  1. DeepCopy+Encode -> Structure Clone -> Traverse+Decode

    • Sender needs to deep-copy arbitrary data structures to swap CIDs with some encoded version.
    • Which browser will structure clone (another copy).
    • Receiver needs to traverse inputs to find endoced CIDs and swap those with decoded instances (I think copying can be avoided here)
  2. Structure Clone -> Traverse+Decode

    • CID class can be made recognizable to remove need for deep-copy + CID encoding.
    • Browser will structure clone.
    • Receiver will traverse inputs to find structure cloned CIDs and swap those with decoded instances

Neither of two is ideal. What would be much better if CID had a canonical representation that is not tied to specific class. Kind of how ArrayBuffer represents bytes and Uint8Array is just a view.

I am recognizing that this might be almost impossible change to make at this point, however it is still worse having a discussion I think. Furthermore there might be certain compromises that may provide workable compromise. E.g. if all the IPFS internals did recognized canonical representation which is not tied to CID class that would remove need for decode / encode at least on messages going from client to server. I can also imagine that new Block API might be made to recognize that canonical representation which would effectively remove need for arbitrary traversals and limit CID encode / decode to very specific API endpoints.

/cc @mikeal @vmx @achingbrain @hugomrdias @lidel

mikeal commented 4 years ago

Keep in mind that CID has breaking changes in the js-multiformats work.

There is no longer a .codec string property, and you can’t instantiate using a string name property for the codec, because these features require that you have the entire multicodec table. Instead there is a .code property with the integer for the codec and the constructor takes an integer rather than a string now.

We also no longer cache the multibaseName a CID instance was instantiated with. This introduces indeterministic behavior in the representation we decided we were better off without (we still cache multibase representations of the CID though).

So, the new CID is actually:

{
  version: 0|1,
  code: int,
  multihash: Uint8Array,
  buffer: Uint8Array
}

What would be much better if CID had a canonical representation that is not tied to specific class.

We have a similar problem in dag-json and solved it by using an object with a single property named “/“ that is the multibase encoded string of the CID. https://github.com/ipld/specs/blob/master/block-layer/codecs/dag-json.md#link-kind

Kind of how ArrayBuffer represents bytes and Uint8Array is just a view.

It’s tempting to use some kind of view on the binary representation of the CID here as well but unfortunately the way we tend to support CIDv0 is through detection that depends on the string encoding.

In the new CID you can use .toString() with no arguments and it will give you a base32 multibase encoded string for CIDv1 and the bare base58btc encoding for CIDv0.

Gozala commented 4 years ago

So, the new CID is actually:

{
 version: 0|1,
 code: int,
 multihash: Uint8Array,
 buffer: Uint8Array
}

What's does the buffer represent there ?

We have a similar problem in dag-json and solved it by using an object with a single property named “/“ that is the multibase encoded string of the CID. https://github.com/ipld/specs/blob/master/block-layer/codecs/dag-json.md#link-kind

Different way to frame this issue would be, can we work out a more general solution that can be recognized across the code base instead of needing a codec specific solutions like this.

It’s tempting to use some kind of view on the binary representation of the CID here as well but unfortunately the way we tend to support CIDv0 is through detection that depends on the string encoding.

I think backwards compatibility is obviously important, however I think that we can make progress on better CID representation separate from that, with expectation that codecs may need to do some detection to recognize former encodings like base58btc strings.

In the new CID you can use .toString() with no arguments and it will give you a base32 multibase encoded string for CIDv1 and the bare base58btc encoding for CIDv0.

I'm not sure this helps with described problem. Were you suggesting to serialize CIDs via toString() and deserialize those on the other end ?

mikeal commented 4 years ago

What's does the buffer represent there ?

It’s the full binary representation of the CID as a Uint8Array.

Different way to frame this issue would be, can we work out a more general solution that can be recognized across the code base instead of needing a codec specific solutions like this.

A few years back there was a belief that we could have a “canonical JSON representation” of IPLD and that should be used everywhere as an intermediary format. This is still visible in a lot of the HTTP API’s in IPFS. That turned out not to work so well which is why we defined the “IPLD Data Model” to have top level types for Link (CID) and Binary.

{“/“: stringCID} is actually taken from that old canonical JSON representation, so to the extent that there is a generic solution, that’s it ;)

I'm not sure this helps with described problem. Were you suggesting to serialize CIDs via toString() and deserialize those on the other end ?

I’m saying that if you want to use {“/“: stringCID} you can reliably do it in w/ the new CID implementation with something along the lines of:

if (CID.isCID(value)) value = { “/“: value.toString() }

You actually cannot do this with require(‘cids’) because it has indeterministic behavior in the default multibase encoding used by toString().

Gozala commented 4 years ago

So here is rough proposal what we could do to address this problem (I'm making a lot of assumptions here, so please call out when those are incorrect).

What I would like to propose is:

  1. Bridge canonical and class representations.
  2. Make CID.isCID recognize either representation.
  3. Update our code such that presence of CID class methods / properties is irrelevant.

I think this could be achieved by:

  1. Making a CID class a subclass of Uint8Array so that it's bytes correspond to canonical representation. All the convenience methods, caching etc.. could be kept roughly the same.
  2. Make CID.isCID recognize Uint8Array that is binary representation of CID regardless if the view is CID or Uint8Array.
  3. Provide convenience methods / accessors of the CID instance as static methods of CID class and use those internally instead of instance methods.

With all the above problem described goes away, because things coming on the other end of the message channel would still represent CIDs and be treated as such. Question however is:

  1. Does CID binary representation contains all the information to provide a reliable way of identifying it as such (not be confused with some other binary data) ?
    • If not can we add some kind of tag and reserve that in multiformats table ?
  2. Is there better representation that is not subclass of Uint8Array that achieves all of the above ?
Gozala commented 4 years ago

{“/“: stringCID} is actually taken from that old canonical JSON representation, so to the extent that there is a generic solution, that’s it ;)

Are you suggesting that if I do following:

const dagCBOR = require('ipld-dag-cbor')
dagCBOR.util.serialize({ link: {'/': stringCID} })

It would be recognized as link ? As far as I can tell it will not.

More relevant question (assuming above is Yes or could be made Yes) can we change CID implementation such that once you do following:

worker.port.postMessage({ link: new CID(...) })

what comes out on the other end would be:

{ link: {'/': stringCID} }

Or at least something that dag-cbor and all other codecs can recognize as CID and serialize them as such.

mikeal commented 4 years ago

Bridge canonical and class representations.

That’s sort of what I assumed we would end up doing. In whatever the communication bridge between the worker is you would need to serialize and deserialize CID’s with a little extra work.

Make CID.isCID recognize either representation. Update our code such that presence of CID class methods / properties is irrelevant.

I’m pretty sure this will break a lot of stuff. I’d need some time to work through all the consequences of isCID accepting this form but my initial reaction is “please don’t.”

There are a lot of external consumers of this interface that use the properties you’d be deprecating. So much so that, even though we’re doing a breaking change to CID in js-multiformats, we have to make it still pass the isCID check or else people assume it’s a plain object and a bunch of code in our ecosystem breaks without surfacing any noticeable errors.

Frankly, this is a big step backwards from IPLD’s point of view. Link being a top level type in the Data Model unlocked a lot of stuff for us and it gave us a clear path for how to implement codecs across various formats and identify the optimal path for Data Model support in each language.

This isn’t the first time we’ve had to figure out how to represent a Link type within a specific set of type constraints. It’s always a bit of work but the end result is quite nice because you end up with a fairly unique object you can reason about. Pushing in the other direction, making the type less distinguishable, isn’t the direction I would push in. There has to be a way to solve this problem in the communication channel between the workers that would then have minimal impact on the broader ecosystem.

mikeal commented 4 years ago

worker.port.postMessage({ link: new CID(...) })

This implies a requirement that I don’t believe we have.

I would assume that these worker processes aren’t just responding to every random message anyone decides to post. The API will actually be something along the lines of

const resp = await ipfsWorker.send({ link: new CID(...) })

And that will turn into something along the lines of

worker.port.postMessage({ ipfsMessage: { msgID: id, data: { link: { “/“: stringCID }}})

The other side can then de-serialize that form into a CID instance before it hands it off to other code. I don’t see a compelling reason why all libraries, and the entire ecosystem of modules, needs to move towards accepting the serialized {“/“: stringCID} form.

Yes, this means we suffer some extra base encoding/decoding but unless we want to drop support for CIDv0 I think we’re going to end up eating that anyway.

Gozala commented 4 years ago

’m pretty sure this will break a lot of stuff. I’d need some time to work through all the consequences of isCID accepting this form but my initial reaction is “please don’t.”

There are a lot of external consumers of this interface that use the properties you’d be deprecating. So much so that, even though we’re doing a breaking change to CID in js-multiformats, we have to make it still pass the isCID check or else people assume it’s a plain object and a bunch of code in our ecosystem breaks without surfacing any noticeable errors.

These are good points. I am also recognizing that overloading CID.isCID is not necessarily what is needed to address the problem. It was a way, which has enough downsides to not be worth it.

What about the following instead:

  1. Bridge canonical and class representation.
  2. Provide something along the lines of CID.isCID that recognizes either representation.
  3. Update our code such that presence of CID class methods / properties is not a required.

Which I could be achieved as follows:

  1. Make sure that port.postMessage(new CID(...)) comes out as Placeholder on the other end.
  2. Introduce static method e.g. CID.match(input:any):CID|null that returns CID if input is either CID or a Placeholder
  3. Update internal codebase (mostly codecs at this point) such that instead of doing following(quoting inline below)

    if (CID.isCID(obj)) {
      // do something with CID
    }

    We will do following instead

    const cid = CID.match(obj)
    if (cid) {
      // do something with CID
    }

Figuring out what the Placeholder here should be is still required, but this would remove need for two extra traversals on both ends of the channel.

Gozala commented 4 years ago

Frankly, this is a big step backwards from IPLD’s point of view. Link being a top level type in the Data Model unlocked a lot of stuff for us and it gave us a clear path for how to implement codecs across various formats and identify the optimal path for Data Model support in each language.

I am not suggest to not make Link a non to level type, I'm just suggesting finding a representation which can cross thread boundaries without loosing it's type information.

Gozala commented 4 years ago

I would assume that these worker processes aren’t just responding to every random message anyone decides to post. The API will actually be something along the lines of

It is not however some message payloads do not have predefined structures (as they come from user space) and there for could contain CIDs that come out as non CIDs on the other end. Which implies following (quoting from problem statement):

  1. Sender needs to deep-copy arbitrary data structures to swap CIDs with some encoded version.
    1. Which browser will structure clone (another copy).
    2. Receiver needs to traverse inputs to find endoced CIDs and swap those with decoded instances (I think copying can be avoided here)

Motivation here is

  1. Make CIDs come out on the other end (after they are structure cloned) such that receiver can recognize those. Which would remove need for step 1.
  2. Would be nice to make some changes to the internals so that step 3 is unnecessary.

Getting both would be ideal, but getting even 1 would be good.

Gozala commented 4 years ago

So here is a more concrete example based on input I got here and discussion with @mikeal on a different sync channel.

I understand that having a canonical top level type is important for IPLD team and using shapes to recognize types feels like a step backwards. However I think there are ways to achieve a representation that can't be accidentally be mistaken for something else e.g.:

class CID extends Blob {
   constructor(...args) {
     const {version, code, multihash, buffer } = parseCID(...args)
     super([ buffer ], { type: "application/cid" })
     this.version = version
     this.code = code
     this.multihash = multihash
     this.buffer = buffer
   }
   // all the existing methods go here 

   static match(input) {
     if (input instanceof CID) {
       return input
     } else if (input instanceof Blob && input.type === "application/cid") {
       return new CID(input)
     }
  }
}

port.postMessage({ link: new CID(...) })
// comes out on the other end as `blob` with type `'application/cid'`

// Only thing codecs would have to change is instead of doing `CID.isCID(input)` switch to
const cid = CID.match(input)
if (cid) {
  // ...
}
Gozala commented 4 years ago

I would like to also note that this is not just about passing things around worker threads. It's bit more general web platform thing, e.g storing things in IndexedDB has exact same constraints as in it goes through structured cloning. I believe that picking a representation that is compatible with structured cloning constraints is going to be win long term.

Just as with node Buffers it provides some convenience, but it is also comes with very similar trade-offs that motivates leaving it behind.

Gozala commented 4 years ago

One thing that came out in conversation with @mikeal

because it means that you can’t represent data that wanted to use {'/':someOtherMeaninfulString}

Acyclic nature of DAG could be exploited to represent links that otherwise could not be part of the DAG - Represent it with a cycle. e.g. cid.cid === cid.

That way there is no chance that { link: cid } could mean anything else if (link.cid.cid == link.cid) as it would be invalid DAG node. Cycles are also compatible with structured clone algorithm so the assertion would hold across thread boundaries as well.

rvagg commented 4 years ago

@Gozala can you explain this last comment a bit more? I don't understand how this is relevant to the discussion here.

fwiw I really like the idea of subclassing Uint8Array but see the current isCID() as a very big problem, and the usage is very wide (and beyond our control) that changing the semantics of it could be very disruptive for users. But maybe there's a bridge that offers a migration path while maintaining the current style for the cases where it's used now.

The new js-multiformats work has messed up isCID() already though in its current format least, to the point where it's impractical to mix code that uses both cids.CID and multiformats.CID. Either that needs to be addressed or we just accept that js-multiformats is a totally new thing and use that as an opportunity to do things differently.

Gozala commented 4 years ago

@Gozala can you explain this last comment a bit more? I don't understand how this is relevant to the discussion here.

So as I understand main concern with not relying on cid been instance of CID was that there is no reliable way to tell if certain structure is CID or just a structure that resembles a shape. Example of dag-json encoding links as {'/':someOtherMeaninfulString} was used, where it may be an actual JSON structure instead a link.

Which lead me to idea that we could take advantage of asyclic nature of DAG to represent links (CIDs) in a (cyclic) shape that otherwise would be invalid.

fwiw I really like the idea of subclassing Uint8Array but see the current isCID() as a very big problem, and the usage is very wide (and beyond our control) that changing the semantics of it could be very disruptive for users. But maybe there's a bridge that offers a migration path while maintaining the current style for the cases where it's used now.

The way I see it CID.isCID could be a more conservative check than cid = CID.asCID(input). I would say it's equivalent to how bunch of libraries today check for Buffer.isBuffer(v) that I have being gradually changing to if (v instanceof Uint8Array) v = Buffer.from(v.buffer, v.byteOffset, v.byteLength). Same could be done to migrate from CID.asCID(input) to cid = CID.asCID(v), which would not break existing usage, but would make CIDs more usable in new use cases.

The new js-multiformats work has messed up isCID() already though in its current format least, to the point where it's impractical to mix code that uses both cids.CID and multiformats.CID. Either that needs to be addressed or we just accept that js-multiformats is a totally new thing and use that as an opportunity to do things differently.

I feel like I don't have enough context here.

Gozala commented 4 years ago

Few note here:

  1. CID.isCID is flawed pattern, which I've attempted to outline and propose alternative in #111, which may also address the issue I have being having here.
  2. Impression I'm getting from the conversation is that somehow use of symbol to differentiate CID from other types is only acceptable way to do it. I would like to invite you to reconsider this position.
    • Only thing that makes them "specially fit" is that using e.g. string property for tagging could collide with otherwise valid JSON. However there are other ways to tag that which would also be free of that problem:
      1. Cycle in the structure as mentioned in https://github.com/multiformats/js-cid/issues/109#issuecomment-642298233 which is structure clone-able can't collide with otherwise valid DAG node (or JSON).
      2. An undefined is not valid JSON, but is structure clone-able, in other words something along these lines isThisIPLDCIDInstance in cid && cid.isThisIPLDCIDInstance === undefined would be both collision free tag.
    • If there are other good reasons to be so invested in current representation, it would be good to list those to see if those could be addressed with alternative design.
  3. Through out the exchange I'm getting feedback that performance impact of serialization / deserialization is going to be negligible. So I feel like I calling explicitly that primary motivation is to reduce unnecessary complexity, and not the performance to be gained from this.