Open Gozala opened 4 years ago
Keep in mind that CID
has breaking changes in the js-multiformats
work.
There is no longer a .codec
string property, and you can’t instantiate using a string name property for the codec, because these features require that you have the entire multicodec table. Instead there is a .code
property with the integer for the codec and the constructor takes an integer rather than a string now.
We also no longer cache the multibaseName
a CID
instance was instantiated with. This introduces indeterministic behavior in the representation we decided we were better off without (we still cache multibase representations of the CID though).
So, the new CID
is actually:
{
version: 0|1,
code: int,
multihash: Uint8Array,
buffer: Uint8Array
}
What would be much better if CID had a canonical representation that is not tied to specific class.
We have a similar problem in dag-json
and solved it by using an object with a single property named “/“ that is the multibase encoded string of the CID. https://github.com/ipld/specs/blob/master/block-layer/codecs/dag-json.md#link-kind
Kind of how ArrayBuffer represents bytes and Uint8Array is just a view.
It’s tempting to use some kind of view on the binary representation of the CID here as well but unfortunately the way we tend to support CIDv0 is through detection that depends on the string encoding.
In the new CID
you can use .toString()
with no arguments and it will give you a base32
multibase encoded string for CIDv1 and the bare base58btc
encoding for CIDv0.
So, the new CID is actually:
{ version: 0|1, code: int, multihash: Uint8Array, buffer: Uint8Array }
What's does the buffer
represent there ?
We have a similar problem in dag-json and solved it by using an object with a single property named “/“ that is the multibase encoded string of the CID. https://github.com/ipld/specs/blob/master/block-layer/codecs/dag-json.md#link-kind
Different way to frame this issue would be, can we work out a more general solution that can be recognized across the code base instead of needing a codec specific solutions like this.
It’s tempting to use some kind of view on the binary representation of the CID here as well but unfortunately the way we tend to support CIDv0 is through detection that depends on the string encoding.
I think backwards compatibility is obviously important, however I think that we can make progress on better CID representation separate from that, with expectation that codecs may need to do some detection to recognize former encodings like base58btc strings.
In the new CID you can use .toString() with no arguments and it will give you a base32 multibase encoded string for CIDv1 and the bare base58btc encoding for CIDv0.
I'm not sure this helps with described problem. Were you suggesting to serialize CIDs via toString()
and deserialize those on the other end ?
What's does the buffer represent there ?
It’s the full binary representation of the CID as a Uint8Array.
Different way to frame this issue would be, can we work out a more general solution that can be recognized across the code base instead of needing a codec specific solutions like this.
A few years back there was a belief that we could have a “canonical JSON representation” of IPLD and that should be used everywhere as an intermediary format. This is still visible in a lot of the HTTP API’s in IPFS. That turned out not to work so well which is why we defined the “IPLD Data Model” to have top level types for Link (CID) and Binary.
{“/“: stringCID}
is actually taken from that old canonical JSON representation, so to the extent that there is a generic solution, that’s it ;)
I'm not sure this helps with described problem. Were you suggesting to serialize CIDs via toString() and deserialize those on the other end ?
I’m saying that if you want to use {“/“: stringCID}
you can reliably do it in w/ the new CID implementation with something along the lines of:
if (CID.isCID(value)) value = { “/“: value.toString() }
You actually cannot do this with require(‘cids’)
because it has indeterministic behavior in the default multibase encoding used by toString()
.
So here is rough proposal what we could do to address this problem (I'm making a lot of assumptions here, so please call out when those are incorrect).
CID.isCID
predicate that does not check canonical representation, but rather class instance.What I would like to propose is:
CID.isCID
recognize either representation.I think this could be achieved by:
CID
class a subclass of Uint8Array
so that it's bytes correspond to canonical representation. All the convenience methods, caching etc.. could be kept roughly the same.CID.isCID
recognize Uint8Array
that is binary representation of CID
regardless if the view is CID
or Uint8Array
.CID
instance as static methods of CID
class and use those internally instead of instance methods.With all the above problem described goes away, because things coming on the other end of the message channel would still represent CIDs and be treated as such. Question however is:
Uint8Array
that achieves all of the above ?
{“/“: stringCID}
is actually taken from that old canonical JSON representation, so to the extent that there is a generic solution, that’s it ;)
Are you suggesting that if I do following:
const dagCBOR = require('ipld-dag-cbor')
dagCBOR.util.serialize({ link: {'/': stringCID} })
It would be recognized as link ? As far as I can tell it will not.
More relevant question (assuming above is Yes or could be made Yes) can we change CID implementation such that once you do following:
worker.port.postMessage({ link: new CID(...) })
what comes out on the other end would be:
{ link: {'/': stringCID} }
Or at least something that dag-cbor
and all other codecs can recognize as CID
and serialize them as such.
Bridge canonical and class representations.
That’s sort of what I assumed we would end up doing. In whatever the communication bridge between the worker is you would need to serialize and deserialize CID’s with a little extra work.
Make CID.isCID recognize either representation. Update our code such that presence of CID class methods / properties is irrelevant.
I’m pretty sure this will break a lot of stuff. I’d need some time to work through all the consequences of isCID
accepting this form but my initial reaction is “please don’t.”
There are a lot of external consumers of this interface that use the properties you’d be deprecating. So much so that, even though we’re doing a breaking change to CID
in js-multiformats
, we have to make it still pass the isCID
check or else people assume it’s a plain object and a bunch of code in our ecosystem breaks without surfacing any noticeable errors.
Frankly, this is a big step backwards from IPLD’s point of view. Link
being a top level type in the Data Model unlocked a lot of stuff for us and it gave us a clear path for how to implement codecs across various formats and identify the optimal path for Data Model support in each language.
This isn’t the first time we’ve had to figure out how to represent a Link
type within a specific set of type constraints. It’s always a bit of work but the end result is quite nice because you end up with a fairly unique object you can reason about. Pushing in the other direction, making the type less distinguishable, isn’t the direction I would push in. There has to be a way to solve this problem in the communication channel between the workers that would then have minimal impact on the broader ecosystem.
worker.port.postMessage({ link: new CID(...) })
This implies a requirement that I don’t believe we have.
I would assume that these worker processes aren’t just responding to every random message anyone decides to post. The API will actually be something along the lines of
const resp = await ipfsWorker.send({ link: new CID(...) })
And that will turn into something along the lines of
worker.port.postMessage({ ipfsMessage: { msgID: id, data: { link: { “/“: stringCID }}})
The other side can then de-serialize that form into a CID instance before it hands it off to other code. I don’t see a compelling reason why all libraries, and the entire ecosystem of modules, needs to move towards accepting the serialized {“/“: stringCID}
form.
Yes, this means we suffer some extra base encoding/decoding but unless we want to drop support for CIDv0 I think we’re going to end up eating that anyway.
’m pretty sure this will break a lot of stuff. I’d need some time to work through all the consequences of isCID accepting this form but my initial reaction is “please don’t.”
There are a lot of external consumers of this interface that use the properties you’d be deprecating. So much so that, even though we’re doing a breaking change to CID in js-multiformats, we have to make it still pass the isCID check or else people assume it’s a plain object and a bunch of code in our ecosystem breaks without surfacing any noticeable errors.
These are good points. I am also recognizing that overloading CID.isCID
is not necessarily what is needed to address the problem. It was a way, which has enough downsides to not be worth it.
What about the following instead:
CID.isCID
that recognizes either representation.Which I could be achieved as follows:
port.postMessage(new CID(...))
comes out as Placeholder
on the other end.CID.match(input:any):CID|null
that returns CID
if input is either CID
or a Placeholder
Update internal codebase (mostly codecs at this point) such that instead of doing following(quoting inline below)
if (CID.isCID(obj)) {
// do something with CID
}
We will do following instead
const cid = CID.match(obj)
if (cid) {
// do something with CID
}
Figuring out what the Placeholder
here should be is still required, but this would remove need for two extra traversals on both ends of the channel.
Frankly, this is a big step backwards from IPLD’s point of view. Link being a top level type in the Data Model unlocked a lot of stuff for us and it gave us a clear path for how to implement codecs across various formats and identify the optimal path for Data Model support in each language.
I am not suggest to not make Link
a non to level type, I'm just suggesting finding a representation which can cross thread boundaries without loosing it's type information.
I would assume that these worker processes aren’t just responding to every random message anyone decides to post. The API will actually be something along the lines of
It is not however some message payloads do not have predefined structures (as they come from user space) and there for could contain CID
s that come out as non CID
s on the other end. Which implies following (quoting from problem statement):
- Sender needs to deep-copy arbitrary data structures to swap CIDs with some encoded version.
- Which browser will structure clone (another copy).
- Receiver needs to traverse inputs to find endoced CIDs and swap those with decoded instances (I think copying can be avoided here)
Motivation here is
CID
s come out on the other end (after they are structure cloned) such that receiver can recognize those. Which would remove need for step 1.Getting both would be ideal, but getting even 1 would be good.
So here is a more concrete example based on input I got here and discussion with @mikeal on a different sync channel.
I understand that having a canonical top level type is important for IPLD team and using shapes to recognize types feels like a step backwards. However I think there are ways to achieve a representation that can't be accidentally be mistaken for something else e.g.:
class CID extends Blob {
constructor(...args) {
const {version, code, multihash, buffer } = parseCID(...args)
super([ buffer ], { type: "application/cid" })
this.version = version
this.code = code
this.multihash = multihash
this.buffer = buffer
}
// all the existing methods go here
static match(input) {
if (input instanceof CID) {
return input
} else if (input instanceof Blob && input.type === "application/cid") {
return new CID(input)
}
}
}
port.postMessage({ link: new CID(...) })
// comes out on the other end as `blob` with type `'application/cid'`
// Only thing codecs would have to change is instead of doing `CID.isCID(input)` switch to
const cid = CID.match(input)
if (cid) {
// ...
}
I would like to also note that this is not just about passing things around worker threads. It's bit more general web platform thing, e.g storing things in IndexedDB has exact same constraints as in it goes through structured cloning. I believe that picking a representation that is compatible with structured cloning constraints is going to be win long term.
Just as with node Buffer
s it provides some convenience, but it is also comes with very similar trade-offs that motivates leaving it behind.
One thing that came out in conversation with @mikeal
because it means that you can’t represent data that wanted to use
{'/':someOtherMeaninfulString}
Acyclic nature of DAG could be exploited to represent links that otherwise could not be part of the DAG - Represent it with a cycle. e.g. cid.cid === cid
.
That way there is no chance that { link: cid }
could mean anything else if (link.cid.cid == link.cid)
as it would be invalid DAG node. Cycles are also compatible with structured clone algorithm so the assertion would hold across thread boundaries as well.
@Gozala can you explain this last comment a bit more? I don't understand how this is relevant to the discussion here.
fwiw I really like the idea of subclassing Uint8Array but see the current isCID()
as a very big problem, and the usage is very wide (and beyond our control) that changing the semantics of it could be very disruptive for users. But maybe there's a bridge that offers a migration path while maintaining the current style for the cases where it's used now.
The new js-multiformats work has messed up isCID()
already though in its current format least, to the point where it's impractical to mix code that uses both cids.CID
and multiformats.CID
. Either that needs to be addressed or we just accept that js-multiformats is a totally new thing and use that as an opportunity to do things differently.
@Gozala can you explain this last comment a bit more? I don't understand how this is relevant to the discussion here.
So as I understand main concern with not relying on cid
been instance of CID
was that there is no reliable way to tell if certain structure is CID or just a structure that resembles a shape. Example of dag-json
encoding links as {'/':someOtherMeaninfulString}
was used, where it may be an actual JSON structure instead a link.
Which lead me to idea that we could take advantage of asyclic nature of DAG to represent links (CIDs) in a (cyclic) shape that otherwise would be invalid.
fwiw I really like the idea of subclassing Uint8Array but see the current
isCID()
as a very big problem, and the usage is very wide (and beyond our control) that changing the semantics of it could be very disruptive for users. But maybe there's a bridge that offers a migration path while maintaining the current style for the cases where it's used now.
The way I see it CID.isCID
could be a more conservative check than cid = CID.asCID(input)
. I would say it's equivalent to how bunch of libraries today check for Buffer.isBuffer(v)
that I have being gradually changing to if (v instanceof Uint8Array) v = Buffer.from(v.buffer, v.byteOffset, v.byteLength)
. Same could be done to migrate from CID.asCID(input)
to cid = CID.asCID(v)
, which would not break existing usage, but would make CID
s more usable in new use cases.
The new js-multiformats work has messed up isCID() already though in its current format least, to the point where it's impractical to mix code that uses both cids.CID and multiformats.CID. Either that needs to be addressed or we just accept that js-multiformats is a totally new thing and use that as an opportunity to do things differently.
I feel like I don't have enough context here.
Few note here:
CID.isCID
is flawed pattern, which I've attempted to outline and propose alternative in #111, which may also address the issue I have being having here.undefined
is not valid JSON, but is structure clone-able, in other words something along these lines isThisIPLDCIDInstance in cid && cid.isThisIPLDCIDInstance === undefined
would be both collision free tag.
In the context https://github.com/ipfs/js-ipfs/issues/3022 I'm running into problem with
CID
s that is not too dissimilar from ones that come fromBuffer
s. I would like to bring those up here for a discussion.Problem
When you post things across browsing contexts data browsers perform structured cloning which implies that only handful of things will come out with the same prototype chain on the other end. Including table from MDN below:
Supported types
lastIndex
is not preserved.Blob
File
FileList
ImageBitmap
ImageData
This implies that node
Buffer
s come out asUint8Arrays
(as they are sub-classes) andCID
s come out as plain objects, that is following most of the time:Which is problematic because e.g. dag-cbor encoder will not recognize such structures as
CID
and would encode it as JSON. In other words only way for things to work across message-channel we need either to:DeepCopy+Encode -> Structure Clone -> Traverse+Decode
Structure Clone -> Traverse+Decode
Neither of two is ideal. What would be much better if
CID
had a canonical representation that is not tied to specific class. Kind of howArrayBuffer
represents bytes andUint8Array
is just a view.I am recognizing that this might be almost impossible change to make at this point, however it is still worse having a discussion I think. Furthermore there might be certain compromises that may provide workable compromise. E.g. if all the IPFS internals did recognized canonical representation which is not tied to CID class that would remove need for decode / encode at least on messages going from client to server. I can also imagine that new
Block
API might be made to recognize that canonical representation which would effectively remove need for arbitrary traversals and limit CID encode / decode to very specific API endpoints./cc @mikeal @vmx @achingbrain @hugomrdias @lidel