IPLD Data Model - Githubissues

davidar commented 9 years ago

So, reading #4, there seem to be a few separate issues:

what gets encoded into the wire format
how that gets represented in a human-readable format (JSON, YAML, etc)
how IPLD maps to language-specific datatypes

A major deficiency in JSON is its lack of (user-defined) datatypes. Several workarounds to this issue have been proposed, by reserving a special key in each JSON object:

_type: https://www.npmjs.com/package/typed-json
__proto__: http://tobyho.com/2009/10/02/typed-deserialization-with/

It looks like the @context key proposed in #4 is trying to achieve the same thing.

Whilst this is a reasonable solution for encoding into JSON, I don't think it should be a fundamental part of IPLD, as other representations actually have proper support for representing type information:

CBOR tags: https://tools.ietf.org/html/rfc7049#section-2.4
YAML user-defined datatypes: https://en.m.wikipedia.org/wiki/YAML#Data_types

It would be nice if these features could be supported by IPLD.

So, on the wire, we could have CBOR-tagged data. Encoding this into JSON would give something like:

{"@some_reserved_key":"identifier for Person type",
 "name":"David"}

or in YAML:

!Person { name: David }

and mapping into native (say, JS) datatypes:

class Person {
  constructor(name) {
    this.name = name;
  }
}
object = ipld.decode(data, {'person type': Person})
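
A minimal sketch of how such a decode step might look once the data has been parsed, assuming the type identifier travels under the reserved key from the JSON example above. `TYPE_KEY`, `decodeTyped` and the registry argument are hypothetical names for illustration, not an existing ipld API.

```javascript
// Hypothetical reserved key carrying the type identifier (an assumption,
// mirroring "@some_reserved_key" in the JSON example above).
const TYPE_KEY = "@some_reserved_key";

class Person {
  constructor(name) {
    this.name = name;
  }
}

// Look up the type identifier in a caller-supplied registry and return an
// instance of the registered class. Objects with a missing or unknown type
// identifier are returned unchanged, so untyped data still decodes.
function decodeTyped(obj, registry) {
  const { [TYPE_KEY]: typeId, ...fields } = obj;
  if (typeof typeId !== "string" || !Object.prototype.hasOwnProperty.call(registry, typeId)) {
    return obj;
  }
  // Copy the fields onto the class prototype without calling the constructor,
  // because the decoder cannot know the constructor's argument order.
  return Object.assign(Object.create(registry[typeId].prototype), fields);
}

const decoded = decodeTyped(
  { "@some_reserved_key": "person type", name: "David" },
  { "person type": Person }
);
console.log(decoded instanceof Person, decoded.name);
```

The fallback for unknown types keeps the simple case simple: plain dictionaries pass through untouched.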

CC: @jbenet

mildred commented 9 years ago

The type information in CBOR is only meant for generic, universally understandable types. It allows tagging binary strings as bigints, or integers as timestamps. It doesn't allow arbitrary tagging of objects for application-specific purposes like Person.

Note: there is little room to add new tags in CBOR. The number of tags is limited to a few code points, and additional codes should be registered with IANA.

This is exactly why the Linked Data directives were created: to link object semantics (and not really typing) to the actual data. Semantics are defined by a type URI, like XML namespaces. The URI is the unique identifier that tells how to interpret the data.

JSON-LD permits more than CBOR tags can (as for YAML, I don't know enough). The only problem with JSON-LD is that not everything can be encoded with it. Most JSON documents can be updated to fit the LD format, but this does not always come for free.

This is why we are trying to construct a JSON dialect that can encode both arbitrary JSON (without LD information) and JSON-LD, while allowing us to add meta information of our own on the data.

davidar commented 9 years ago

Uh, the cbor spec says tags 256 to 18446744073709551615 are available for registration, so it's not that limited. In any case you only need one tag to specify (schema uri, data) pairs, which I doubt would be too hard to get registered.
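
For reference, a small sketch of how a tag number is laid out on the wire according to RFC 7049 (major type 6, followed by the tagged item). `encodeTagHead` is a hypothetical helper written for this illustration, and it only produces the tag head, not the (schema uri, data) array that would follow it.

```javascript
// Encode the head of a CBOR tag (major type 6) as described in RFC 7049.
// The tagged data item itself would follow these bytes.
function encodeTagHead(tag) {
  const n = BigInt(tag);
  if (n < 0n) throw new RangeError("tag numbers are unsigned");
  const major = 6 << 5; // 0xc0
  // Tag numbers below 24 fit into the initial byte.
  if (n < 24n) return Uint8Array.of(major | Number(n));
  // Otherwise use 1, 2, 4 or 8 following bytes (additional info 24 to 27).
  const sizes = [[1, 24], [2, 25], [4, 26], [8, 27]];
  for (const [len, info] of sizes) {
    if (n < (1n << BigInt(8 * len))) {
      const out = new Uint8Array(1 + len);
      out[0] = major | info;
      // Write the tag number big-endian after the initial byte.
      for (let i = 0; i < len; i++) {
        out[len - i] = Number((n >> BigInt(8 * i)) & 0xffn);
      }
      return out;
    }
  }
  throw new RangeError("tag number does not fit in 64 bits");
}

console.log(encodeTagHead(1));   // one byte: 0xc1
console.log(encodeTagHead(256)); // three bytes: 0xd9 0x01 0x00
```

Even the largest tag number costs nine bytes of head, so a single registered tag wrapping a (schema uri, data) pair would add little overhead beyond the URI itself.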

All I'm saying is I don't think ipld should be hobbled by json's deficiencies (lack of typing information or metadata in general), given that json is only one of several possible human readable representation formats, and json isn't even involved if you're decoding wire format directly into native data structures.

If you want the full extensibility of jsonld, then you're free to use that on top of ipld. However, my personal opinion is that ipld at its core should only require the necessary features for it to be able to function properly, agnostic to any particular surface encoding or programming language data model (to the extent this is possible). Otherwise it's not future proof (first it was XML, now json, ...).

davidar commented 9 years ago

Another quote from the spec:

A secondary purpose is to allow optional tagging when the decoder is a generic CBOR decoder that might be able to benefit from hints about the content of items. Understanding the semantic tags is optional for a decoder; it can just jump over the initial bytes of the tag and interpret the tagged data item itself.

Which I think fits with what we're doing.

mildred commented 9 years ago

Uh, the cbor spec says tags 256 to 18446744073709551615 are available for registration, so it's not that limited. In any case you only need one tag to specify (schema uri, data) pairs, which I doubt would be too hard to get registered.

Yes, that's not so limited indeed (I didn't know there were so many available ids). And as you say, you can't just use one id per schema. These are just raw integers. You actually need to embed the data in an array with the first item being the schema identifier.

Do you know of a CBOR extension that allows that specifically? If so, that would be better than using JSON-LD in the first place. And we could output JSON-LD using a conversion step.

I think our data model should allow the Linked Data model as a first class citizen. This model is quite universal, is used in many places, and there is an extensive vocabulary defined for it.

If you look in what I did already on IPLD, I think most of it could be implemented as tagging: https://github.com/ipfs/go-ipld/pull/7 (instead of going to the lengths of escaping the @ character in JSON keys to allow adding arbitrary directives)

davidar commented 9 years ago

Do you know of a CBOR extension that allows that specifically? If so, that would be better than using JSON-LD in the first place. And we could output JSON-LD using a conversion step.

Not off the top of my head, but I can certainly investigate.

I think our data model should allow the Linked Data model as a first class citizen.

What about separating IPLD objects into metadata and data components (so you don't need to reserve special keys in the data), and you could use the metadata part for storing LD directives, etc?

If you look in what I did already on IPLD, I think most of it could be implemented as tagging: #7 (instead of going to the lengths of escaping the @ character in JSON keys to allow adding arbitrary directives)

Sounds good to me :)

By the way, I'm not trying to change any of the JSON(-LD) stuff you've been working on, I'm just trying to make sure the core IPLD data model is as general as possible. I guess we're both coming at it with different use cases in mind (I'm interested in transparently persisting native datastructures to ipfs).

mildred commented 9 years ago

What about separating IPLD objects into metadata and data components (so you don't need to reserve special keys in the data), and you could use the metadata part for storing LD directives, etc?

By the way, I'm not trying to change any of the JSON(-LD) stuff you've been working on, I'm just trying to make sure the core IPLD data model is as general as possible. I guess we're both coming at it with different use cases in mind (I'm interested in transparently persisting native datastructures to ipfs).

That's what I understood we wanted (and why JSON-LD was good but not quite what we wanted). This is why we have so far ended up with JSON key escaping (with \ everywhere, yay). But if we could implement this more efficiently when using a smarter serialization format, I'm all for it.

davidar commented 9 years ago

@mildred I can't see any existing CBOR extensions for dealing with this, so I think the best approach would be to register some tag(s) with IANA.

@jbenet Thoughts? I know you don't want any CBOR magic, but I think that making this separation clear in the wire format is a much nicer solution than reserved/escaped keys within the data itself. This would make it easier to map to encodings like YAML that do have extra support for (some) object metadata. It's also fully compatible with JSON, we would just move the reserved/escaped keys into the JSON encoder/decoder rather than the wire format itself.
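
A rough sketch of what moving the reserved/escaped keys into the JSON encoder/decoder could mean, assuming a node is held in memory as separate metadata and data parts. The `{ meta, data }` shape and the function names are assumptions made for this example, and only the top level of the object is handled.

```javascript
// Escape every "@" in a data key as "\@", the scheme discussed in PR #7.
const escapeKey = (key) => key.replace(/@/g, "\\@");
const unescapeKey = (key) => key.replace(/\\@/g, "@");

// Fold a hypothetical { meta, data } node into one flat JSON-ready object:
// metadata entries become "@name" keys, data keys are escaped.
function toJsonObject({ meta, data }) {
  const out = {};
  for (const [k, v] of Object.entries(meta)) out["@" + k] = v;
  for (const [k, v] of Object.entries(data)) out[escapeKey(k)] = v;
  return out;
}

// Inverse: keys starting with an unescaped "@" are metadata, the rest is data.
function fromJsonObject(obj) {
  const node = { meta: {}, data: {} };
  for (const [k, v] of Object.entries(obj)) {
    if (k.startsWith("@")) node.meta[k.slice(1)] = v;
    else node.data[unescapeKey(k)] = v;
  }
  return node;
}

// "unixfsdir" and the keys below are placeholder values for the example.
const node = { meta: { type: "unixfsdir" }, data: { "file@.txt": 1, "@home": 2 } };
const flat = toJsonObject(node);
console.log(JSON.stringify(flat));
console.log(JSON.stringify(fromJsonObject(flat)));
```

With this split, the escaping exists only in the JSON representation. An encoder for a format with native tags could carry `meta` in the tag and leave the data keys alone.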

jbenet commented 9 years ago

Having a data model that is trivially expressible in any format is key. I.e. i should be able to take any object and go into JSON, CBOR, XML, and so on, one-to-one (i.e. roundtrip).

The reason for the JSON data model is that JSON is super, super easy to work with and it's used all over. (i understand this was the same for XML once.) i want to make it extremely easy to use IPFS, particularly for web devs. And embedding more complex representations -- i.e. typed -- onto json can be done, and is something we could solve with JSON-LD, JSON-Schema, and so on.

One thing to put you at ease is that multicodec allows us to upgrade the protocol to a new format some day, just as we're upgrading from the first protobuf fmt.

Another thing to keep in mind is that this is blocking IPNS improvements and a number of other things, so we decided to move fwd with JSON data model (which is a very safe choice) to make fwd progress. We can upgrade to add typing within that model, so it's likely fine even for wanting typed things. (If i'm not seeing what you mean though please give more examples?)

davidar commented 9 years ago

@jbenet Sure, something is better than nothing, and we can always fix it later. However, I think a small change to make the wire format a little less ugly would be easier to do now than later once entrenched.

So, my reading of one of your comments is that IPLD would require a @context key to be used for every IPLD object? And the purpose of this @context key is essentially to supply typing information (or schema if you like, but that's just arguing over words)? All the other metadata is optional, and can just be done on top of IPLD at the application level?

If I'm wrong and @context is optional, and not required for IPLD to function, then I agree that it shouldn't be built into the IPLD data model, and can just be done on top of the JSON data model.

Otherwise, if required for IPLD to function, it seems like it would make sense to make @context explicit at the wire level, rather than trying to shove it into an arbitrary key in a dictionary (and then handling the ensuing naming conflicts).

One of the examples @mildred posted could look like the following YAML (for example):

ipld_object:
  !unixfsdir
  attrs:
    mode: 0775
  entries:
    some-dir:
      !unixfsdir
      attrs: ...
      entries:
        file@.txt:
          !unixfsfile
          attrs: ...
          content: !mlink "/ipfs/Qm..."

Decoding into JS object (for example), you'd supply appropriate constructors for unixfsdir, unixfsfile and mlink, and it would give you back objects as instances of the proper class (rather than just a bunch of unstructured dictionaries). You could default standard types like mlink to handlers for transparently traversing the merkledag.
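
A sketch of the decoding step described above, assuming the wire or YAML decoder hands back tagged nodes in some in-memory form. The `Tagged` wrapper, `revive`, and the three classes are hypothetical stand-ins, and the `attrs` values are placeholders because the YAML above elides them.

```javascript
// Hypothetical in-memory form of a tagged node, as a decoder might produce it.
class Tagged {
  constructor(tag, value) {
    this.tag = tag;
    this.value = value;
  }
}

class UnixfsDir {
  constructor({ attrs, entries }) { this.attrs = attrs; this.entries = entries; }
}
class UnixfsFile {
  constructor({ attrs, content }) { this.attrs = attrs; this.content = content; }
}
class MerkleLink {
  constructor(path) { this.path = path; }
}

// Recursively replace tagged nodes with instances of the supplied classes.
// Unknown tags fall back to the untagged value, so plain data still decodes.
function revive(node, constructors) {
  if (node instanceof Tagged) {
    const inner = revive(node.value, constructors);
    const known = Object.prototype.hasOwnProperty.call(constructors, node.tag);
    return known ? new constructors[node.tag](inner) : inner;
  }
  if (Array.isArray(node)) return node.map((n) => revive(n, constructors));
  if (node !== null && typeof node === "object") {
    const out = {};
    for (const [k, v] of Object.entries(node)) out[k] = revive(v, constructors);
    return out;
  }
  return node;
}

const tree = new Tagged("unixfsdir", {
  attrs: { mode: 0o775 },
  entries: {
    "file@.txt": new Tagged("unixfsfile", {
      attrs: {}, // placeholder, elided in the YAML above
      content: new Tagged("mlink", "/ipfs/Qm..."),
    }),
  },
});

const dir = revive(tree, { unixfsdir: UnixfsDir, unixfsfile: UnixfsFile, mlink: MerkleLink });
console.log(dir.entries["file@.txt"].content instanceof MerkleLink);
```

A default handler for `mlink` could be registered once by the library, so that application code only supplies its own types.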

Edit: I'm not trying to make it harder for webdevs --- what I'm suggesting seems easier than handing back unstructured dictionaries that they have to manually decode?

mildred commented 9 years ago

Otherwise, if required for IPLD to function, it seems like it would make sense to make @context explicit at the wire level, rather than trying to shove it into an arbitrary key in a dictionary (and then handling the ensuing naming conflicts).

I second this. Especially since the JSON-LD spec (where the idea of @context came from) specifically allows providing a context separately from the actual JSON file. They show an example using HTTP headers: http://www.w3.org/TR/json-ld/#interpreting-json-as-json-ld

@davidar The thing about adding type information in YAML is very nice (and you could associate the type name with the type definition in the context file). Unfortunately JSON doesn't support such type information, and adding a directive (@type for example) would still require escaping all the other JSON keys to avoid conflicts.

The other solution is to do as we do now: consider any object with an mlink property that is a string to be a merkle-link.
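
The convention in the previous paragraph can be sketched as follows. `isMerkleLink` and `collectLinks` are hypothetical helper names, and the paths in the example are placeholders rather than real hashes.

```javascript
// Treat any object with a string-valued "mlink" property as a merkle-link.
function isMerkleLink(value) {
  return (
    value !== null &&
    typeof value === "object" &&
    !Array.isArray(value) &&
    typeof value.mlink === "string"
  );
}

// Walk a decoded JSON value depth first and gather every link target.
// Link objects themselves are not searched for nested links.
function collectLinks(value, found = []) {
  if (isMerkleLink(value)) {
    found.push(value.mlink);
  } else if (value !== null && typeof value === "object") {
    for (const child of Object.values(value)) collectLinks(child, found);
  }
  return found;
}

const doc = {
  readme: { mlink: "/ipfs/QmA" },            // placeholder path
  files: [{ mlink: "/ipfs/QmB" }, { mlink: 5 }], // the last one is not a link
};
console.log(collectLinks(doc));
```

The rule needs no context and no escaping, at the price of reserving the `mlink` key name in every object.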

In any case, @jbenet, is it possible to make a decision on these points:

is @context required, or just a possibility given to the document author?
if so, what about having the context specified out of the JSON data structure, or do we want to keep it inside?
would we want to allow other directives (such as @type), in which case we must still escape JSON keys? or do we say that there are no other directives allowed (the context file can still contain some information) and we remove all the weird character escape mechanism?
would we still want to add other IPLD specific @ directives to the JSON (and thus still require to escape @ characters by \@ in JSON keys), or is @context the only IPLD specific directive (and we can get rid of the weird JSON key escaping, I'd actually be happy with that even though #7 would be dramatically simplified)

davidar commented 9 years ago

Unfortunately JSON doesn't support such type information, and adding a directive (@type for example) would still require escaping all the other JSON keys to avoid conflicts.

Yes, but that would be a property of the JSON encoder rather than IPLD itself

davidar commented 9 years ago

@jbenet also, what actually exists at /ipfs/<hash-of-mlink>/mlink?

Edit: also, I'm still a bit confused about the whole @context thing, and I suspect the hypothetical webdevs will be too

jbenet commented 9 years ago

Edit: I'm not trying to make it harder for webdevs --- what I'm suggesting seems easier than handing back unstructured dictionaries that they have to manually decode?

though i agree with you, most of the web dev community does not. this is why the many js class things continue to be unused, json serializing is all still raw objects, and protobuf is also unused.

what i do agree with them on is that "the simple case should be as simple as possible" -- eg. i shouldn't have to create classes or constructors or anything to serialize simply {"hello": "world"}

Otherwise, if required for IPLD to function, it seems like it would make sense to make @context explicit at the wire level, rather than trying to shove it into an arbitrary key in a dictionary (and then handling the ensuing naming conflicts).

I second this. Especially since the JSON-LD spec (where the idea of @context came from) specifically allows providing a context separately from the actual JSON file. They show an example using HTTP headers: http://www.w3.org/TR/json-ld/#interpreting-json-as-json-ld

this is not trivial to do with nice programmatic interfaces. I don't see how this would be exposed to the user nicely. I think escaping is easier to reason about.

please keep in mind that the reason the web is using json (and not rdf, and not xml, and not protobuf, and not ASN.1, and not XDR, and ... ) is the level of programmatic simplicity. this is paramount.

@davidar The thing about adding type information in YAML is very nice (and you could associate the type name with the type definition in the context file). Unfortunately JSON doesn't support such type information, and adding a directive (@type for example) would still require escaping all the other JSON keys to avoid conflicts.

i support types. it's why i was drawn to JSON-LD in the first place.

is there any cross-serialization-format typing definition? i.e. something that's the same in JSON, YML, and so on, and trivial re-coders would get right without any special work?

is @context required, or just a possibility given to the document author?

depends on how we want to handle mlinks.

i always want to allow user to input valid json without an @context.
if we require it, we would add it.
talking with @diasdavid we figured we could get mlink things working first, and go from there, since our mlink thing is doable with a context
i'm leaning towards not required, but not sure.

would we want to allow other directives (such as @type), in which case we must still escape JSON keys? or do we say that there are no other directives allowed (the context file can still contain some information) and we remove all the weird character escape mechanism?

if @context is a thing, others will be too.

would we still want to add other IPLD specific @ directives to the JSON (and thus still require to escape @ characters by \@ in JSON keys), or is @context the only IPLD specific directive (and we can get rid of the weird JSON key escaping, I'd actually be happy with that even though #7 would be dramatically simplified)

i think if we go the escape route, we should escape all @keys, and give the user two functions, one that accepts input and escapes it, and one that takes input as is (so users can do escaping themselves and manipulate @things etc.).
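
One way to picture the two entry points, as a sketch only. The names `nodeFromData` and `nodeFromRaw` are invented for this example and do not come from go-ipld.

```javascript
// Recursively escape "@" as "\@" in every key, so user data can never
// collide with directives. Values other than objects and arrays are untouched.
function escapeKeys(value) {
  if (Array.isArray(value)) return value.map(escapeKeys);
  if (value === null || typeof value !== "object") return value;
  const out = {};
  for (const [k, v] of Object.entries(value)) {
    out[k.replace(/@/g, "\\@")] = escapeKeys(v);
  }
  return out;
}

// Entry point 1 (hypothetical name): plain user input, escaped on the way in.
const nodeFromData = (input) => escapeKeys(input);

// Entry point 2 (hypothetical name): input the caller has already escaped,
// so keys such as "@context" pass through as directives.
const nodeFromRaw = (input) => input;

const asData = nodeFromData({ "@context": "just a key", nested: { "x@y": 1 } });
const asRaw = nodeFromRaw({ "@context": "a directive" });
console.log(Object.keys(asData), Object.keys(asRaw));
```

The same input therefore means two different things depending on which function receives it, which is the trade-off of the escape route.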

jbenet commented 9 years ago

@jbenet also, what actually exists at /ipfs/<hash-of-mlink>/mlink?

AFAIU, just a file like:

{
  "@context": {
    "mlink": "<uri-to-mlink>"
  }
}

i think. but <uri-to-mlink> would be content addressed so .... \o/. i asked about this in the #json-ld channel but was told there's no relative linking there, even to self... I said we would figure it out and propose a fix.

Edit: also, I'm still a bit confused about the whole @context thing, and I suspect the hypothetical webdevs will be too

Yeah... just handling mlink ourselves and ditching context altogether is certainly the simplest thing to do. sadface :/ :/

davidar commented 9 years ago

please keep in mind that the reason the web is using json (and not rdf, and not xml, and not protobuf, and not ASN.1, and not XDR, and ... ) is the level of programmatic simplicity. this is paramount.

Lots of webdevs are familiar with and use YAML though. I agree that APIs are usually JSON, but YAML is also quite common for data serialisation. I believe YAML to also have good library support, and am not aware of any concerns over programmatic simplicity.

i support types. it's why i was drawn to JSON-LD in the first place.

is there any cross-serialization-format typing definition? i.e. something that's the same in JSON, YML, and so on, and trivial re-coders would get right without any special work?

@jbenet YAML and CBOR both have a tagging system that can be used to specify types quite naturally. Some quotes from the YAML (1.2) spec:

section 3.1.1 Each YAML node requires, in addition to its kind and content, a tag specifying its data type. Type specifiers are either global URIs, or are local in scope to a single application. For example, an integer is represented in YAML with a scalar plus the global tag “tag:yaml.org,2002:int”. Similarly, an invoice object, particular to a given organization, could be represented as a mapping together with the local tag “!invoice”. This simple model can represent any data structure independent of programming language.

section 3.2.1.2 YAML represents type information of native data structures with a simple identifier, called a tag. Global tags are URIs and hence globally unique across all applications. The “tag:” URI scheme is recommended for all global YAML tags. In contrast, local tags are specific to a single application. Local tags start with “!”, are not URIs and are not expected to be globally unique. YAML provides a “TAG” directive to make tag notation less verbose; it also offers easy migration from local to global tags. To ensure this, local tags are restricted to the URI character set and use URI character escaping.

YAML does not mandate any special relationship between different tags that begin with the same substring. Tags ending with URI fragments (containing “#”) are no exception; tags that share the same base URI but differ in their fragment part are considered to be different, independent tags. By convention, fragments are used to identify different “variants” of a tag, while “/” is used to define nested tag “namespace” hierarchies. However, this is merely a convention, and each tag may employ its own rules. For example, Perl tags may use “::” to express namespace hierarchies, Java tags may use “.”, etc.

YAML tags are used to associate meta information with each node. In particular, each tag must specify the expected node kind (scalar, sequence, or mapping). Scalar tags must also provide a mechanism for converting formatted content to a canonical form for supporting equality testing. Furthermore, a tag may provide additional information such as the set of allowed content values for validation, a mechanism for tag resolution, or any other data that is applicable to all of the tag’s nodes.

JSON doesn't have native support, but there are a few existing conventions using reserved keys that I mentioned in OP (_type and __proto__). Of course you then have to deal with naming conflicts, but this should be quite simple to do in the en-/de-coder, and people are unlikely to be using keys like __proto__ for some other purpose anyway.

mildred commented 9 years ago

In PR #7 we came to an agreement that the @ character should be reserved. Normal JSON keys containing the @ character are escaped by replacing it with \@. For the moment, the plan is to use it for JSON-LD directives, but we can imagine adding some directives of our own.

@davidar what tagging would you like that isn't included in JSON-LD (the spec)? JSON-LD already permits specifying a type by adding a @context key. What's nice about it is that the context can be processed and the object type can be understood by the computer.

If you want something more, I suggest you imagine a scheme involving the special @ character (that will be escaped in IPLD) or the escape character \. Try to find something that wouldn't clash with JSON-LD directive names.

mildred commented 9 years ago

Also, JSON-LD has provisions for a @type key that specifies the type of a literal value (encoded in a string). This can be used for dates for example.

davidar commented 9 years ago

@mildred I don't have a problem with @directives, I just don't think they should be part of IPLD itself, but instead layered on top of IPLD.

@type key ... This can be used for dates for example.

And the CBOR spec recommends using tags 0 and 1 for this purpose. It seems kind of pointless using CBOR as the wire format if we aren't even going to follow the spec.

@jbenet A couple of other thoughts:

could we escape the literal @ as @@ instead of \@ ("\\@" in JSON)? That way, it would only introduce one special character instead of two. The way it is now, if I have a key named foo\@bar, I'd then have to escape it to foo\\@bar, which then becomes "foo\\\\@bar" in JSON, which is a little bit confusing compared to just "foo\\@@bar". [This is similar to the way SQL escapes quotes by writing two (single) quotes.]
renaming IPLD to something like IPS(tructured)D would probably make it less likely to be confused with RDF
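
The escaping comparison in the first point above can be checked with a few lines. Both helper names are made up for this sketch, and each function only escapes `@`, which is the rule the example in that point assumes.

```javascript
// Scheme A: escape "@" as "\@" (backslash becomes a second special character).
const escapeWithBackslash = (k) => k.replace(/@/g, "\\@");
// Scheme B: escape "@" by doubling it.
const escapeByDoubling = (k) => k.replace(/@/g, "@@");

const key = "foo\\@bar"; // JS literal for the key foo\@bar

// JSON.stringify adds JSON's own backslash escaping on top of each scheme.
console.log(JSON.stringify(escapeWithBackslash(key))); // "foo\\\\@bar"
console.log(JSON.stringify(escapeByDoubling(key)));    // "foo\\@@bar"
```

Scheme A stacks two layers of backslashes in the JSON text, while scheme B leaves the backslash count to JSON alone.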

mildred commented 9 years ago

And the CBOR spec recommends using tags 0 and 1 for this purpose. It seems kind of pointless using CBOR as the wire format if we aren't even going to follow the spec.

This could be a nice thing to use that would avoid escaping characters as we plan to do. Perhaps we can do as follows:

if we use CBOR backend, use the tag
if we serialize to JSON: escape the keys

@jbenet, what do you think if the keys to the Node are no longer strings but a struct { string, tag bool }. This way we don't escape strings in the data structure, only when serializing (depending on the value of tag).

could we escape the literal @ as @@ instead of \@ ("\\@" in JSON)? That way, it would only introduce one special character instead of two. The way it is now, if I have a key named foo\@bar, I'd then have to escape it to foo\\@bar, which then becomes "foo\\\\@bar" in JSON, which is a little bit confusing compared to just "foo\\@@bar". [This is similar to the way SQL escapes quotes by writing two (single) quotes.]

I would like that :-) But we ruled that out I think :-/ See discussion in PR #7

we can say we double the character to escape it (Ada style "I ""quote"" my words", I like it, it's simple).

This is common, but i've seen it confuse many people. So i've taught many programming classes in the past, so i've a sense of what trips people up, and having different escape characters across programs can be a very confusing thing. People can grasp double escapes much easier (though yes, it's a bit annoying too), than recalling which escape character is for which layered system.

davidar commented 9 years ago

This could be a nice thing to use that would avoid escaping characters as we plan to do. Perhaps we can do as follows:

if we use CBOR backend, use the tag

if we serialize to JSON: escape the keys

:+1:

I would like that :-) But we ruled that out I think :-/ See discussion in PR #7

Ah, fair enough. The discussion in #7 is quite long so I missed that point :)

jbenet commented 9 years ago

@mildred I don't have a problem with @directives, I just don't think they should be part of IPLD itself, but instead layered on top of IPLD.

Yeah i would appreciate the separation, but we need something for making sense of the { mlink: <ipfs-path> } construction, and that's at the native level.

It seems kind of pointless using CBOR as the wire format if we aren't even going to follow the spec.

We're not using CBOR for all the CBOR features, just like we're not providing XML encoding for all the XML features. Everything we write MUST have a 1:1 mapping to JSON. That is a strict requirement.

And the CBOR spec recommends using tags 0 and 1 for this purpose.

This could be a nice thing to use that would avoid escaping characters as we plan to do. Perhaps we can do as follows:

if we use CBOR backend, use the tag

if we serialize to JSON: escape the keys

This might be fine, as long as we can guarantee a straightforward 1:1 mapping between the CBOR rep and the JSON rep.

Again, i'm not against using features of the native formats. I'm against breaking any 1:1 mapping across the formats. IPLD is not about maximizing the utility of each native format, it's about creating something extremely easy to work with across platforms and across physical computers. Supporting the pure JSON model is one reason we moved away from protobuf (though we kept a 1:1 mapping for those objects).

@jbenet, what do you think if the keys to the Node are no longer strings but a struct { string, tag bool }. This way we don't escape strings in the data structure, only when serializing (depending on the value of tag).

No, keys like that are very, very annoying to work with. Instead, just make the datastructure have special functions defined that return things of interest. Or define a subpackage (in this repo) with lots of these nice utils.

renaming IPLD to something like IPS(tructured)D would probably make it less likely to be confused with RDF

Hmm possibly.

It is linked data, even if it isn't "Linked Data". One of my goals is making "linked data" much easier to work with, which might involve using IPLD as a stepping stone toward the RDF model (with the directives re-mapping arbitrary JSON structures to JSON-LD).

Also worth doing soon is a trivial layering of RDF and Turtle on top of IPLD. (i.e. using IPLD as a transport, probably a large array or something). People are already asking for this.

mildred commented 9 years ago

Yeah i would appreciate the separation, but we need something for making sense of the { mlink: <ipfs-path> } construction, and that's at the native level.

This is difficult, mostly because there are so many ways to do that using JSON-LD. We have a few solutions:

Implement a full JSON-LD processor in IPLD, and make sense of whatever is in the context to parse links
Require the links to be formatted in a JSON-LD compatible way, but do not allow full range of JSON-LD.

Now, there are two ways to implement this in JSON-LD:

consider the link to be a typed literal value. Those values are generally of the form { @type: <literal type>, @id: <literal value> }. We can arrange for the literal value to have the mlink key however.
- advantage: there are not so many different ways to do it. We can be compatible with many JSON-LD notations with little effort
- drawback: the @type is a full URL (Linked Data schema) and it takes a non zero amount of bytes. If you have many many links, it can take some space
Consider the links a structured value (as opposed to literal). Typically, it will look like {@context: <mlink context>, mlink: <hash>}.
- advantage: it's a structured value, we can put other data in it and understood by JSON-LD
- drawback: the @context, if repeated, can take some space in the serialized data as well

We can get away with not putting @type or @context in every link, but that implies having a more general context that has to be understood. That means implementing a full JSON-LD processor (or further restricting what is put in the context, but it starts to get complicated). But perhaps that's still the best option we have.

bar commented 9 years ago

@davidar please don't use my handle to document your code :p thank you!

whyrusleeping commented 9 years ago

@bar could you say hi to @foo for us sometime?

davidar commented 9 years ago

we need something for making sense of the { mlink: <ipfs-path> } construction, and that's at the native level.

I would suggest registering a cbor tag specifically for that purpose, which would minimise overhead at the wire level.

Everything we write MUST have a 1:1 mapping to JSON. That is a strict requirement.

I never suggested otherwise. CBOR tags can be mapped 1:1 to whatever json escaping scheme you want.

it's about creating something extremely easy to work with across platforms and across physical computers.

:+1: That's all I've been trying to do here. IMO cbor tags would be easier to work with than escaping keys at the wire level

It is linked data, even if it isn't "Linked Data".

Hence the confusion ;)

PS: @foo @bar @baz :p

bar commented 9 years ago

?

davidar commented 9 years ago

@bar sorry, couldn't help myself. For the record, I escaped your handle properly in my comment, it was @mildred who pinged you by quoting me ;)

eminence commented 8 years ago

Crosslinking with https://github.com/ipfs/specs/pull/37

davidar commented 8 years ago

I think this issue has been resolved elsewhere, so closing

ipld / go-ipld-deprecated

IPLD Data Model #8