json-ld / json-ld.org

JSON for Linked Data's documentation and playground site
https://json-ld.org/

JSON-LD model should be abstracted to allow for binary JSON and Yaml #463

Closed gkellogg closed 7 years ago

gkellogg commented 7 years ago

For example, people seem to be using CBOR as a serialization, but going through JSON to be able to get the JSON-LD semantics. Others have discussed YAML-LD.

Some work has already been put into the algorithms to use WebIDL datatypes instead of JSON to describe processing. A section on mapping from other serializations to this model would allow JSON-LD to be used in other contexts.
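For concreteness, a rough sketch (assuming the Python cbor2 and PyLD libraries; any JSON-LD processor that accepts an in-memory map would do) of what such an abstraction enables: the CBOR payload is decoded straight into generic maps and lists and handed to the JSON-LD algorithms, without materializing JSON text in between.

```python
# Sketch only: cbor2 and PyLD are assumed. The point is that JSON text never appears.
import cbor2
from pyld import jsonld

cbor_bytes = cbor2.dumps({
    "@context": {"name": "http://schema.org/name"},
    "name": "Alice",
})

doc = cbor2.loads(cbor_bytes)   # generic dicts/lists, i.e. the abstract model
expanded = jsonld.expand(doc)   # JSON-LD expansion applied to that model
# -> [{'http://schema.org/name': [{'@value': 'Alice'}]}]
```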

msporny commented 7 years ago

I've had multiple CBOR-LD discussions with @jbenet and we're both fans of the concept. There is a good bit to learn from IPLD as well. The @digitalbazaar folks have been kicking around the idea of adding YAML-LD and XML-LD output to the JSON-LD playground. Those two are fairly trivial to do... just haven't had the spare cycles to do it.

I'm a bit concerned about using WebIDL in the JSON-LD algorithms as WebIDL is fairly specific to the browser world and difficult to learn/parse (for non-W3C people). I'd prefer that we keep the algorithms as generic prose, but understand that there are different viewpoints on this.

So, in summary: +1 to expand JSON-LD's data model to make sure we support things like full round-tripping to/from CBOR-LD, IPLD, YAML-LD, and XML-LD.

/cc @dlongley @davidlehn @lanthaler

gkellogg commented 7 years ago

The only bit from WebIDL that is used is the dictionary concept, as something more abstract than JSON Object. In Ruby, it's a Hash.

> JSON object: In the JSON serialization, an object structure is represented as a pair of curly brackets surrounding zero or more key-value pairs. A key is a string. A single colon comes after each key, separating the key from the value. A single comma separates a value from a following key. In JSON-LD the keys in an object must be unique. In the abstract representation a JSON object is equivalent to a dictionary (see [WebIDL]).

Interestingly, WebIDL does not define array.

I'm certainly open to other suggestions.
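To illustrate the abstraction (a minimal sketch in Python; the language is incidental): the JSON serialization is text, while the thing the algorithms actually operate on is a dictionary/map with unique string keys.

```python
import json

json_text = '{"@id": "http://example.org/alice", "name": "Alice"}'
abstract = json.loads(json_text)   # a dict: the "dictionary" of WebIDL, a Hash in Ruby
assert isinstance(abstract, dict)
assert json.loads(json.dumps(abstract)) == abstract   # serializing and re-parsing round-trips
```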

lanthaler commented 7 years ago

I definitely see value in a high-performance, concise, binary representation for Linked Data. However, I don't feel JSON-LD would be a good base for that. We have lots of complexity to allow people to make their representation look the way they want. If the representation is binary, none of that is needed. The only thing that counts in such a format is efficiency and performance. All the rest should be an API on top of the format, not embedded into the format.

> The @digitalbazaar folks have been kicking around the idea of adding YAML-LD and XML-LD output to the JSON-LD playground. Those two are fairly trivial to do... just haven't had the spare cycles to do it.

Please don't. The plethora of RDF syntaxes that already exist are more than enough. They already cause enough confusion and fragmentation.

gkellogg commented 7 years ago

It's hard to see a good argument against having the algorithms work against an abstract model rather than a specific syntax. In some cases the wording is odd: "Create a new empty JSONObject" describes a serialization, not something you would expect to be updated. A dictionary/hash is really the thing that is created, and eventually serialized as a JSON object.

Binary JSON representations, such as CBOR, are also actually used. While I don't think we need to discuss these in the API document, maintaining a layer of abstraction allows us to get out of the way of such interpretations.

A similar issue is one we faced in CSV on the Web, where the source is often not actually a CSV. Using a layer of abstraction improved this.

An abstract syntax is one of the things I think RDF got right. Working at the layer of syntax is one of the problems with things like JSON Schema, IMHO.

In any case, do you think the current use of dictionary in the algorithms presents a problem? It would be silly for people needing an efficient representation for RDF to find themselves creating something new, when JSON-LD so adequately handles their use cases for everything other than surface syntax. Personally, I'm not interested in YAML or XML.

lanthaler commented 7 years ago

> It's hard to see a good argument against having the algorithms work against an abstract model rather than a specific syntax. In some cases the wording is odd: "Create a new empty JSONObject" describes a serialization, not something you would expect to be updated. A dictionary/hash is really the thing that is created, and eventually serialized as a JSON object.

Sure, using more generic terminology like dictionary or map instead of JSONObject is fine with me. As that really only changes a few terms, I don't think it will help much with the underlying goal of an efficient binary representation. I didn't do any benchmarks, but I guess parsing is several orders of magnitude faster than the JSON-LD processing. So, unless we change the processing (which likely means eliminating most of it), I doubt a binary representation would be noticeably faster.
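To make that guess checkable, a rough sketch (assuming PyLD) of the comparison being made: time spent parsing JSON text versus time spent in JSON-LD expansion. The actual ratio will depend heavily on the document and the processor.

```python
import json
import timeit

from pyld import jsonld

doc_text = '{"@context": {"name": "http://schema.org/name"}, "name": "Alice"}'
doc = json.loads(doc_text)

parse_time = timeit.timeit(lambda: json.loads(doc_text), number=1000)
expand_time = timeit.timeit(lambda: jsonld.expand(doc), number=1000)
print(f"parse: {parse_time:.4f}s  expand: {expand_time:.4f}s")
```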

gkellogg commented 7 years ago

The goal of a binary representation isn't necessarily to be faster, but to have a more compact representation over the wire. It seems to be something that the WoT folks are interested in; I just think that JSON-LD shouldn't get in the way of doing this.

Indeed, JSON-LD algorithms impose some requirements for sorting that make streaming impossible (some of which are unnecessary to achieve results in the test suite).

nichtich commented 7 years ago

A digital document encoded in CBOR, YAML, BSON, Smile, etc. (there are more; I have collected such data structuring languages for my PhD) can be converted to JSON (or internally mapped to JSON) before processing it as JSON-LD - so what? Additional encodings of JSON are out of the scope of the JSON-LD specification. In the same way, it is irrelevant whether a JSON-LD document is read from a file system, a USB drive, or a virtual machine, isn't it?

jbenet commented 7 years ago

My goal would be to have something like CBOR-LD and stay in CBOR-LD, never having to get to JSON. This keeps "the JSON-LD idea" (not "JSON"-LD per se, but tree-based, simple, pragmatic LD). Never hitting JSON is also a requirement for many perf-heavy applications. When processing things quickly (<= 1µs), parsing is just out of the question. It also helps a lot with crypto applications, where things need to be consistently in one buffer to hash or sign. For me, parsing and stringifying all over the place does not pass the red-face test.


On the data model, we've learned a lot about mapping data models to each other. The requirement we have in IPLD land is that things need to be able to round-trip in both directions between formats and yield exactly the same bytes. Effectively:

fmt2-from-fmt1( fmt1-from-fmt2(a2) ) == a2
fmt3-from-fmt1( fmt1-from-fmt3(a3) ) == a3
fmt3-from-fmt2( fmt2-from-fmt3(a3) ) == a3
...

This means having a good mapping between the serialization formats. This is in theory annoying, but in practice easy. It does cause us some annoyances when ingesting things that don't cleanly fit, or causes us to do things like add a couple of missing types in-band in JSON (there are formats for this already, like EJSON).
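A minimal sketch of that round-trip requirement (assuming cbor2, and using canonical/sorted encodings on both sides so byte-for-byte equality can hold):

```python
import json
import cbor2

def json_from_cbor(b: bytes) -> str:
    # deterministic JSON: sorted keys, no extra whitespace
    return json.dumps(cbor2.loads(b), sort_keys=True, separators=(",", ":"))

def cbor_from_json(s: str) -> bytes:
    # canonical CBOR: sorted keys, minimal-length encodings
    return cbor2.dumps(json.loads(s), canonical=True)

a1 = cbor2.dumps({"a": 1, "b": [1, 2, 3]}, canonical=True)
assert cbor_from_json(json_from_cbor(a1)) == a1   # fmt1-from-fmt2(fmt2-from-fmt1(a1)) == a1
```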

Honestly, I wish there was a nice, fundamental format subset (JSON isn't even it... given it has no proper ints), but there isn't. The world is very messy. It has pushed me in the direction of self-describing schemas/translators. But that's another story.

gkellogg commented 7 years ago

Closed via #485.

mixis commented 6 years ago

I am not sure that #485 adequately solves this issue. As far as I can see, the internal representation still does not provide basic data types like byte strings or timestamps. As @jbenet wrote above, I'd rather avoid mandatory string conversions for cryptographic applications. Decoding (and validating) ISO 8601 datetime strings is very expensive in comparison to a simple integer comparison.
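A small illustration of that cost difference (Python; the epoch values are hand-picked to match the example strings):

```python
from datetime import datetime

a = "2018-10-01T12:00:00+00:00"
b = "2018-10-01T13:00:00+00:00"
# string form: both values must be parsed and validated before comparing
earlier_str = datetime.fromisoformat(a) < datetime.fromisoformat(b)

# native form (e.g. CBOR tag 1, epoch seconds): a single integer comparison
a_epoch, b_epoch = 1538395200, 1538398800
earlier_int = a_epoch < b_epoch

assert earlier_str == earlier_int
```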

On a side note, I have a hard time following derived specifications on how to actually implement them. For example, https://w3c.github.io/vc-data-model/#proofs-e-g-signatures

How can I compute signatureValue? BaseXX decoding, internal JSON representation, canonicalization algorithm, signature suite, BaseYY encoding? It all seems mind-bogglingly complex when all that is needed is a byte range (or ranges, if one does not want to include the framing) and an off-the-shelf signature algorithm.
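For contrast, a sketch of the simpler path described here (assuming the Python `cryptography` package): sign the exact byte range with an off-the-shelf algorithm, with no canonicalization or re-encoding in between. This is not how the VC data model specifies proofs; it only illustrates the alternative being argued for.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

payload = b'{"claim": "example"}'              # the byte range to protect, as-is
key = Ed25519PrivateKey.generate()
signature = key.sign(payload)                  # off-the-shelf Ed25519 over raw bytes
key.public_key().verify(signature, payload)    # raises InvalidSignature if tampered
```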

Another example: https://w3c-ccg.github.io/did-spec/#public-keys

I posit that this zoo of publicKey formats results from a lack of native datatypes. Note that publicKeyPem introduces yet another format to support.

IMHO, CBOR appears to fit a wide variety of use cases, and the standardized conversion to and from JSON should offer a reasonable migration path. On a cursory reading, I quite like how COSE (https://tools.ietf.org/html/rfc8152) is set up. There's a way to extract a machine-readable data description language from the document, there's a registry at https://www.iana.org/assignments/cose/cose.xhtml, and also one for CBOR itself at https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml

I hope that I have not grossly misunderstood this issue and the related fix and that this issue will be reconsidered.

gkellogg commented 6 years ago

@mixis it would be great to keep the issues separate between JSON-LD, DID, and Verifiable Claims, as they really touch on different things.

While the abstraction of JSON-LD to a format consistent with YAML or CBOR opens the door to actual specifications, it does not describe format-specific literal types such as byte strings or timestamps, which would be outside the scope of this spec. Specifications or Notes for YAML-LD, CBOR-LD or whatever would be free to provide such extensions, although there are likely some round-tripping issues to consider.

While this group could publish notes for alternative specs, it would be great to have a format champion involved.

cc/ @msporny @dlongley

mixis commented 6 years ago

@gkellogg I did not mean to weave issues with other projects into this issue, just illustrate that difficulties may arise downstream.

If I understand you correctly, basing JSON-LD on the CBOR types and their conversion to/from JSON is not going to happen. If other users of JSON-LD see value in using something like CBOR-LD, how should that be done? A new document that copies most of JSON-LD, like the relationship to RDF, or a document that refers to JSON-LD and defines the differences? I imagine the latter approach to be somewhat hard to read, but the former may diverge more.

gkellogg commented 6 years ago

CBOR may be a bit more challenging, as it's a binary format, but it would be good if there were some way to show examples. Note that in the WG repo, we automatically create YAML versions of all the examples.

After discussing the mapping from CBOR to the internal representation, it's likely that the only algorithms that need to be changed are Value Expansion and Value Compaction, but there are likely a couple of other places that need to be touched. So, a spec which describes the transformation and updates to the specific algorithms would be the way to go. Otherwise, the remaining algorithms should need little change. The spec would also need to extend the syntax document to describe CBOR-specific native values, with examples.
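Purely as a hypothetical sketch of the kind of Value Expansion hook such a spec might define (the choice of xsd:base64Binary and the base64 text form are illustrative assumptions, not anything specified):

```python
import base64

XSD_BASE64 = "http://www.w3.org/2001/XMLSchema#base64Binary"

def expand_native_value(value):
    # hypothetical: map a CBOR byte string to an expanded value object
    if isinstance(value, bytes):
        return {"@value": base64.b64encode(value).decode("ascii"),
                "@type": XSD_BASE64}
    return {"@value": value}

print(expand_native_value(b"\x01\x02\x03"))   # {'@value': 'AQID', '@type': '...#base64Binary'}
```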

If the base specs should be changed to better allow such extension, that shouldn’t be a problem.

To be published by the JSON-LD WG as a Note, the authors really should be part of the WG, and you'd be welcome. Alternatively, the CG could publish this, but it will have more prominence as a WG publication.

jonnycrunch commented 6 years ago

I'm in agreement with @mixis regarding COSE, and his example of the DID public keys just highlights the limitations of JSON-LD for the DID spec. I now understand the point @jbenet made to @ChristopherA and me at the DWeb Summit that we should be signing the CBOR of the DID document, not the JSON. COSE would be handy, as we need a deterministic approach, and it could be the building block of the MultiKey approach. I have been working on IPLD (which uses dag-cbor) for representation of the DID document and made the case at the recent RWoT meeting for why it is a more secure representation. Here is the working draft of the paper: https://github.com/jonnycrunch/rwot7/blob/master/draft-documents/ipld_did_documents.md (feedback is welcome).