json-ld / yaml-ld

CG specification for YAML-LD and UCR
https://json-ld.github.io/yaml-ld/spec

Serializing JSON or YAML literal in YAML-LD #36

Open gkellogg opened 2 years ago

gkellogg commented 2 years ago

The YAML examples in the JSON-LD 1.1 spec (e.g., https://github.com/w3c/json-ld-syntax/blob/main/yaml/JSON-Literal-compacted.yaml) do not preserve the JSON serialization of a JSON literal.

Example 062: JSON Literal-compacted
---
"@context":
  "@version": 1.1
  e:
    "@id": http://example.com/vocab/json
    "@type": "@json"
e:
- 56.0
- d: true
  '10': 
  '1': []

It should instead be the following:

Example 062: JSON Literal-compacted
---
"@context":
  "@version": 1.1
  e:
    "@id": http://example.com/vocab/json
    "@type": "@json"
e: [56.0,{"d":true,"10":null,"1":[]}]

But a simple YAML.dump of the parsed JSON does not take this into consideration. The spec should describe the requirements for serializing JSON literals in YAML-LD.

gkellogg commented 2 years ago

One stated goal is to be able to use something like YAML.dump on the parsed JSON/YAML, which will likely not allow control over how data is serialized in these cases. This should probably be at most a SHOULD requirement, and may be best left to an extended profile. Implementing it requires tagging the object that is the root of the JSON Literal and writing a custom emitter to serialize it as JSON. That is a significantly more involved serialization strategy, particularly given the need to interpret the in-scope local context to know whether a map entry value should be treated as a JSON Literal.

The YAML examples cited above are generated essentially by YAML.dump(JSON.load(src)), where there is no notion of a local context.
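To make the cost of that strategy concrete, here is a minimal sketch in Python of a context-aware emitter. It is not taken from any YAML-LD implementation: the names `json_literal_terms` and `emit` are hypothetical, only a single top-level context object is consulted (no scoped, remote, or array contexts), and keys and scalars are written with `json.dumps` because JSON strings and flow collections are also valid YAML.

```python
import json

def json_literal_terms(context):
    """Return the terms whose definition carries "@type": "@json".

    Assumption: only one top-level context object is inspected; scoped,
    remote, or array-valued contexts are ignored in this sketch.
    """
    return {
        term
        for term, definition in context.items()
        if isinstance(definition, dict) and definition.get("@type") == "@json"
    }

def emit(node, json_terms, indent=0):
    """Emit a mapping as YAML lines, keeping JSON literal values in JSON form."""
    pad = "  " * indent
    lines = []
    for key, value in node.items():
        quoted_key = json.dumps(key)
        if key in json_terms:
            # Root of a JSON literal: serialize compactly as JSON, not block YAML.
            literal = json.dumps(value, separators=(",", ":"))
            lines.append(f"{pad}{quoted_key}: {literal}")
        elif isinstance(value, dict) and value:
            lines.append(f"{pad}{quoted_key}:")
            # Term names inside @context are definitions, not literal values.
            child_terms = set() if key == "@context" else json_terms
            lines.extend(emit(value, child_terms, indent + 1))
        else:
            # Scalars, lists, and empty maps are written in JSON (flow) form.
            lines.append(f"{pad}{quoted_key}: {json.dumps(value)}")
    return lines

document = {
    "@context": {
        "@version": 1.1,
        "e": {"@id": "http://example.com/vocab/json", "@type": "@json"},
    },
    "e": [56.0, {"d": True, "10": None, "1": []}],
}

yaml_lines = emit(document, json_literal_terms(document["@context"]))
print("\n".join(yaml_lines))
```

Even this toy version has to read the context before it can emit a single value, which is exactly what a plain YAML.dump(JSON.load(src)) pipeline never does.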

pchampin commented 2 years ago

It seems to me that the two YAML snippets above serialize to the same JSON (and this is confirmed by a quick test on https://www.convertjson.com/yaml-to-json.htm), so I don't understand where the issue is. :thinking:

gkellogg commented 2 years ago

It’s probably more of a philosophical question: Must a JSON Literal necessarily have the form of JSON?

VladimirAlexiev commented 2 years ago

It's also a pragmatic question:

  1. When converting to RDF, a @json literal should be treated as opaque and left alone, see https://w3c.github.io/json-ld-syntax/#the-rdf-json-datatype. I have more examples of such needs:
  2. What should a reader expect when seeing @type:@json or "..."^^rdf:JSON. If they expect JSON but find YAML, they may be unable to process it.
  3. I think we also need to declare @type:@yaml and "..."^^rdf:YAML

ioggstream commented 2 years ago

I only just learned about JSON Literals... I think it is a very complex feature if you see it as a literal, because even JSON parsers will not treat it as you might expect.

For example, a JSON Literal with duplicate keys will not be treated as literal by generic JSON parsers:

{
  "@context": {
    "@version": 1.1,
    "e": {
      "@id": "http://example.com/vocab/json",
      "@type": "@json"
    }
  },
  "e": {
    "a": "ciao",
    "a": 1
  }
}

will result in an entry with one of the duplicates removed (the first or the last; it is implementation dependent). How does JSON-LD handle these cases?

{
  "@context": { ... },
  "e": {
    "a": 1
  }
}
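As one concrete data point, here is a small sketch of how Python's standard `json` module treats the duplicate key. Other parsers may behave differently; the variable names are only illustrative.

```python
import json

text = '{"e": {"a": "ciao", "a": 1}}'

# The default decoder builds a dict, so the later "a" overwrites the earlier one.
default_result = json.loads(text)

def keep_pairs(pairs):
    """Keep every key/value pair instead of collapsing them into a dict."""
    return list(pairs)

# object_pairs_hook sees each pair before a dict is built, which shows that
# both entries are present in the text and are only lost during decoding.
raw_pairs = json.loads(text, object_pairs_hook=keep_pairs)

print(default_result)
print(raw_pairs)
```

The duplicate is dropped by the decoder's data model, not by the text, so a literal that must be preserved verbatim cannot go through a generic parser unchanged.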

VladimirAlexiev commented 2 years ago

@ioggstream https://w3c.github.io/json-ld-syntax/#the-rdf-json-datatype says "The lexical space is the set of UNICODE strings which conform to the JSON Grammar". Hopefully that includes only valid JSON representations, i.e. no duplicate keys.

This is not an optional feature. It's part of the JSON-LD spec, so it must be supported in YAML-LD.

I provided a real-world use case for it: GraphDB connectors for Lucene, SOLR, Elastic (https://graphdb.ontotext.com/documentation/10.0/connectors.html#full-text-search-and-aggregation-connectors)

gkellogg commented 2 years ago

The JSON-LD Literal definition is written to allow a variation in representation. The JCS C14N considerations only come into play when describing the representation within RDF Triples. Similar to rdf:XMLLiteral, its original intent is to allow for some portion of an XML document to be referenced as a literal across different encodings (also rdf:RDFA).

The JSON-LD spec says non-normatively that values of @json (or properties with "@type": "@json") are treated as JSON Literals. IMO, YAML-LD is free to innovate here. As there is a simple transformation from any YAML to JSON, a value of @json could still have a more general YAML format, as long as the result can be transformed into the value space (involving JCS). That said, a SHOULD statement on using the JSON sub-set of YAML seems reasonable, and allows for implementations that cannot reasonably conform to this.

gkellogg commented 2 years ago

@VladimirAlexiev said:

  1. When converting to RDF, a @json literal should be treated as opaque and left alone, see https://w3c.github.io/json-ld-syntax/#the-rdf-json-datatype. I have more examples of such needs:

When converting to RDF triples, a given serialization may have different ways of representing that. The JSON-LD from RDF algorithm describes the mechanism to use when transforming a triple containing an RDF Literal into JSON-LD.

  2. What should a reader expect when seeing @type:@json or "..."^^rdf:JSON. If they expect JSON but find YAML, they may be unable to process it.

Two different things. A JSON-LD processor may see JSON-LD with an explicit value of type rdf:JSON, where the value is a JCS encoded string, which would not automatically be turned into the internal @json value object representation.

  3. I think we also need to declare @type:@yaml and "..."^^rdf:YAML

I think we need to demonstrate a need here. The rdf:JSON literal was not established lightly. What evidence is there for the use of YAML literals in the wild?

ioggstream commented 2 years ago

@VladimirAlexiev Afaik JSON grammar allows duplicate keys. You need JCS to forbid duplicate keys.

@gkellogg

A SHOULD statement on using the JSON sub-set of YAML seems reasonable, and allows for implementations that cannot reasonably conform to this.

What do you mean by "JSON subset"? If you mean something like the "internal representation", then it's feasible. Otherwise I think that we can only check that the representation graph maps to the expected JSON literal when serialised in JSON.

gkellogg commented 2 years ago

@VladimirAlexiev Afaik JSON grammar allows duplicate keys.

No, I believe this has been addressed by RFC8259:

The names within an object SHOULD be unique.

Not a MUST, but that is because of concerns over backwards compatibility. The behavior when duplicate keys are present is unspecified, as different implementations do different things.

Also JCS / RFC8785 prohibits objects from having duplicate keys:

JSON objects MUST NOT exhibit duplicate property names.
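A processor that wants to enforce the stricter JCS rule at parse time could reject duplicates rather than silently keep one. The sketch below uses Python's standard `json` module; `reject_duplicates` and `load_strict` are hypothetical helper names, not part of any JSON-LD library.

```python
import json

def reject_duplicates(pairs):
    """Build a dict from key/value pairs, failing on a repeated property name."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate property name: {key!r}")
        seen[key] = value
    return seen

def load_strict(text):
    """Parse JSON text, raising ValueError if any object repeats a key."""
    return json.loads(text, object_pairs_hook=reject_duplicates)

# The same name at different nesting levels is fine.
print(load_strict('{"a": 1, "b": {"a": 2}}'))

# A repeated name within one object is rejected.
try:
    load_strict('{"a": "ciao", "a": 1}')
except ValueError as error:
    print("rejected:", error)
```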

ioggstream commented 2 years ago

@gkellogg

treated as JSON Literals ...

Does JSON-LD use JCS or JSON? What happens in the case of the JSON literal I wrote above ? https://github.com/json-ld/yaml-ld/issues/36#issuecomment-1173637884

gkellogg commented 2 years ago

With regard to JSON Literals, the spec uses JCS. IIRC, the spec is silent on duplicate keys and, as noted in the RFCs, implementations may have different behaviors. This is at least a SHOULD. But, for the specific case of JSON Literals, duplicate keys would violate the requirements of JCS.

gkellogg commented 2 years ago

What do you mean by "JSON subset"? If you mean something like the "internal representation", then it's feasible. Otherwise I think that we can only check that the representation graph maps to the expected JSON literal when serialised in JSON.

What I meant by "JSON subset" is the subset of YAML which is, effectively, JSON, i.e., the arrays, objects and native values that both YAML and JSON share. Perhaps there is another term for this.

The JSON-LD Internal Representation of a JSON Object is, however, an Infra map, which is defined specifically to have unique key/value pairs. All JSON-LD algorithms operate by transforming the JSON surface syntax into the internal representation, which will end up eliminating duplicate keys, in any case.

ioggstream commented 2 years ago

JSON Literals, the spec uses JCS

iiuc:

"JSON subset" is the subset of YAML which is, effectively JSON .. Infra map ...

Infra map: ordered sequence of key/value pairs. Keys are unique. Keys are strings.

YAML: unordered sequence of key/value pairs. Keys are unique. Keys can be arbitrary nodes.

About ordering

JSON libraries do not usually preserve ordering. I suspect that it is in general not a problem since iiuc

  1. a JSON-LD parser receiving a JSON Literal will c14n it and sort the keys of JSON objects
  2. @type: @json stores the JSON-LD Internal Representation and not the verbatim JSON text

About YAML-LD

If JSON Literals are about the Internal representation (the serialization always happens via JCS), then I think we do not need a @type: @yaml, because the data model is always the JSON one and serialization happens via JCS.

We only need @yaml if we decide to extend the JSON-LD data model.

WDYT?

gkellogg commented 2 years ago

This issue was discussed at the Aug 03 meeting.

TallTed commented 2 years ago

@ioggstream -- Please edit your https://github.com/json-ld/yaml-ld/issues/36#issuecomment-1174751556 and wrap code fences (either single or triple backticks) around all @terms that aren't meant to link to GitHub users (e.g., `@yaml`, `@type`, `@JSON`), because the users behind those handles probably aren't interested in our discussions and don't need alerts on every comment made here...

gkellogg commented 2 years ago

@ioggstream -- Please edit your #36 (comment) and wrap code fences (either single or triple backticks) around all @terms that aren't meant to link to GitHub users (e.g., `@yaml`, `@type`, `@JSON`), because the users behind those handles probably aren't interested in our discussions and don't need alerts on every comment made here...

I took care of it.

gkellogg commented 2 years ago

I propose closing this saying that YAML-LD has no specific encoding requirements for @json value objects as long as round-tripping YAML to JSON reproduces an equivalent structure.
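A rough sketch of what "reproduces an equivalent structure" could mean in practice, written in Python with the standard library only. `same_structure` is a hypothetical helper, and because no YAML library is used, the block-style form is hand-transcribed as the native structure a YAML loader would typically return (an assumption, not a measured result).

```python
import json

def same_structure(a, b):
    """Compare two parsed documents as JSON values.

    Booleans are kept distinct from numbers, because Python treats True == 1.
    """
    if isinstance(a, bool) or isinstance(b, bool):
        return isinstance(a, bool) and isinstance(b, bool) and a == b
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(same_structure(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(same_structure(x, y) for x, y in zip(a, b))
    if isinstance(a, (dict, list)) or isinstance(b, (dict, list)):
        return False
    return a == b

# Hand-transcribed stand-in for the block-style YAML form of the literal.
from_block_yaml = [56.0, {"d": True, "10": None, "1": []}]

# The JSON-style form of the same literal, parsed with the standard library.
from_json_form = json.loads('[56.0,{"d":true,"10":null,"1":[]}]')

print(same_structure(from_block_yaml, from_json_form))
```

Under this reading, the choice between block style and JSON style for an @json value is purely cosmetic, as long as both load to the same structure.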

ioggstream commented 2 years ago

@gkellogg can you please check if this way of using @json in YAML is consistent with the above words?

https://github.com/ioggstream/draft-polli-restapi-ld-keywords/pull/3/files

gkellogg commented 2 years ago

@gkellogg can you please check if this way of using @json in YAML is consistent with the above words?

https://github.com/ioggstream/draft-polli-restapi-ld-keywords/pull/3/files

Yes, that seems reasonable.

VladimirAlexiev commented 2 years ago

@gkellogg

I think we also need to declare @type:@yaml and "..."^^rdf:YAML

I think we need to demonstrate a need here. The rdf:JSON literal was not established lightly. What evidence is there for the use of YAML literals in the wild?

Uh, wouldn't YAML-LD provide thousands of such examples?

I think we need to consider JSON and YAML literals completely independently of whether or not they have any relation to LD (just like rdf:XMLLiteral is not RDF XML).

Let me try to adapt our first example https://graphdb.ontotext.com/documentation/10.0/lucene-graphdb-connector.html#using-the-create-command from Turtle+JSON to YAML-LD+YAML:

'@context': 
  luc: http://www.ontotext.com/connectors/lucene#
  luc-index: http://www.ontotext.com/connectors/lucene/instance#
  ex: http://www.ontotext.com/example/wine#
  rdfs: http://www.w3.org/2000/01/rdf-schema#
luc-index:my_index:
  luc:createConnector: !yaml
    types: [ex:Wine]
    fields:
      - fieldName: grape
        propertyChain: [ex:madeFromGrape, rdfs:label]
      - fieldName: sugar
        propertyChain: [ex:hasSugar]
        analyzed: false
        multivalued: false
      - fieldName: year
        propertyChain: [ex:hasYear]
        analyzed: false

I think you'll agree that's much nicer than the original.

So it's not a question of whether we need it, but how exactly to handle it:

Note: if we change our connector implementation to use RDF instead of JSON and add a bit to the context, this becomes straight YAML-LD (notice !yaml is removed but the payload after @context is the same):

'@context': 
  luc: http://www.ontotext.com/connectors/lucene#
  luc-index: http://www.ontotext.com/connectors/lucene/instance#
  ex: http://www.ontotext.com/example/wine#
  rdfs: http://www.w3.org/2000/01/rdf-schema#
  xsd: http://www.w3.org/2001/XMLSchema#
  fieldName: {'@id': luc:fieldName}
  types: {'@id': luc:types, '@type': '@id', '@container': '@list'}
  fields: {'@id': luc:fields, '@type': '@id', '@container': '@list'}
  propertyChain: {'@id': luc:propertyChain, '@type': '@id', '@container': '@list'}
  analyzed: {'@id': luc:analyzed, '@type': xsd:boolean}
  multivalued: {'@id': luc:multivalued, '@type': xsd:boolean}
luc-index:my_index:
  luc:createConnector: 
    types: [ex:Wine]
    fields:
      - fieldName: grape
        propertyChain: [ex:madeFromGrape, rdfs:label]
      - fieldName: sugar
        propertyChain: [ex:hasSugar]
        analyzed: false
        multivalued: false
      - fieldName: year
        propertyChain: [ex:hasYear]
        analyzed: false

This YAML-LD will be converted to the following turtle:

luc-index:my_index
  luc:createConnector [
    luc:types (ex:Wine);
    luc:fields (
      [luc:fieldName "grape";
        luc:propertyChain (ex:madeFromGrape rdfs:label)]
      [luc:fieldName "sugar";
        luc:propertyChain (ex:hasSugar);
        luc:analyzed false;
        luc:multivalued false]
      [luc:fieldName "year";
        luc:propertyChain (ex:hasYear);
        luc:analyzed false])] .

TallTed commented 2 years ago

@VladimirAlexiev (or @gkellogg) -- Please edit https://github.com/json-ld/yaml-ld/issues/36#issuecomment-1251223864 and put codefences around the @type:@yaml in the opening quoted block. They don't need pinging about our conversation.

gkellogg commented 2 years ago

Done.

gkellogg commented 2 years ago

This was discussed on 2022-09-28 (https://json-ld.org/minutes/2022-09-28/#31):

Vladimir Alexiev: I gave an example from Elastic Search. This connector can be used in indexing.
... Fields have types and other attributes.
... Currently, we implement this in JSON. There's a SPARQL INSERT involved.
... We've wanted to turn that into a better notation, as you can't use prefixes.
... We're thinking of converting it to proper RDF; the question is how to write it.
... If we allow JSON and YAML literals, it would help with the interpretation of that data.
... If JSON was done because it was popular, it makes sense that you be able to store YAML as a literal.
... A good example is GeoJSON. In JSON-LD 1.1, it can be interpreted.
... But, it comes out as a nested list of lists.
... There are textual formats for GeoJSON.
... I think we should have a YAML literal.
Gregg Kellogg: There's the JCS spec to canonicalize a JSON literal. We don't have such a thing for YAML.
... the value of canonicalization is that you can then compare literals for equality, so that value equality will coincide with lexical equality
Vladimir Alexiev: Ok, I see, but 1. RDF doesn't even canonicalize simple things like xsd:boolean, numbers (123 vs 0123), and even URLs
... 2. We could tackle YAML canonicalization; in fact I'd like to have that (and standardize pretty-printing parameters, and the ability to capture them in YAML-LD)
Gregg Kellogg: Sorry, out of time for today. We can continue on the next call. Please send in discussion topics for the next meeting agenda.
Created https://github.com/json-ld/json-ld.org/issues/797 -> action 797 create a repo for NDJSON-LD [1] (on ) due 5 Oct 2022

gkellogg commented 2 years ago

From the RDF Semantics

A datatype is understood to define a partial mapping, called the lexical-to-value mapping, from a lexical space (a set of character strings) to values. The function L2V maps datatypes to their lexical-to-value mapping. A literal with datatype d denotes the value obtained by applying this mapping to the character string sss: L2V(d)(sss). If the literal string is not in the lexical space, so that the lexical-to-value mapping gives no value for the literal string, then the literal has no referent. The value space of a datatype is the range of the lexical-to-value mapping. Every literal with that type either refers to a value in the value space of the type, or fails to refer at all. An ill-typed literal is one whose datatype IRI is recognized, but whose character string is assigned no value by the lexical-to-value mapping for that datatype.

The JSON-LD 1.1 Spec defines this for the rdf:JSON literal with a lexical space composed of UNICODE strings conforming to the JSON Grammar and a value space with specific serialization requirements so that two JSON literals can be expressed, say, using different whitespace, but be considered value-equivalent through mapping to the value space via JCS.
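The lexical-to-value idea can be sketched in Python as a partial function that returns nothing for ill-typed input. This is an approximation only: `l2v_json` is a hypothetical name, and `json.dumps` with sorted keys and compact separators is used as a stand-in for JCS. Real JCS (RFC 8785) additionally prescribes ECMAScript number formatting and key ordering by UTF-16 code units, and Python's decoder also accepts a few non-JSON tokens such as NaN.

```python
import json

def l2v_json(lexical):
    """Map a lexical form to a comparable canonical string, or None if ill-typed.

    Assumption: sorted keys plus compact separators approximate JCS well enough
    for simple inputs; this is not a conforming JCS implementation.
    """
    try:
        parsed = json.loads(lexical)
    except ValueError:
        # The string is outside the lexical space: the literal has no value.
        return None
    return json.dumps(parsed, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

first = '{"b": [1, 2], "a": "x"}'
second = '{ "a":"x",\n  "b":[1,2] }'
ill_typed = '{"a": }'

print(l2v_json(first) == l2v_json(second))
print(l2v_json(ill_typed))
```

The two well-formed strings differ in whitespace and key order but map to the same value, while the malformed string maps to no value at all. A YAML datatype would need an agreed equivalent of the canonical step to support the same comparison.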

For a hypothetical YAML datatype, the lexical space would clearly be the set of all UNICODE strings which conform to the YAML Grammar, but finding the value space is more difficult, as multiple YAML serializations may be considered to represent the same value. I think a necessary pre-condition for establishing a YAML datatype would be to identify a normative specification for obtaining the canonical form of a YAML document/stream.