json-ld / yaml-ld

CG specification for YAML-LD and UCR
https://json-ld.github.io/yaml-ld/spec

YAML-LD datatypes (and tags for datatypes) #17

Open VladimirAlexiev opened 2 years ago

VladimirAlexiev commented 2 years ago

Why might we want more than "string plus @type"?

Let's collect below examples of what we could want.


@gkellogg in https://github.com/ietf-wg-httpapi/mediatypes/issues/8#issuecomment-1034040169

If I were to revisit anything in the JSON-LD data model, it would be the interpretation of JSON numbers to allow for decimal values. As it is now, JSON numbers are either interpreted as integers (long) or doubles based on the range of the number. But, in JSON-LD 1.1, we use The JSON Canonicalization Scheme (RFC8785) as a way to represent numbers in the rdf:JSON datatype serialization, which allows for a serialization form of either integer, decimal, or double. This really only comes into play in JSON-LD when creating RDF literals from native JSON numbers (something which is generally a bad design point, but is there to allow a reasonable interpretation of native JSON forms), but could also come into play when representing those numbers in the data model, and thus in serializations to forms such as YAML.
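
A minimal Python sketch of the point about number interpretation, assuming a simplified rule (the actual JSON-LD toRdf rule also looks at the fractional part and the range of the value). The function name `xsd_datatype_for` is hypothetical; `parse_float=Decimal` is a standard option of Python's `json` module and stands in here for "a parser that keeps decimals".

```python
import json
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

def xsd_datatype_for(value):
    """Pick an XSD datatype for a parsed JSON number (illustrative, simplified)."""
    if isinstance(value, bool) or not isinstance(value, (int, float, Decimal)):
        raise TypeError(f"not a JSON number: {value!r}")
    if isinstance(value, int):
        return XSD + "integer"
    if isinstance(value, Decimal):
        return XSD + "decimal"
    return XSD + "double"

source = '{"price": 19.99, "count": 3}'

# Default parsing: fractional numbers become binary doubles.
default = json.loads(source)

# Opt-in decimal parsing keeps the lexical value exactly as written.
exact = json.loads(source, parse_float=Decimal)
```

With the default parser only the integer and double cases can ever occur; the decimal case needs a parser (or data model) that does not go through a binary float first.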


@VladimirAlexiev from #2:

instead of long form

dc:date: {"@type": xsd:date, "@value": 2022-05-18}



---

New ones:

- is it at all feasible to write `"foo"@en` in YAML rather than a separate `@language` field?
- JSON-LD cannot capture GeoJSON because that uses nested arrays. Can this be worked around somehow with a YAML tag for "2D array"?

pchampin commented 2 years ago

JSON-LD cannot capture GeoJSON because that uses nested arrays.

This is not the case anymore with JSON-LD 1.1 (example)

ioggstream commented 2 years ago

This is another interesting direction to explore that does not seem to create inconsistencies with YAML spec, thanks Vladimir! We could then ask the YAML community if it is possible to "register" in some way the xsd namespace to support this kind of mapping and associate it to the yaml.org 1.2 namespace.

I suggest using full-URI tags in the examples for clarity, eg:

# see https://yaml.org/spec/1.2.2/#tag-directives
%TAG !xsd! tag:http://www.w3.org/2001/XMLSchema:
---
# short form using tags
dc:date: !xsd!date 2022-05-18

# instead of long form
dc:date: {"@type": xsd:date, "@value": 2022-05-18}

anatoly-scherbakov commented 2 years ago

I feel that manually specifying data types for each value is very tedious, and the tag syntax is not very intuitive. My feeling is this: why don't we delegate that task to the context?

The machine is smart enough to understand that a value of a dc:date is actually a literal with xsd:date datatype — and JSON-LD contexts can express that.

ioggstream commented 2 years ago

Can you post an example? Probably we should start collecting examples of "equivalence classes" of yaml files in this repo.

VladimirAlexiev commented 2 years ago

@ioggstream We should use the actual XSD namespace. The tag: URI scheme is recommended by the YAML people but is not mandatory, so I'd rather follow TimBL's principles of using resolvable URLs: %TAG !xsd! http://www.w3.org/2001/XMLSchema#

https://yaml.org/spec/1.2.2/#104-other-schemas allows us to make an XSD YAML schema, and we should ask the YAML people to publish it at https://yaml.org/type/

@anatoly-scherbakov Of course if a field ALWAYS uses the same datatype, the context can provide it. But dates in instance data often come in various granularities (same with numbers). So wouldn't it be nice to write this instead of the respective long forms?

dct:created:  !xsd!gYear    2000
dct:issued:   !xsd!date     2022-05-18
dct:modified: !xsd!dateTime 2022-05-18T01:12:23
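
For comparison, the "respective long forms" can be sketched in Python. This assumes a YAML parser that reports each scalar together with its tag (many do not, see later comments); the `(key, tag, scalar)` triples and the helper `to_value_object` are hypothetical, and only the `!xsd!` handle is handled.

```python
# Handle -> prefix, as declared by: %TAG !xsd! http://www.w3.org/2001/XMLSchema#
TAG_HANDLES = {"!xsd!": "http://www.w3.org/2001/XMLSchema#"}

def to_value_object(tag, scalar, handles=TAG_HANDLES):
    """Expand a tagged scalar such as ('!xsd!date', '2022-05-18') into a JSON-LD value object."""
    for handle, prefix in handles.items():
        if tag.startswith(handle):
            return {"@type": prefix + tag[len(handle):], "@value": scalar}
    raise ValueError(f"no %TAG directive declares the handle used by {tag!r}")

# (key, tag, scalar) triples as a tag-preserving YAML parser might report them.
tagged = [
    ("dct:created", "!xsd!gYear", "2000"),
    ("dct:issued", "!xsd!date", "2022-05-18"),
    ("dct:modified", "!xsd!dateTime", "2022-05-18T01:12:23"),
]

long_form = {key: to_value_object(tag, scalar) for key, tag, scalar in tagged}
```

Each property gets its own datatype per value, which a single context entry for the property could not express.
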

pchampin commented 2 years ago

@anatoly-scherbakov

My feeling is this: why don't we delegate that task to the context?

Of course we can, and that's an important role of JSON-LD contexts: making explicit some implicit constraints/dependencies (e.g. "this field expects this datatype").

However, we also need a way to make this information explicit (e.g. in the expanded form of JSON-LD). In JSON-LD, this is done with a value object {"@value": "...", "@type": "..." }. In YAML-LD, tags provide a more concise and more idiomatic way to do it.

Also, +1 to @VladimirAlexiev's use case above.
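
The context-based route discussed above can be sketched as follows. This is a deliberately simplified illustration in Python, not the JSON-LD expansion algorithm: the helper `coerce` is hypothetical and ignores IRI expansion, `@vocab`, `@id` coercion, and scoped contexts.

```python
# A context that pins dc:date to xsd:date (type coercion).
context = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "dc:date": {"@type": "xsd:date"},
}

def coerce(term, value, ctx):
    """Return a value object if ctx declares a datatype for term, else the value unchanged."""
    definition = ctx.get(term)
    if isinstance(definition, dict) and "@type" in definition:
        return {"@type": definition["@type"], "@value": value}
    return value  # no coercion declared for this term

document = {"dc:date": "2022-05-18", "dc:title": "Example"}
expanded = {term: coerce(term, value, context) for term, value in document.items()}
```

The document itself stays free of datatype annotations, but the explicit value object still has to exist somewhere, which is the form a tag would abbreviate.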

anatoly-scherbakov commented 2 years ago

@VladimirAlexiev @ioggstream that is an interesting point. When using JSON-LD, I always tried to ensure that a particular property always maps to a specific type, but I agree that this application of tags is compelling. :+1:

gkellogg commented 2 years ago

This was discussed during today's call: https://json-ld.org/minutes/2022-06-22/.

gkellogg commented 2 years ago

This issue was discussed in today's meeting.

gkellogg commented 2 years ago

I think this is a great candidate for something an extended profile could do, and something like the %TAG ! http://www.w3.org/2001/XMLSchema# seems like a great way to go.

In my mind, this isn't a direct replacement for the @type of JSON-LD value objects, but an extension of the JSON-LD internal representation, much the same way that booleans and numbers are treated in JSON-LD (specifically in the to/from RDF algorithms). Implementations would need to maintain the internally typed values when expanding/compacting/framing, represent them using the appropriate tag when serializing to YAML in extended mode, or expand them to value objects when serializing in the basic mode.

The toRdf and fromRdf algorithms would need to honor them when generating RDF or turning RDF back into the internal representation, again running with the appropriate processing mode.

Otherwise, this change should be fairly transparent. IMO, this is the primary motivation for an extended profile.
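
A rough Python sketch of that idea, under stated assumptions: `TypedLiteral` is a hypothetical internal-representation node, the mode names `"extended"` and `"basic"` are taken from this discussion rather than from any published API, and no escaping of the lexical form is attempted.

```python
from dataclasses import dataclass

XSD = "http://www.w3.org/2001/XMLSchema#"

@dataclass(frozen=True)
class TypedLiteral:
    """Hypothetical internal node: a lexical form plus a datatype IRI."""
    lexical: str
    datatype: str

def serialize(node, mode):
    """Render a TypedLiteral for output.

    'extended' -> a YAML scalar carrying a verbatim tag
    'basic'    -> a plain JSON-LD value object
    """
    if mode == "extended":
        return f'!<{node.datatype}> "{node.lexical}"'
    if mode == "basic":
        return {"@value": node.lexical, "@type": node.datatype}
    raise ValueError(f"unknown mode: {mode!r}")

issued = TypedLiteral("2022-05-18", XSD + "date")
```

The same internal value survives expansion, compaction, and framing untouched; only the final serialization step depends on the mode.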

rob-metalinkage commented 2 years ago

So what is actually in play here is a profile of YAML itself - the profile for which JSON-LD translations are lossless, so we don't need a profile of YAML-LD, but YAML-LD is an extension of a "YAML-JSON-compatible" profile. Such a profile could be implicit - or made explicit if multiple YAML/JSON conversions are defined. Another reason to make it explicit would be to validate if a given YAML document is compatible with YAML-LD before defining the YAML-LD extended syntax for that YAML schema.

gkellogg commented 2 years ago

I guess in my mind, the "YAML-JSON-compatible" profile is analogous to YAML using the JSON schema. This does not depend on explicit tags, but implicitly associates the values with tag:yaml.org,2002:null, tag:yaml.org,2002:bool, tag:yaml.org,2002:int, and tag:yaml.org,2002:float.

I think something like a "YAML-XSD-compatible" profile might require the use of a tag namespace such as suggested by @VladimirAlexiev: %TAG !xsd! tag:http://www.w3.org/2001/XMLSchema:, so a tagged value such as !xsd!dateTime 2022-05-18T01:12:23 would parse to a native DateTime literal, and the JSON-LD internal representation would be extended to support the various literal types from XSD.

If running in "extended", or "YAML-XSD-compatible" mode, a %TAG definition such as above would be legitimate. If not running in that mode, a processor may reject the input or use Postel's law and parse it, but it should not be emitted unless the profile is set accordingly.

In my mind, this and alias nodes are the primary things that would be enabled by an extended mode.

If a processor sees some other %TAG definition (or definitions outside of some pre-defined set) it should probably fail to process the document, which then acts as an extension point for processors to eventually support more values for %TAG in the future, but for RDF purposes, anything beyond the XSD set

Given this, I think we may be about ready to define the processing modes more completely.
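
One possible reading of those rules, as a Python sketch. Everything here is an assumption for illustration: the names `check_tag_directive`, `may_emit_tags`, and `UnsupportedTag`, the `lenient` switch standing in for "Postel's law", and the single-entry set of allowed prefixes.

```python
XSD_PREFIX = "http://www.w3.org/2001/XMLSchema#"
ALLOWED_PREFIXES = {XSD_PREFIX}  # hypothetical pre-defined set

class UnsupportedTag(Exception):
    """Raised when a %TAG directive cannot be processed."""

def check_tag_directive(prefix, mode, lenient=True):
    """Accept or reject a %TAG prefix on input; returns True when accepted."""
    if prefix not in ALLOWED_PREFIXES:
        # Unknown prefixes always fail, which leaves room for future extension.
        raise UnsupportedTag(f"unsupported %TAG prefix: {prefix}")
    if mode != "extended" and not lenient:
        raise UnsupportedTag("%TAG directives require the extended profile")
    return True

def may_emit_tags(mode):
    """Tags are only written out when the extended profile is selected."""
    return mode == "extended"
```

A lenient basic-mode processor would therefore read `!xsd!` tagged input but serialize the result with value objects rather than tags.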

rob-metalinkage commented 2 years ago

I'm thinking here about statements about conformance - :myresource dct:conformsTo - how do I know if a yaml resource is "YAML using the JSON schema." (the same holds true for the identifiers for YAML-LD and JSON-LD.)

The general use case is to be able to determine what an API supports in terms of interoperability of data payloads. Can anyone orient me to where this is being defined or discussed? I can see inline directives such as https://yaml.org/spec/1.2.2/#681-yaml-directives, @context where a URI is referenced and $schema directives - but not where such things are registered - we have a related in IANA profiles on media types for encodings, but what about information content profiles?

Is identification of the profile out-of-band using resolvable identifiers (i.e. not in syntax-specific directives using syntax-specific keywords and versioning) a factor in defining processing modes?

TallTed commented 2 years ago

@rob-metalinkage -- Please edit your last comment, https://github.com/json-ld/yaml-ld/issues/17#issuecomment-1207728874, to put @context into a code fence (like `@context`), so that GitHub user doesn't get endlessly pinged on threads about which they do not care.

gkellogg commented 2 years ago

I've looked into this some more as part of trying to implement extended support for XSD scalar values in YAML. IMO, the appropriate %TAG value would be something like the following:

%TAG ! http://www.w3.org/2001/XMLSchema#

This would allow values such as !date 2022-08-08, which would expand as !<http://www.w3.org/2001/XMLSchema#date> "2022-08-08" and be a natural way to capture "2022-08-08"^^<http://www.w3.org/2001/XMLSchema#date>. However, I'm stymied by a bug in LibYAML, which Ruby and many other languages rely on for parsing YAML (https://github.com/yaml/libyaml/issues/253), where # is not accepted as a URI character (really ns-uri-char). So far, the LibYAML team has been unresponsive, and the library shows very little activity in the last couple of years. Of course, we could hack this with some other URI, but that doesn't seem appropriate for this group.

Other YAML tools show similar issues, I think largely due to the fact that the YAML spec only uses the tag scheme in its examples. Until this issue is resolved, I think we need to defer an extended mode for YAML-LD that would involve interpreting XSD datatype scalar values. The spec recommends the use of tag: (oddly), and if we were to go there, we would probably want to introduce something like %TAG ! tag:www.w3.org,2022:xsd/ but that seems quite arbitrary.

An example file I've been working with to exercise this variation is the following:

%YAML 1.2
%TAG ! http://www.w3.org/2001/XMLSchema#
---
"@context":
  "@vocab": http://xmlns.com/foaf/0.1/
name: !string Gregg Kellogg
homepage: https://greggkellogg.net/
depiction: http://www.gravatar.com/avatar/42f948adff3afaa52249d963117af7c8
date: !date 2022-08-08

(note, the use of a specific tag handle shouldn't be significant. In this case, it's using the primary tag handle, but it could just as well be the secondary tag handle (!!) or a named tag handle (!xsd!) for our processing model).
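
To illustrate why the choice of handle should not matter, here is a small Python sketch that resolves shorthand tags against `%TAG` directives and prints the corresponding RDF literal. It works on plain strings and does not parse YAML; the function names are hypothetical, YAML's default handles are deliberately not modelled, and the literal formatting does no escaping.

```python
import re

# A named handle looks like !word! followed by a non-empty suffix.
_NAMED_HANDLE = re.compile(r"^(![0-9A-Za-z-]+!)(.+)$")

def parse_tag_directives(header_lines):
    """Collect {handle: prefix} from '%TAG <handle> <prefix>' lines."""
    handles = {}
    for line in header_lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "%TAG":
            handles[parts[1]] = parts[2]
    return handles

def resolve(tag, handles):
    """Expand a shorthand tag (!date, !!date, !xsd!date) to a full URI."""
    if tag.startswith("!!"):
        handle, suffix = "!!", tag[2:]
    else:
        named = _NAMED_HANDLE.match(tag)
        handle, suffix = named.groups() if named else ("!", tag[1:])
    if handle not in handles:
        raise ValueError(f"handle {handle!r} is not declared by a %TAG directive")
    return handles[handle] + suffix

def ntriples_literal(lexical, datatype_iri):
    """Format a typed literal; no escaping, so only suitable for simple strings."""
    return f'"{lexical}"^^<{datatype_iri}>'
```

Declaring the XSD namespace under `!`, `!!`, or `!xsd!` and writing `!date`, `!!date`, or `!xsd!date` respectively all resolve to the same datatype URI, so the processing model only needs to look at the resolved tag.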

If we are to support XSD types, we probably want to white-list allowed datatype URIs to include most XSD types, in addition to tag:yaml.org,2002:str, tag:yaml.org,2002:null, tag:yaml.org,2002:int, tag:yaml.org,2002:float, and tag:yaml.org,2002:bool which would map more directly to the JSON-LD Internal Representation.

See also https://github.com/yaml/yaml-spec/issues/268#issuecomment-1208565027.

gkellogg commented 2 years ago

  • is it at all feasible to write "foo"@en in YAML rather than a separate @language field?

No, I don't believe it is, however, we could consider using a datatype form such as defined for the i18n namespace:

@prefix i18n: <https://www.w3.org/ns/i18n#> .

[ ex:title "foo"^^i18n:en ] .

Although it's defined to allow a combination of language and base-direction, it can be used for just language or base direction. Of course, we would need to define that literal values using an i18n datatype consisting of only language would be translated to language-tagged literals, and vice versa.
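
The mapping in both directions could look roughly like this in Python. It assumes the i18n datatype IRI is the namespace followed by the language, then `_` and the direction when one is present; the function names are hypothetical and no validation of language tags is done.

```python
I18N = "https://www.w3.org/ns/i18n#"

def to_i18n_datatype(language=None, direction=None):
    """Build an i18n datatype IRI from a language and/or a base direction."""
    if not language and not direction:
        raise ValueError("need a language, a direction, or both")
    suffix = language or ""
    if direction:
        suffix += "_" + direction
    return I18N + suffix

def from_i18n_datatype(iri):
    """Split an i18n datatype IRI into (language, direction); either may be None."""
    if not iri.startswith(I18N):
        raise ValueError(f"not an i18n datatype: {iri}")
    # Language tags use hyphens, so the first underscore marks the direction.
    language, _, direction = iri[len(I18N):].partition("_")
    return (language or None, direction or None)
```

A literal whose datatype splits into a language and no direction is the case that would be translated to a plain language-tagged literal such as `"foo"@en`.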

VladimirAlexiev commented 2 years ago

@gkellogg

VladimirAlexiev commented 2 years ago

onlineyamltools.com allows # but then complains with: Error: YAMLException: unknown tag !<http://www.w3.org/2001/XMLSchema#string> at line 6, column 28

Trying with explicit xsd tag gives the same error:

%YAML 1.2
%TAG !xsd! http://www.w3.org/2001/XMLSchema#
---
name: !xsd!string Gregg Kellogg

This tool can only use the "YAML JSON schema" builtin tags (and supports timestamp, although that has been deprecated). As expected, it can mangle numbers:

%YAML 1.2
%TAG ! tag:yaml.org,2002:
---
name:   !str Gregg Kellogg
int:    !int 123
bigint: !int 123456789012345678901231                             # -> 1.2345678901234569e+23  ouch!
bigint: 123456789012345678901231                                  # -> 1.2345678901234569e+23  ouch!
float:  !float 1.235609853907835079889067406870964870956870967908 # -> 1.235609853907835
date:   !timestamp 2022-08-08                                     # -> 2022-08-08T00:00:00.000Z
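
The mangling is not specific to that tool: any path through a binary double loses these digits. A short Python check, using the same two literals as above (the variable names are only for illustration):

```python
from decimal import Decimal

big = "123456789012345678901231"
frac = "1.235609853907835079889067406870964870956870967908"

# Going through a binary double drops digits that do not fit in 53 bits.
lossy_int = int(float(big))
lossy_frac = float(frac)

# Keeping the lexical form, or an arbitrary-precision type, is lossless.
exact_int = int(big)
exact_frac = Decimal(frac)
```

This is the practical argument for letting a tag such as `!xsd!integer` or `!xsd!decimal` carry the lexical form through to the RDF literal instead of a native number.
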

gkellogg commented 2 years ago

My implementation needed to use a lower-level parser that just transforms YAML to the Representation Graph without further interpretation. In Ruby Psych, this is done via Psych.parse_stream. That level shouldn't place constraints on any specific schema.

gkellogg commented 2 years ago

Discussed at TPAC F2F

VladimirAlexiev commented 2 years ago

Beyond XSD: let's not forget custom datatypes, eg:

gkellogg commented 2 years ago

This was discussed on [2022-09-28](https://json-ld.org/minutes/2022-09-28/#16).
Pierre-Antoine Champin: The devil is in the details, and in the bnodes :-D
Vladimir Alexiev: I think we should use YAML tags in the form that datatypes are used for RDF.
... JSON-LD is more verbose, and the YAML syntax is more concise.
... In many cases the context will relieve you of this need, but there are cases where the graph is heterogeneous
... May be a problem with parsers.
... This also relates to YAML schemas, and how to attach types.
... YAML had a schema including dates, but have backed up.
... My proposal would be that the WG will declare a %TAG !xsd! ...
... But, implementers will need to use a better parser that supports tags.
... This is also important for numbers.
... We had trouble in xxx group, where the number would be mis-interpreted.
... Then we need to look at a YAML parsers matrix to determine how widely available it is.
Gregg Kellogg: The current "spec" refers to a basic profile, which doesn't include tags but only basic YAML values
... and an Extended profile that includes XSD datatypes, and tags for URLs (is it absolute, or relative...)
... Gregg has an implementation that uses the YAML parse tree.
... Also in JSON-LD (discussion between Gregg and Antoine at TPAC), there is a movement towards handling more datatypes, and not mangling literals with default treatment of numbers
Vladimir Alexiev: What about URLs?
... In a heterogeneous dataset, the same field could contain either a string or a resource.
... can we have a single tag !id or !uri that would handle absolute, relative and CURIEs?
Gregg Kellogg: We want to explore some more use cases of URLs before deciding
Vladimir Alexiev: Can we decide this issue?
... let's not forget custom datatypes, eg geo:wktLiteral, geo:gmlLiteral, 5-10 more in GeoSPARQL 1.1, and the tentative rdf:JSON and rdf:YAML
Gregg Kellogg: Questions of quoting: is !xsd!integer '123' the same as !xsd!integer 123 and same as 123, or different?
Niklas Lindström: Author: someone!tag-key => as if author was defined in the context with "`@type`": <tag-key>; then if e.g. someone!uri was encountered, *and* uri is defined as an alias of "`@id`", this is short for {"`@id`": "someone"}
... the tag comes before the value, eg !tag-key someone
Gregg Kellogg: Tags should be declared in %TAG not in context, else we'll go against the grain of YAML

TallTed commented 2 years ago

@gkellogg -- Several unfenced @ entities are in the last several lines of the bot-posted conversation https://github.com/json-ld/yaml-ld/issues/17#issuecomment-1263840815 causing more unintended alerts to be fired in their direction.... Maybe the bot can be tweaked to codefence such entities going forward?

gkellogg commented 2 years ago

Sorry, must have been unfenced on IRC. I’ll fix them later

TallTed commented 2 years ago

Yeah, I'm sure they were unfenced on IRC. There's no consistent value to fencing there.

Weirdly, now that they're single-backtick fenced here, those backticks are showing as part of the text instead of being interpreted as markdown -- so, for instance, we now see (bold added here to help with clarity) {"`\@id`": "someone"}, where we'd expect to see {"@id": "someone"}.

I suspect this won't be a quick or easy fix, but it should be raised with the folks running the (now several!) IRC/log-to-GitHub bots.

gkellogg commented 2 years ago

Well, I handle the irc log to HTML for these minutes, which were inserted here. Perhaps it could detect some bare keywords. You're right that the result in the comment is wrongly interpreted, but that seems like a GH issue.

TallTed commented 2 years ago

I'd suggest wrapping the larger element including the @, so {"@id": "someone"}, which makes overall sense anyway, the larger element being code.