json-ld / yaml-ld

CG specification for YAML-LD and UCR
https://json-ld.github.io/yaml-ld/spec

File signature #7

Closed nichtich closed 2 years ago

nichtich commented 2 years ago

As a data consumer, I want an indicator to tell me that a file is probably YAML-LD, so that I know when to expect YAML-LD.

Strictly checking whether a YAML document is valid YAML-LD requires following the full specification. Nevertheless, some kind of magic file number would be useful. As suggested here, a YAML global tag could be used for this purpose (see RFC 4151):

!<tag:json-ld.org,2022>
$context: http://schema.org/
$type: Person
name: Pierre-Antoine Champin

YAML processors will raise an "unknown tag" error when trying to process the document without knowledge of YAML-LD. It can still be parsed as valid YAML, but there is no default mapping to JSON. This is not a bug, but a feature.

pchampin commented 2 years ago

I want an indicator to tell me that a file is probably YAML-LD

Assuming a specific content-type gets registered, would this answer your issue? Or do you want the indicator to be in the content itself?

ioggstream commented 2 years ago

@nichtich some comments:

  1. If we want to rely on YAML, we should consider that files could be normalized like that:
%YAML 1.2
---
!<tag:json-ld.org,2022>
$context: http://schema.org/
$type: Person
name: Pierre-Antoine Champin

This means that we cannot rely on magic numbers / file signatures.

  2. I am reasoning on tags...

nichtich commented 2 years ago

we should consider that files could be normalized like that: [...]

There are several more ways to push down the start of actual content in a YAML file. YAML syntax is a beast.

Or do you want the indicator to be in the content itself?

Text-based file formats rarely have traditional fixed-position magic file numbers, but applications should be able to scan the first lines of a YAML file to detect whether it's meant to be YAML-LD.
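A minimal sketch of such a first-lines scan, assuming the signature is the `!<tag:json-ld.org,2022>` tag proposed above. The function name `looks_like_yaml_ld` and the 20-line limit are hypothetical choices for illustration; this is a line-based heuristic, not a YAML parser, and it only inspects the first document in a file.

```python
# Hypothetical signature tag, taken from the example earlier in this thread.
YAML_LD_TAG = "!<tag:json-ld.org,2022>"


def looks_like_yaml_ld(text: str, max_lines: int = 20) -> bool:
    """Heuristically report whether the first content line starts with the tag."""
    for line in text.splitlines()[:max_lines]:
        stripped = line.strip()
        # A document-start marker may stand alone or precede the tag.
        if stripped.startswith("---"):
            stripped = stripped[3:].strip()
        # Skip blank lines, comments, and directives such as "%YAML 1.2".
        if not stripped or stripped.startswith(("#", "%")):
            continue
        # The first real content line decides the outcome.
        return stripped.startswith(YAML_LD_TAG)
    return False
```

Because directives, comments, and `---` markers are skipped, the check gives the same answer for the bare example and for the normalized form that begins with `%YAML 1.2`.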

By the way, another solution to this requirement is to state that a YAML-LD document must include $schema as its first key.

ioggstream commented 2 years ago

YAML syntax is a beast

This is not going to change :)

YAML-LD document must include $schema as its first key

YAML + JSON-LD + JSON-Schema will not be simpler than YAML + JSON-LD ;)

gkellogg commented 2 years ago

JSON-LD does not have a way to verify that the file is, in fact, JSON-LD if retrieved as application/json unless a describedby link relation points to a context. Otherwise, an embedded @context, which can appear anywhere, is useful. While adding a magic number may be a best practice, I don't think it should be required in order to treat the data as JSON-LD; that keeps us in line with general JSON-LD principles. At most, I'd say that files retrieved as application/ld+yaml SHOULD include a magic number (whatever is settled on), but may not, depending on circumstances.
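To illustrate the "embedded @context, which can appear anywhere" point, here is a minimal sketch that walks already-parsed data looking for an `@context` key at any depth. The helper name `has_embedded_context` is hypothetical, and finding the key is only a hint that the data is meant as Linked Data, not a validity check.

```python
import json
from typing import Any


def has_embedded_context(node: Any) -> bool:
    """Return True if an "@context" key appears anywhere in parsed JSON data."""
    if isinstance(node, dict):
        if "@context" in node:
            return True
        return any(has_embedded_context(value) for value in node.values())
    if isinstance(node, list):
        return any(has_embedded_context(item) for item in node)
    # Scalars (strings, numbers, booleans, None) cannot carry a context.
    return False


# The context is nested inside a list item rather than at the top level.
doc = json.loads('{"items": [{"@context": "http://schema.org/", "name": "x"}]}')
print(has_embedded_context(doc))  # prints True
```

The same function would apply to the result of a YAML parser, since it only relies on dicts, lists, and scalars.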

ioggstream commented 2 years ago

Probably the only way to include a magic number in YAML is to start the document with a comment, e.g. #?i-am-yaml-ld.

IMHO it's not interoperable, as comments are not preserved by all parsers.

I agree with @gkellogg when he says that JSON-LD relies on other information to determine whether a JSON document is Linked Data. I think that, once the document is parsed, we should take a similar approach.

anatoly-scherbakov commented 2 years ago

I would like to voice the following counter arguments to the introduction of special tags and headers.

Historical example: HTML

HTML had a required doctype header before HTML 5; everyone was copy-pasting that header (or generating it with their IDE), but ultimately I do not believe it was very informative.

With modern HTML, the header has been reduced to just <!DOCTYPE html>. But even then, does it provide much more information than could be extracted from the existence of the <html> root tag?

Duck typing

I am a proponent of conciseness. If the machine can interpret this file as YAML-LD, then it is YAML-LD. If it cannot, it will yield an error message.

If a YAML file is loadable into an RDF graph (possibly with an external context) — it is YAML-LD.

Distinction between YAML and YAML-LD does not exist

The versatility of JSON-LD and, consequently, YAML-LD is rooted in the fact that a JSON or a YAML document managed by not-LD-aware software can be interpreted as a Linked Data document, even if you do not have control over its content. You just need to supply the right context.

For instance, I am interpreting GitHub API output as JSON-LD and consuming it into an RDF graph without any meaningful changes to the document itself. The same might apply to YAML data and configuration files. Just supply the proper context, and the file starts making real sense.

So how do you distinguish a YAML file from a YAML-LD file? You don't. All YAML is potentially YAML-LD if you have the proper context ready.
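A minimal sketch of "supplying the right context" to data that was not written as Linked Data. The helper `with_context`, the sample record, and the choice of a schema.org `@vocab` are all assumptions for illustration; turning the result into an RDF graph would still need a JSON-LD processor, which is not shown here.

```python
from typing import Any, Dict


def with_context(data: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Return a copy of `data` with `context` attached, unless one is present."""
    if "@context" in data:
        # Respect a context the document already declares.
        return dict(data)
    return {"@context": context, **data}


# Illustrative record; the field names are assumptions, not real API output.
plain = {"login": "octocat", "name": "The Octocat"}

# Hypothetical external context mapping every key into the schema.org vocabulary.
external_context = {"@vocab": "http://schema.org/"}

linked = with_context(plain, external_context)
```

The original record is left untouched, which matches the point that the producer of the data does not need to know about Linked Data at all.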

Summary

I'd argue against mandatory tags. They will limit the interoperability of the standard and the tooling around it, and they add syntactic noise and magic that non-technical domain experts will have to deal with when writing YAML-LD. I think we should not burden them with that.

nichtich commented 2 years ago

Distinction between YAML and YAML-LD does not exist

If this were the case, there would be no need to define YAML-LD: just use an existing YAML2JSON conversion and use the result as JSON-LD. If, however, interpreting YAML-LD requires processing the documents in some special way not covered by the default YAML2JSON mapping, I would rather know whether a document requires this additional processing step.

anatoly-scherbakov commented 2 years ago

@nichtich with the idea of the $-context (#11, as per @gkellogg), the special conversion might be omitted; we'd only need the default one.

However, my opinion expressed above is about a slightly different thing. I meant that almost any valid YAML file can be interpreted as YAML-LD, and thus there is no need to specially mark some YAML files as YAML-LD with a special header, comment, or tag.

VladimirAlexiev commented 2 years ago

I side with @pchampin and @anatoly-scherbakov: if we come up with a signature, it should be recommended but not mandatory.

juusoautiosalo commented 2 years ago

Having browsed through the issues in this repository, it seems that the following design principle has been established:

Any valid JSON-LD document can be converted to a valid YAML-LD document with a generic YAML2JSON converter.

It is also my understanding that JSON-LD does not have a file signature, so I think it cannot be mandatory for YAML-LD either.

(I'm new here and have not formed an opinion on whether it should be recommended or supported.)

gkellogg commented 2 years ago

Yes, that seems to be the emerging consensus. A profile allowing more use of YAML features may also be supported eventually, but the base YAML-LD profile is likely limited to simple conversion of the parsed JSON, as described in #12.

VladimirAlexiev commented 2 years ago

@nichtich I claim in #17 "The tag: URI scheme is recommended by the YAML people but is not mandatory, so I'd rather follow TimBL's principles of using resolvable URLs:".

Then https://github.com/json-ld/yaml-ld/issues/17#issuecomment-1142096097 gives a detailed example.

So if we adopt a "signaling" tag, would you agree that, instead of

!<tag:json-ld.org,2022>

we use something like

!<https://w3c.github.io/yaml-ld-syntax/>

ioggstream commented 2 years ago

@gkellogg I propose to close as "wontfix" the "File signature" issue: like JSON-LD, we really need to inspect the content to understand whether it's YAML-LD.

A forced solution could just create clashes with future YAML versions (we're building upon YAML). I don't know how to make it work, for example, with files that contain multiple YAML documents, e.g.

# First document in foo.yaml
---
first: file
...
# second document, same file: foo.yaml
---
"I am": the second document
...

VladimirAlexiev commented 2 years ago

@ioggstream I propose to define a short, useful, but optional piece of advice to put in the Internet Media Type section.

E.g. https://www.w3.org/TR/turtle/#sec-mediaReg:

Magic number(s): Turtle documents may have the strings @prefix or @base (case sensitive) or the strings 'PREFIX' or 'BASE' (case insensitive) near the beginning of the document.
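A rough sketch of how a tool might apply the Turtle advice quoted above. The name `may_be_turtle` and the 512-character window are assumptions: the registration only says "near the beginning", so any concrete cut-off is a guess made here for illustration.

```python
import re

# "@prefix" and "@base" are case sensitive according to the quoted registration.
_AT_FORM = re.compile(r"@(prefix|base)\b")
# "PREFIX" and "BASE" are case insensitive; the lookbehind keeps "@PREFIX"
# from being accepted through this second pattern.
_KEYWORD_FORM = re.compile(r"(?<!@)\b(prefix|base)\b", re.IGNORECASE)


def may_be_turtle(text: str, window: int = 512) -> bool:
    """Heuristically check for Turtle directives near the start of `text`."""
    head = text[:window]  # assumed meaning of "near the beginning"
    return bool(_AT_FORM.search(head) or _KEYWORD_FORM.search(head))
```

This is a hint rather than a signature: a plain-text file that happens to contain the word "base" early on would also pass.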

ioggstream commented 2 years ago

@VladimirAlexiev IMHO magic numbers need to be reliable. They are used and implemented by generic tools like the file command, or by operating systems for file hinting and launching external programs.

I briefly scraped the media type registrations: out of roughly 896 application/* media types, the word "near" is used once (for sparql-query).

YAML does not provide a magic number, and if I were to provide one in YAML-LD, I'd just say "See YAML".

My 2¢, R.