Open ioggstream opened 1 year ago
It seems that a conversion of a JSON Lines stream to a YAML stream, and vice versa, is straightforward.
I have tested yq. I cannot seem to be able to convert a JSON Sequence file into a YAML Stream using that tool. Documented it here: https://github.com/mikefarah/yq/discussions/1279
The reverse direction is functional though:
$ cat sample.yaml
foo: bar
---
baz: kaboom
$ yq -o=json -I=0 . sample.yaml
{"foo":"bar"}
{"baz":"kaboom"}
So, at least partially, JSON Sequence to YAML Stream mapping is kind of supported in the ecosystem already. Would be interesting to draw a comparison matrix for other tools and libraries.
Edited: as @TallTed points out, it's application/json-seq.
@anatoly-scherbakov could you please integrate the examples above showing the 0x1E and 0x0A characters when they are present? I have not understood whether the output is a compliant JSON Sequence (mediatype ~application/seq+json~ application/json-seq) or not.
Producing a seq+json might require defining a new media type (e.g. ld+json-seq :DDD). This is another interesting thread. Personally, I don't know whether ++ media types are OK or not, nor am I advocating their use.
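For reference, here is a minimal sketch of what RFC 7464 framing looks like at the byte level, using the two documents from the yq example above. Each JSON text is preceded by RS (0x1E) and followed by LF (0x0A). The function name `to_json_seq` is made up for illustration; this is not something yq provides.

```python
import json

RS = b"\x1e"  # ASCII Record Separator, precedes every JSON text (RFC 7464)
LF = b"\x0a"  # line feed, follows every JSON text


def to_json_seq(documents):
    """Encode JSON-serializable values as an application/json-seq byte string.

    Hypothetical helper for illustration only.
    """
    return b"".join(
        RS + json.dumps(doc, separators=(",", ":")).encode("utf-8") + LF
        for doc in documents
    )


encoded = to_json_seq([{"foo": "bar"}, {"baz": "kaboom"}])
print(encoded)
# → b'\x1e{"foo":"bar"}\n\x1e{"baz":"kaboom"}\n'
```

By that reading, the yq output shown earlier carries only the 0x0A terminators and no 0x1E prefixes, so it looks like JSON Lines rather than a compliant JSON text sequence.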
Regarding media types with multiple plus signs, see Draft RFC Media Types with Multiple Suffixes. They're not yet officially permitted nor sanctioned by IANA nor IETF, but at least one such media type registration is pending.
Thank you for pointing this out. This approach is also known as:
- NDJSON
- Which is forked from JSON Lines, often abbreviated as JSONL.
Note that NDJSON/JSONL doesn't make use of the RS character, as required in the RFC. In practice, I've seen neither, and they as yet bear no relationship with JSON-LD.
They both seem generally related to the concept of a YAML Stream, but differ from the treatment of multiple JSON script tags in HTML.
@anatoly-scherbakov could you please integrate the examples above showing the 0x1E and 0x0A characters when they are present? I have not understood whether the output is a compliant JSON Sequence (mediatype application/seq+json) or not.
Producing a seq+json might require defining a new media type (e.g. ld+seq+json :DDD). This is another interesting thread. Personally, I don't know whether ++ media types are OK or not, nor am I advocating their use.
Actually application/json+seq, not seq+json, which makes sense to me. But, again, NDJSON doesn't strictly seem to conform to the RFC.
As @TallTed notes, there is a proposal for multiple + media types (for Verifiable Credentials, I think), and I think there is general support for this, but wheels move slowly.
If we constrain ourselves to the JSON-LD internal representation, we don't have a target for a document stream. Of course, we could introduce such by extension, but it doesn't really seem to relate to the -LD use case, without also extending into some notion of multiple graphs, which aren't treated as a Dataset.
It turns out that JavaScript Object Notation (JSON) Text Sequences a/k/a JSON Text Sequences (but, so far as I can tell, not properly known as JSON Sequences) have the media type application/json-seq -- no plus sign; not a sub-type of JSON (as would be suggested by application/seq+json) nor of some nonexistent "Sequence" (as would be suggested by application/json+seq); and not requiring multiple + at all.
Thank you for pointing out the difference between the spec and NDJSON. Indeed, they use different separator characters.
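To make the separator difference concrete, here is a rough sketch of two naive parsers. `parse_ndjson` and `parse_json_seq` are hypothetical names, and both are deliberately lenient: they skip empty chunks and ignore the truncation-recovery rules that RFC 7464 describes.

```python
import json


def parse_ndjson(text):
    """Parse NDJSON / JSON Lines: one JSON text per LF-delimited line."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def parse_json_seq(text):
    """Parse an RFC 7464 sequence: each JSON text is preceded by RS (U+001E)."""
    return [json.loads(chunk) for chunk in text.split("\x1e") if chunk.strip()]


ndjson = '{"foo":"bar"}\n{"baz":"kaboom"}\n'

# The first record spans two lines; that is fine here because RS, not LF,
# marks where a record starts.
json_seq = '\x1e{"foo":\n  "bar"}\n\x1e{"baz":"kaboom"}\n'

assert parse_ndjson(ndjson) == parse_json_seq(json_seq)
```

Feeding the multi-line record to `parse_ndjson` would fail, because each line is parsed on its own.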
I have encountered usages of NDJSON (say, you can process that with jq) and support for that format in commercial systems (say in https://data.world, which is by the way RDF oriented and supports SPARQL).
I haven't seen any usage of the JSON Sequences RFC before.
@anatoly-scherbakov if you could write a one-pager presenting the various JSON seq alternatives on the market, I will ask the media type folks whether standardizing a different format for JSON is advisable.
This is clearly unrelated to this spec, and this issue should probably be addressed in the YAML media type document.
Wdyt?
@ioggstream I started to write a little memo about it but found this page in Wikipedia which, it seems, about does the job. Does it?
@anatoly-scherbakov I think we can say
This spec does not mandate a way to convert multi-document YAML Streams to a specific format such as json-seq. The implementer is then free to convert a multi-document YAML stream to multiple, separate JSON texts, to a single json-seq file, or to some other custom multi-document JSON format
WDYT?
If we say this, then we can't really have any tests involving multi-document streams, which is fine. If the concept of multi-document streams is important, it would be for JSON-LD as well, and should probably be taken up there. However, I don't really see how it fits with our data model, which already has the concept of named graphs which is a closely related way of dealing with this in the RDF world.
Yes, certainly.
... named graph ...
Can you clarify the relation between a YAML-LD document and a named graph? iiuc a YAML stream can contain multiple documents, and each one can contain multiple named graphs.
Named Graphs are often used to describe the content of some particular RDF source, particularly when used in SPARQL. That is why I said that the closest analog to multiple files from the RDF world is probably named graphs. Generally, each separate document would be considered an "RDF Source", and the only system that I can think of that deals with more than one RDF Source is SPARQL.
For example:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?who ?g ?mbox
FROM <http://example.org/dft.ttl>
FROM NAMED <http://example.org/alice>
FROM NAMED <http://example.org/bob>
WHERE
{
   ?g dc:publisher ?who .
   GRAPH ?g { ?x foaf:mbox ?mbox }
}
Where http://example.org/alice and http://example.org/bob represent different endpoints/RDF Sources. (Example taken from 13.2.3 Combining FROM and FROM NAMED in SPARQL 1.1 Query.)
In this view, each document in a YAML-LD stream would be a different RDF Source, although the analogy breaks down as IIUC there is no way to name separate documents in a YAML stream, but we might define a convention, if naming individual documents is deemed important.
each document in a YAML-LD stream would be a different RDF Source
How many graphs does the following yaml document from https://w3c.github.io/json-ld-syntax/#example-referencing-named-graphs-using-an-id-map-with-none contain?
---
- "@id": http://example.org/foaf-graph
  http://www.w3.org/ns/prov#generatedAtTime:
  - "@value": 2012-04-09T00:00:00
    "@type": http://www.w3.org/2001/XMLSchema#dateTime
  http://example.org/graphMap:
  - "@graph":
    - "@id": http://manu.sporny.org/about#manu
      "@type":
      - http://xmlns.com/foaf/0.1/Person
      http://xmlns.com/foaf/0.1/name:
      - "@value": Manu Sporny
      http://xmlns.com/foaf/0.1/knows:
      - "@id": https://greggkellogg.net/foaf#me
  - "@graph":
    - "@id": https://greggkellogg.net/foaf#me
      "@type":
      - http://xmlns.com/foaf/0.1/Person
      http://xmlns.com/foaf/0.1/name:
      - "@value": Gregg Kellogg
      http://xmlns.com/foaf/0.1/knows:
      - "@id": http://manu.sporny.org/about#manu
...
If you click on the "TriG" tab, you'll see it in RDF. It contains two anonymous graphs:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/foaf-graph> <http://example.org/graphMap> _:b0, _:b1;
  prov:generatedAtTime "2012-04-09T00:00:00"^^xsd:dateTime .
_:b0 {
  <http://manu.sporny.org/about#manu> a foaf:Person;
    foaf:name "Manu Sporny";
    foaf:knows <https://greggkellogg.net/foaf#me> .
}
_:b1 {
  <https://greggkellogg.net/foaf#me> a foaf:Person;
    foaf:name "Gregg Kellogg";
    foaf:knows <http://manu.sporny.org/about#manu> .
}
Except at the very top, @graph always introduces a named graph. When used in the top-most object, it allows multiple nodes to be defined in the default graph. This was a confusing part of the 1.0 spec, and may be more intuitive using the @included keyword.
So iiuc I have a single YAML-LD document containing two graphs, right?
Yes (you can try it in my distiller), but I needed to quote the datetime value.
@gkellogg I think the simplest thing to do is to just map a YAML-LD stream to a sequence of JSON-LD files. If there's no analogy with JSON-LD, I don't think we should force an RDF structure on YAML streams.
We certainly need a case for when a stream contains just a single document for the API methods to operate upon. We don't have a model for how to run API methods over multiple documents in a single go.
In the toRdf case, I would expect that each document would be processed to generate statements and the statements would all go into a single dataset. There is no reverse operation for a dataset resulting in multiple documents.
For the other cases, we could say that the API is run on each document, in turn, and the result is a stream containing the result of processing each document.
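The two behaviours described above could be sketched roughly as follows. Nothing here is a real JSON-LD API: `process_stream`, `to_rdf_stream`, `fake_compact`, and `fake_to_rdf` are all hypothetical stand-ins, and statements are modelled as plain tuples.

```python
def process_stream(documents, api_method):
    """Apply one API method to each document in turn, yielding a result stream."""
    for doc in documents:
        yield api_method(doc)


def to_rdf_stream(documents, to_rdf):
    """Run a toRdf-like function per document and merge into a single dataset."""
    dataset = set()
    for doc in documents:
        dataset.update(to_rdf(doc))
    return dataset


def fake_compact(doc):
    # Stand-in for a real API method; it only wraps the document.
    return {"processed": doc}


def fake_to_rdf(doc):
    # Stand-in for toRdf: one (subject, predicate, object) tuple per key.
    return {(doc["@id"], key, value) for key, value in doc.items() if key != "@id"}


stream = [
    {"@id": "http://example.org/alice", "name": "Alice"},
    {"@id": "http://example.org/bob", "name": "Bob"},
]

results = list(process_stream(stream, fake_compact))  # one result per document
dataset = to_rdf_stream(stream, fake_to_rdf)          # one merged dataset
```

Note the asymmetry: `process_stream` keeps document boundaries, while `to_rdf_stream` discards them, which is why there is no reverse operation from a dataset back to multiple documents.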
It's really the case of the stream being turned into JSON, or the results of processing being rendered as JSON where we don't have a stream model. If we are to cover this case, without introducing something like NDJSON-LD, the expected result would probably look like how JSON-LD handles multiple HTML script elements containing JSON-LD, or it is simply left unspecified.
or it is simply left unspecified
I think it's a reasonable choice since we don't currently have user feedback. We could provide further specifications on that. For example, a YAML file could convey a JSON-LD together with one or more frame files.
Q: is a file containing a JSON-LD Frame, a .jsonld file? Do json-ld frames have a specific media type/file extension?
Yes, that’s the convention. IIRC, there is an HTTP profile parameter that can be used to identify a frame document, but it is never used in practice.
@context first in file" and "put @id first in object". So it doesn't rely on newline markers for streaming, but on a slightly restricted JSON structure.
- @ioggstream @gkellogg Do we expect the chunks (NDJSON lines, YAML streams) to share a common part that e.g. specifies the tags and context? Or would they repeat it in every chunk?
Currently, we don't really deal with streams. An extension for something like NDJSON-LD might be an interesting topic for TPAC.
@gkellogg sorry for the delay. Will the streams topic be left as unspecified and deferred to further documents/wg?
A necessary start for this is a description of NDJSON-LD, for which we've setup a repo (thanks @pchampin) https://github.com/json-ld/ndjson-ld. @niklasl has done some related work and volunteered to get a start on a specification. Given that, we can refer to it from YAML-LD.
Ok. We are in WGLC for YAML media types, and we are waiting for IETF media type feedback. I really hope that when YAML is registered (e.g. hopefully before 2023/06) we are ready to file the .yamlld registration, even if only with preliminary work that enables its basic usage for IDEs and content negotiation.
We created https://github.com/json-ld/ndjson-ld to work on a specification for NDJSON-LD; @niklasl volunteered to get it started.
It should be fairly straightforward, just delegating the various API calls to each line from the NDJSON document, and imposing a serialization requirement on the result that each line be serialized without additional whitespace. This could, perhaps, just use JCS, but that might be overkill.
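As a rough illustration of that delegation, assuming nothing about what the NDJSON-LD spec will eventually say: the hypothetical `process_ndjson` below runs an arbitrary callable on each line and re-serializes each result on a single line with no extra whitespace. It uses plain `json.dumps` with compact separators, which is not JCS (keys are not sorted and numbers are not canonicalized).

```python
import json


def process_ndjson(text, api_method):
    """Apply api_method to each NDJSON line; emit one compact record per line.

    Hypothetical helper; api_method stands in for any JSON-LD API call.
    """
    out_lines = []
    # Split on LF only: str.splitlines() would also break on characters such
    # as U+2028, which may legally appear unescaped inside JSON strings.
    for line in text.split("\n"):
        if not line.strip():
            continue  # tolerate blank lines
        result = api_method(json.loads(line))
        out_lines.append(
            json.dumps(result, separators=(",", ":"), ensure_ascii=False)
        )
    return "".join(line + "\n" for line in out_lines)


# Identity function used here in place of a real API method.
print(process_ndjson('{"a": 1}\n\n{"b": [1, 2]}\n', lambda doc: doc), end="")
# → {"a":1}
# → {"b":[1,2]}
```

Because `json.dumps` escapes newlines inside strings when no indent is requested, each output record stays on one line.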
The stumbling blocks in the current spec could then defer and update the API methods from NDJSON-LD.
We may need to consider provisions for operating on YAML as a stream or a document, as is the case for most existing YAML libraries.
The NDJSON issue https://github.com/ndjson/ndjson.github.io/issues/1 goes into some of the areas of divergence. It also notes LDJSON, but IMO, LDJSON-LD would be a bit too much :)
Basically, NDJSON purports to be a living spec, while JSON Lines does not. But, looking through the comments, RFC7464 may actually be the better fit, as it does not restrict the use of newlines within individual JSON records, as RS is not otherwise valid within JSON.
We can dog-shed the naming issue later. We should put many of these issues into the ndjson-ld repo (or whatever we eventually decide to name it).
cc/ @lrosenthol
Question
I've been pointed to JSON Sequences https://datatracker.ietf.org/doc/html/rfc7464
Maybe this can be a related work for converting multi-document YAML Streams to JSON-LD.
WDYT?
@gkellogg @VladimirAlexiev