YAML Streams and JSON Sequences

ioggstream commented 1 year ago

Question

I've been pointed to JSON Sequences https://datatracker.ietf.org/doc/html/rfc7464

Maybe this can be a related work for converting multi-document YAML Streams to JSON-LD.

WDYT?

@gkellogg @VladimirAlexiev

anatoly-scherbakov commented 1 year ago

Thank you for pointing this out. This approach is also known as:

NDJSON
Which is forked from JSON Lines, often abbreviated as JSONL.

It seems that a conversion of a JSON Lines stream to a YAML stream, and vice versa, is straightforward.

I have tested yq, but I could not get it to convert a JSON Sequence file into a YAML Stream. I documented the attempt here: https://github.com/mikefarah/yq/discussions/1279

The reverse direction is functional though:

$ cat sample.yaml 
foo: bar
---
baz: kaboom

$ yq -o=json -I=0 . sample.yaml 
{"foo":"bar"}
{"baz":"kaboom"}

So, at least partially, JSON Sequence to YAML Stream mapping is kind of supported in the ecosystem already. Would be interesting to draw a comparison matrix for other tools and libraries.

ioggstream commented 1 year ago

Edited: as @TallTed points out, it's application/json-seq

@anatoly-scherbakov could you please integrate the examples above showing the 0x1E and 0x0A characters when they are present? I have not understood whether the output is a compliant JSON Sequence (mediatype ~application/seq+json~ application/json-seq) or not.

Producing a seq+json might require defining a new media type (e.g. ld+json-seq :DDD). This is another interesting thread. Personally, I don't know whether ++ media types are OK or not, nor am I advocating their use.
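
To make the 0x1E and 0x0A framing visible, here is a minimal Python sketch of RFC 7464 encoding and decoding using only the standard library. The function names `encode_json_seq` and `decode_json_seq` are hypothetical, and the decoder is deliberately naive: it does not attempt the recovery from truncated texts that the RFC describes.

```python
import json

RS = "\x1e"  # ASCII Record Separator, written before every JSON text
LF = "\x0a"  # line feed, written after every JSON text


def encode_json_seq(values):
    """Serialize an iterable of JSON-compatible values as one JSON text sequence."""
    return "".join(RS + json.dumps(value) + LF for value in values)


def decode_json_seq(text):
    """Parse a JSON text sequence back into a list of Python values (naive)."""
    # Splitting on RS leaves an empty first chunk, which is skipped.
    return [json.loads(chunk) for chunk in text.split(RS) if chunk.strip()]


docs = [{"foo": "bar"}, {"baz": "kaboom"}]
encoded = encode_json_seq(docs)
print(repr(encoded))
# prints '\x1e{"foo": "bar"}\n\x1e{"baz": "kaboom"}\n'
```

Output that contains only the newline-terminated lines, with no leading 0x1E, would be newline-delimited JSON rather than application/json-seq.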

TallTed commented 1 year ago

Regarding media types with multiple plus signs, see Draft RFC Media Types with Multiple Suffixes. They are not yet officially permitted or sanctioned by either IANA or the IETF, but at least one such media type registration is pending.

gkellogg commented 1 year ago

Thank you for pointing this out. This approach is also known as:

NDJSON

Which is forked from JSON Lines, often abbreviated as JSONL.

Note that NDJSON/JSONL doesn't make use of the RS character, as required in the RFC. In practice, I've seen neither, and they as yet bear no relationship with JSON-LD.

So, at least partially, JSON Sequence to YAML Stream mapping is kind of supported in the ecosystem already. Would be interesting to draw a comparison matrix for other tools and libraries.

They both seem generally related to the concept of a YAML Stream, but differ from the treatment of multiple JSON script tags in HTML.

@anatoly-scherbakov could you please integrate the examples above showing the 0x1E and 0x0A characters when they are present? I have not understood whether the output is a compliant JSON Sequence (mediatype application/seq+json) or not.

Producing a seq+json might require defining a new media type (e.g. ld+seq+json :DDD). This is another interesting thread. Personally, I don't know whether ++ media types are OK or not, nor am I advocating their use.

Actually application/json+seq, not seq+json, which makes sense to me. But, again, NDJSON doesn't strictly seem to conform to the RFC.

As @TallTed notes, there is a proposal for multiple + media types (for Verifiable Credentials, I think), and I think there is general support for this, but wheels move slowly.

If we constrain ourselves to the JSON-LD internal representation, we don't have a target for a document stream. Of course, we could introduce such by extension, but it doesn't really seem to relate to the -LD use case, without also extending into some notion of multiple graphs, which aren't treated as a Dataset.

TallTed commented 1 year ago

It turns out that JavaScript Object Notation (JSON) Text Sequences a/k/a JSON Text Sequences (but, so far as I can tell, not properly known as JSON Sequences) have the media type application/json-seq -- no plus sign; not a sub-type of JSON (as would be suggested by application/seq+json) nor of some nonexistent "Sequence" (as would be suggested by application/json+seq); and not requiring multiple + at all.

anatoly-scherbakov commented 1 year ago

Thank you for pointing out the difference between the spec and NDJSON. Indeed, they use different separator characters.

I have encountered usages of NDJSON (say, you can process that with jq) and support for that format in commercial systems (say in https://data.world, which is by the way RDF oriented and supports SPARQL).

I haven't seen usages for JSON Sequences RFC before.

ioggstream commented 1 year ago

@anatoly-scherbakov if you could write a one-pager presenting the various JSON sequence alternatives on the market, I will ask the media type folks whether standardizing a different format for JSON is advisable.

This is clearly unrelated to this spec, and this issue should probably be addressed in the YAML media type document.

Wdyt?

anatoly-scherbakov commented 1 year ago

@ioggstream I started to write a little memo about it but found this page in Wikipedia which, it seems, about does the job. Does it?

ioggstream commented 1 year ago

@anatoly-scherbakov I think we can say

This spec does not mandate a way to convert multi-document YAML Streams to a specific format such as json-seq. The implementer is then free to convert a multi-document YAML stream to multiple, separate JSON texts, to a single json-seq file, or to some other custom multi-document JSON format

WDYT?

gkellogg commented 1 year ago

@anatoly-scherbakov I think we can say

This spec does not mandate a way to convert multi-document YAML Streams to a specific format such as json-seq. The implementer is then free to convert a multi-document YAML stream to multiple, separate JSON texts, to a single json-seq file, or to some other custom multi-document JSON format

WDYT?

If we say this, then we can't really have any tests involving multi-document streams, which is fine. If the concept of multi-document streams is important, it would be for JSON-LD as well, and should probably be taken up there. However, I don't really see how it fits with our data model, which already has the concept of named graphs which is a closely related way of dealing with this in the RDF world.

anatoly-scherbakov commented 1 year ago

Yes, certainly.

ioggstream commented 1 year ago

... named graph ...

Can you clarify the relation between a YAML-LD document and a named graph? iiuc a YAML stream can contain multiple documents, and each one can contain multiple named graphs.

gkellogg commented 1 year ago

Named Graphs are often used to describe the content of some particular RDF source, particularly when used in SPARQL. That is why I said that the closest analog to multiple files from the RDF world is probably named graphs. Generally, each separate document would be considered an "RDF Source", and the only system that I can think of that deals with more than one RDF Source is SPARQL.

For example:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?who ?g ?mbox
FROM <http://example.org/dft.ttl>
FROM NAMED <http://example.org/alice>
FROM NAMED <http://example.org/bob>
WHERE
{
   ?g dc:publisher ?who .
   GRAPH ?g { ?x foaf:mbox ?mbox }
}

Where http://example.org/alice and http://example.org/bob represent different endpoints/RDF Sources. (Example taken from 13.2.3 Combining FROM and FROM NAMED in SPARQL 1.1 Query.)

In this view, each document in a YAML-LD stream would be a different RDF Source, although the analogy breaks down as IIUC there is no way to name separate documents in a YAML stream, but we might define a convention, if naming individual documents is deemed important.

ioggstream commented 1 year ago

each document in a YAML-LD stream would be a different RDF Source

How many graphs does the following yaml document from https://w3c.github.io/json-ld-syntax/#example-referencing-named-graphs-using-an-id-map-with-none contain?

---
- "@id": http://example.org/foaf-graph
  http://www.w3.org/ns/prov#generatedAtTime:
    - "@value": 2012-04-09T00:00:00
      "@type": http://www.w3.org/2001/XMLSchema#dateTime
  http://example.org/graphMap:
    - "@graph":
        - "@id": http://manu.sporny.org/about#manu
          "@type":
            - http://xmlns.com/foaf/0.1/Person
          http://xmlns.com/foaf/0.1/name:
            - "@value": Manu Sporny
          http://xmlns.com/foaf/0.1/knows:
            - "@id": https://greggkellogg.net/foaf#me
    - "@graph":
        - "@id": https://greggkellogg.net/foaf#me
          "@type":
            - http://xmlns.com/foaf/0.1/Person
          http://xmlns.com/foaf/0.1/name:
            - "@value": Gregg Kellogg
          http://xmlns.com/foaf/0.1/knows:
            - "@id": http://manu.sporny.org/about#manu  
...

gkellogg commented 1 year ago

If you click on the "TriG" tab, you'll see it in RDF. It contains two anonymous graphs:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/foaf-graph> <http://example.org/graphMap> _:b0,  _:b1;
   prov:generatedAtTime "2012-04-09T00:00:00"^^xsd:dateTime .

_:b0 {
  <http://manu.sporny.org/about#manu> a foaf:Person;
     foaf:name "Manu Sporny";
     foaf:knows <https://greggkellogg.net/foaf#me> .
}

_:b1 {
  <https://greggkellogg.net/foaf#me> a foaf:Person;
     foaf:name "Gregg Kellogg";
     foaf:knows <http://manu.sporny.org/about#manu> .
}

Except at the very top, @graph always introduces a named graph. When used in the top-most object, it allows multiple nodes to be defined in the default graph. This was a confusing part of the 1.0 spec, and may be more intuitive using the @included keyword.

ioggstream commented 1 year ago

So iiuc I have a single YAML-LD document containing two graphs, right?

gkellogg commented 1 year ago

Yes (you can try it in my distiller), but I needed to quote the datetime value.

ioggstream commented 1 year ago

@gkellogg I think the simplest thing to do is to just map a YAML-LD stream to a sequence of JSON-LD files. If there's no analogy with JSON-LD, I don't think we should force an RDF structure on YAML streams.

gkellogg commented 1 year ago

We certainly need a case for when a stream contains just a single document for the API methods to operate upon. We don't have a model for how to run API methods over multiple documents in a single go.

In the toRdf case, I would expect that each document would be processed to generate statements and the statements would all go into a single dataset. There is no reverse operation for a dataset resulting in multiple documents.

For the other cases, we could say that the API is run on each document, in turn, and the result is a stream containing the result of processing each document.

It's really the case of the stream being turned into JSON, or the results of processing being rendered as JSON where we don't have a stream model. If we are to cover this case, without introducing something like NDJSON-LD, the expected result would probably look like how JSON-LD handles multiple HTML script elements containing JSON-LD, or it is simply left unspecified.
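
The per-document processing described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `stream_to_rdf`, `stream_apply`, and `toy_to_rdf` are hypothetical names, the quad shape is simplified, and the toy converter stands in for a real JSON-LD toRdf implementation.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Simplified quad: (subject, predicate, object, graph name); "" means default graph.
Quad = Tuple[str, str, str, str]


def stream_to_rdf(docs: Iterable[dict],
                  to_rdf: Callable[[dict], Iterable[Quad]]) -> Set[Quad]:
    """Run to_rdf on each document and collect all statements in one dataset."""
    dataset: Set[Quad] = set()
    for doc in docs:
        dataset |= set(to_rdf(doc))
    return dataset


def stream_apply(docs: Iterable[dict],
                 api_method: Callable[[dict], dict]) -> List[dict]:
    """Run any other API method on each document in turn; the result is a stream."""
    return [api_method(doc) for doc in docs]


def toy_to_rdf(doc: dict) -> List[Quad]:
    # Hypothetical stand-in for a real toRdf call; emits one quad per document.
    return [(doc["@id"], "http://xmlns.com/foaf/0.1/name", doc["name"], "")]


docs = [
    {"@id": "http://example.org/alice", "name": "Alice"},
    {"@id": "http://example.org/bob", "name": "Bob"},
]
dataset = stream_to_rdf(docs, toy_to_rdf)
results = stream_apply(docs, lambda d: {**d, "seen": True})
```

The sketch ignores blank nodes; a real implementation would presumably need to keep blank node labels from different documents apart when merging them into one dataset.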

ioggstream commented 1 year ago

or it is simply left unspecified

I think it's a reasonable choice since we don't currently have user feedback. We could provide further specifications on that. For example, a YAML file could convey a JSON-LD together with one or more frame files.

Q: is a file containing a JSON-LD Frame, a .jsonld file? Do json-ld frames have a specific media type/file extension?

gkellogg commented 1 year ago

Q: is a file containing a JSON-LD Frame, a .jsonld file? Do json-ld frames have a specific media type/file extension?

Yes, that’s the convention. IIRC, there is an HTTP profile parameter that can be used to identify a frame document, but that never happens in practice.

VladimirAlexiev commented 1 year ago

- Streaming parsing/serialization is an important topic since it is a road towards improved scalability: parsers/serializers that build up a large in-memory model before spitting it out, are very slow on large data
  - Pointed out as consideration in https://github.com/eclipse/rdf4j/issues/3654
- NDJSON is one way (relying on newlines to delineate the chunks)
  - I raised it for possible standardization: https://github.com/w3c/sparql-12/issues/140
  - Ontotext implemented it in rdf4j: https://github.com/eclipse/rdf4j/issues/2840
- https://w3c.github.io/json-ld-streaming/ is another way. That spec says things like "put @context first in file" and "put @id first in object". So it doesn't rely on newline markers for streaming, but on a slightly restricted JSON structure.
  - @rubensworks implemented this in https://github.com/rubensworks/jsonld-streaming-parser.js, https://github.com/rubensworks/jsonld-streaming-serializer.js

@ioggstream @gkellogg Do we expect the chunks (NDJSON lines, YAML streams) to share a common part that eg specifies the tags and context? Or they would repeat it in every chunk?

gkellogg commented 1 year ago

@ioggstream @gkellogg Do we expect the chunks (NDJSON lines, YAML streams) to share a common part that eg specifies the tags and context? Or they would repeat it in every chunk?

Currently, we don't really deal with streams. An extension for something like NDJSON-LD might be an interesting topic for TPAC.

gkellogg commented 1 year ago

Discussed at TPAC F2F

Generally feeling that this may be useful beyond JSON-LD and YAML-LD, and something like an "LD Streams" framework might be useful which this could fit into.

Pierre-Antoine Champin: https://github.com/json-ld/yaml-ld/issues/63

Gregg Kellogg: Touched on this earlier, JSON-LD doesn't have concept of multiple documents. How did we deal w/ YAML streams? Treat each document in there as its own JSON-LD document and process accordingly.

Gregg Kellogg: JSON-LD defined as API, might need sequences API calls and recompose possibly, YAML-LD, compact things in stream, do them in sequence? Seems that this needs to bounce back up to JSON-LD. Is there an analog?

Pierre-Antoine Champin: My concern regarding that, I can see a number of use cases, sensor use case earlier.

Pierre-Antoine Champin: I'm not sure if we can come up with a unique way of dealing with those things. That might be just a lack of imagination.

Benjamin Young: Yes, would like to see this happen, not expressly YAML related... YAML's origin is out of mime documents and email containers, where you were sending a bundle that was all interrelated. First document was foundational, other documents were attachments.

Anatoly Scherbakov: Thank you very much all, I will unfortunately have to leave. It was quite interesting to participate, thank you again!

Benjamin Young: Newline delimited JSON and JSON -- server sent events, event notifications, JSON - what's coming next... YAML multidoc is understood as a unit together. Context documents being sent together in stream. For most newline delimited JSON, interesting things to explore here, what's been done for Link header for example on bare JSON documents.

Pierre-Antoine Champin: YAML streams in the YAML spec: https://yaml.org/spec/1.2.2/#92-streams

Benjamin Young: For example, if you start w/ context, maybe that context applies to everything in the stream. Where should these happen, where shouldn't they happen, this isn't only about YAML.

Gregg Kellogg: There is a broader concept of LD streams, could have applications in JSON-LD, fits in nicely with YAML... but why not other formats, why not NTriple streams? One could argue that NTriples are another multidocument mechanism since all statements stand on their own.

Gregg Kellogg: There might be a notion of stream documents, each element of stream could have its own format. Does each have its own location? Even though this issue isn't about YAML streams, it begs for additional work for LD Streams and until that happens, with regard to this issue... there are two ways forward: 1) YAML-LD is only defined for streams containing a single document, or 2) YAML-LD streams are treated the way multiple script elements are treated.

Phil Archer: Linked Data fragments was similar.

Gregg Kellogg: Yes, it was, had more to do with SPARQL querying...

Benjamin Young: https://linkeddatafragments.org/

Benjamin Young: https://www.w3.org/TR/json-ld11-streaming/

Gregg Kellogg: There was work done on JSON-LD streaming, but specifically took into consideration open pipe on which you were continuously interpreting.

Gregg Kellogg: These are all references that should be considered.

Gregg Kellogg: We can leave this for a future meeting noting this discussion.

Benjamin Young: JSON-LD Streaming note was about parsing a *single* JSON document as it was streamed into the parser (which is very different than a stream of individual JSON-LD docs)--just to clarify relationships.

ioggstream commented 1 year ago

@gkellogg sorry for the delay. Will the streams topic be left as unspecified and deferred to further documents/wg?

gkellogg commented 1 year ago

A necessary start for this is a description of NDJSON-LD, for which we've set up a repo (thanks @pchampin) https://github.com/json-ld/ndjson-ld. @niklasl has done some related work and volunteered to get a start on a specification. Given that, we can refer to it from YAML-LD.

ioggstream commented 1 year ago

Ok. We are in WGLC for YAML media types, and we are waiting for IETF media type feedback. I really hope that when YAML is registered (e.g. hopefully before 2023/06) we are ready to file the .yamlld registration, even if only with preliminary work that enables its basic usage for IDEs and content negotiation.

gkellogg commented 1 year ago

We created https://github.com/json-ld/ndjson-ld to work on a specification for NDJSON-LD; @niklasl volunteered to get it started.

It should be fairly straightforward, just delegating the various API calls to each line from the NDJSON document, and imposing a serialization requirement on the result that each line be serialized without additional whitespace. This could, perhaps, just use JCS, but that might be overkill.
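
As a rough illustration of that delegation (not a proposed design), the Python sketch below applies a caller-supplied function to each NDJSON line and writes each result on a single line without optional whitespace. `process_ndjson` and `add_type` are hypothetical names; `add_type` merely stands in for a JSON-LD API call such as compact or expand. The sketch uses plain `json.dumps` and does not implement JCS.

```python
import json
from typing import Callable, Iterable, Iterator


def process_ndjson(lines: Iterable[str],
                   api_method: Callable[[dict], dict]) -> Iterator[str]:
    """Apply api_method to each NDJSON record and yield compact one-line results."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        result = api_method(json.loads(line))
        # Compact separators drop the optional whitespace; without indent,
        # json.dumps never emits a raw newline, so each result stays on one line.
        yield json.dumps(result, separators=(",", ":"))


def add_type(doc: dict) -> dict:
    # Hypothetical stand-in for a real JSON-LD API method.
    return {**doc, "@type": "Thing"}


source = ['{"@id": "http://example.org/a"}', '{"@id": "http://example.org/b"}']
output = list(process_ndjson(source, add_type))
print("\n".join(output))
```

Because `process_ndjson` is a generator, each result is produced as soon as its input line has been read, rather than after the whole input is consumed.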

The stumbling blocks in the current spec could then defer and update the API methods from NDJSON-LD.

We may need to consider provisions for operating on YAML as a stream or a document, as is the case for most existing YAML libraries.

gkellogg commented 1 year ago

The NDJSON issue https://github.com/ndjson/ndjson.github.io/issues/1 goes into some of the areas of divergence. It also notes LDJSON, but IMO, LDJSON-LD would be a bit too much :)

Basically, NDJSON purports to be a living spec, while JSON Lines does not. But, looking through the comments, RFC7464 may actually be the better fit, as it does not restrict the use of newlines within individual JSON records, as RS is not otherwise valid within JSON.
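
The difference in newline handling can be seen in a small Python sketch. It assumes a naive line-based NDJSON reader (`parse_ndjson`, a hypothetical helper) and a naive RS-based split for json-seq; neither is a complete parser.

```python
import json

RS = "\x1e"  # ASCII Record Separator used by RFC 7464

record = {"foo": "bar", "nested": {"baz": "kaboom"}}
pretty = json.dumps(record, indent=2)  # pretty-printed, so it contains newlines

# json-seq framing: RS before the text, LF after; inner newlines are harmless.
seq = RS + pretty + "\n"
from_seq = [json.loads(chunk) for chunk in seq.split(RS) if chunk.strip()]


def parse_ndjson(text):
    """Naive NDJSON reader: every non-empty line must be a complete JSON text."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


try:
    parse_ndjson(pretty + "\n")
    ndjson_ok = True
except json.JSONDecodeError:
    ndjson_ok = False  # the first line is just "{", which is not valid JSON

# A raw RS cannot appear inside a JSON text: control characters in strings
# must be escaped, and json.dumps writes this one as \u001e.
escaped = json.dumps("a\x1eb")
```

Here `from_seq` recovers the original record, while the line-based reader fails on the same pretty-printed text.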

We can bike-shed the naming issue later. We should put many of these issues into the ndjson-ld repo (or whatever we eventually decide to name it).

cc/ @lrosenthol

gkellogg commented 1 year ago

This issue was discussed in the 2022-10-12 meeting.

Subtopic: YAML Streams and JSON Sequences yaml-ld#63 ✪

Gregg Kellogg: That's what pushed NDJSON-LD

Gregg Kellogg: Roberto proposes to map a YAML-LD to a sequence of JSON-LD files

Gregg Kellogg: Proposing to update the spec with a hypothetical mapping to NDJSON-LD so that we can start to flesh out the missing components of the spec right now. I will spend some time on that.

Leonard Rosenthol: Does this only apply to streams, or also for a YAML-LD file that contains multiple documents?

Gregg Kellogg: In YAML, stream is a sequence of documents separated by "---". This has a well defined meaning within YAML. In YAML-LD spec, part of the process is to convert YAML-LD into Internal Representation, which includes splitting stream into individual documents.

Gregg Kellogg: What if a stream contains a single document? Does it yield that document, or a stream with that document? For NDJSON-LD probably that's the latter, and for YAML-LD this might depend upon HTTP media type or an API method perhaps (different methods for streams vs documents). This is a subject of consideration.

Leonard Rosenthol: Makes sense. I am thinking of this in respect to having physical files more than something else.

Gregg Kellogg: In a file representation or, say, in a multipart/MIME email, or in a stream where you process records as they come through, this can be hard in an API sense. API endpoints create promises and you might expect the promise to fulfill only once the entire stream is processed. That might not be adequate for a real-time stream. But we might just focus on the "closed" use case and leave the "open stream" use case for later.

Gregg Kellogg: We need to list use cases for both and look at the other W3C work on realtime processing and open data streams to see if we can find any relevance.
