Base format - Githubissues

gkellogg commented 1 year ago

From json-ld/yaml-ld#63, there are at least three different formats that could serve as the basis for this work:

NDJSON – newline delimited JSON, which prohibits the use of newlines within a given record. Media type is specified as application/x+ndjson, which is not recommended. It purports to be a living spec. Previous support for comments between records, now removed.
JSON Lines – the basis from which NDJSON was forked. Apparently not intended as a spec. No suggested media type, although issue comments suggest application/x+jsonlines and application/x-json-stream. Otherwise, virtually indistinguishable from NDJSON.
LDJSON – Aside from the obvious confusion of LDJSON-LD, I can't find anything definitive other than the Wikipedia entry.
JSON Text Sequences – RFC7464 – This has the obvious advantage of being an RFC with a registered media type application/json-seq. Allows newlines within JSON records, but requires records be proceeded by an ASCII Record Separator (%x1E). A disadvantage is that this is not as amenable to presentation or editing by a common text editor.

gkellogg commented 1 year ago

@pietercolpaert pointed out his implementation of JSON-LD Stream which meets similar goals.

gkellogg commented 1 year ago

Also by @pietercolpaert, Linked Data Event Streams.

gkellogg commented 1 year ago

This issue was discussed in the 2022-10-12 meeting

Gregg Kellogg: https://github.com/json-ld/ndjson-ld/issues/1

Gregg Kellogg: We have a repo for NDJSON-LD, and Nicholas has volunteered to spend time on that. That prompted leonardr to jump in and discuss. It is based upon NDJSON and similar to JSON Lines and a few more similar formats, very similar to each other. It is possible for one spec to handle multiple different formats though only one of them has a well defined media type.

Leonard Rosenthol: I don't have a preference between them. Trying to understand what the use case is? What are we trying to accomplish specifically? Obviously JSON-LD and YAML-LD make sense for this group. Why is it making sense to tackle the problem of JSON Records, and how are we going to solve it?

Gregg Kellogg: YAML is a string format that supports multiple embedded documents in a file. In order to do an LD version of YAML, we need to decide how to deal with them. If there was a JSON-LD line streaming format (or a general streaming format for RDF in general even) it would be useful to do. We'd just process each document in a YAML stream using JSON-LD methods.

Gregg Kellogg: Potentially, any RDF format might have a way to describe a stream of records. One use case might be an open connection for actual streaming of real time data. There's a JSON-LD streaming spec. If you want to operate upon completely separate records it would be useful.

Pierre-Antoine Champin: Another use case comes from the SOLID system. They have performance issues serializing large "containers" (connections). Potentially, having a streaming JSON-LD format could help that.

Gregg Kellogg: https://github.com/json-ld/ndjson-ld/issues/3

Gregg Kellogg: Would be useful to put these into issue 3 to start collecting them. Ultimately, the spec (itself or a companion document) should list a few use cases.

Gregg Kellogg: Nicholas, could you describe for instance your motivation for NDJSON-LD?

Niklas Lindström: We've used a line-based format for internal purposes. Thinking of publishing data in such a format. We had raw dumps we published and just declared every line was a separate JSON-LD document.

Niklas Lindström: We are missing a clear definition though, and it is unspecified how to associate a JSON-LD context with such a document.

Niklas Lindström: For us, HTTP response header specified the context; the document itself didn't.

pietercolpaert commented 1 year ago

Also by @pietercolpaert, Linked Data Event Streams.

Copy from the mailinglist:

LDES (https://w3id.org/ldes/specification) together with TREE (https://w3id.org/tree/specification) are RDF vocabularies to describe collections and members that are part of that collection. An ldes:EventStream is a tree:Collection with immutable members and thus a “log” that always grows.

Simple example of something that is an LDES:

<C1> a ldes:EventStream ;
     tree:member <streetname1-v1>, <streetname1-v2> .

<streetname1-v1> rdfs:label "Station Road" ;
         dcterms:isVersionOf <streetname1> ;
                 dcterms:created "2020-01-01T00:10:00Z"^^xsd:dateTime .
<streetname1-v2> rdfs:label "Station Square" ;
                 dcterms:isVersionOf <streetname1> ;
                 dcterms:created "2021-01-10T00:10:00Z"^^xsd:dateTime .

Using TREE you can then say that this is the first page of the LDES, and that tree:Relations exist towards other pages. Using these relation objects, you can then describe what you can find on that page: e.g., everything later in time than a certain timestamp, or everything within a geospatial area, or all members with an rdfs:label that contains a certain substring, etc.

This means that, in comparison to streamed JSON-LD, that LDES is:

independent from serialization
making containment explicit using RDF: you need to say you’re part of a stream using tree:member and describe the event stream as a kind of dcat:Dataset
using TREE to also describe how structures of interlinked pages publish that event stream, and one stream can be published using multiple “views” (e.g., a Triple Pattern Fragments, a SPARQL endpoint, a substring fragmentation, etc.)

The goal is thus different, and maybe even complementary. I see value in streamed JSON-LD as a convience mechanism to just append JSON to a file and still being able to automatically translate the full file to RDF quads. You need more than just NLDJSON as you also want to be more efficient and or example not repeat the @context on each write.

json-ld / ndjson-ld

Base format #1