ndjson-ld format - Githubissues

VladimirAlexiev commented 3 years ago

Why?

Newline-delimited JSON (line-oriented JSON) is often used in preference to plain JSON because it is streamable and can be processed with line-oriented tools (e.g. grep).

Previous work

specs: http://ndjson.org, https://github.com/ndjson/ndjson-spec, https://jsonlines.org (line-oriented JSON)
Sample data: https://sn-scigraph.figshare.com/ (projects/grants from Dimensions) (line-oriented JSON-LD)
Ontotext is now implementing ndjson-ld input in rdf4j and GraphDB https://github.com/desislava-hristova-ontotext/rdf4j/pull/1
We're also considering output (SPARQL CONSTRUCT serialization as ndjson) but that's trickier

Proposed solution

We're considering MIME type application/x-ld+ndjson (derived from the existing MIME type for JSON-LD application/ld+json and the MIME type of Newline Delimited JSON application/x-ndjson)
We're considering file extensions .ndjsonld, and maybe .jsonl and .ndjson

Considerations for backward compatibility

None?

ericprud commented 3 years ago

@VladimirAlexiev, I found some examples in the specs but didn't find what you were referring to in the "Sample data" link above.

I think streaming JSON would be an excellent tool for long-running SPARQL results and line-oriented is a nice benefit. I guess this is a small step from current JSON results as they already require newlines to be escaped, right?

TallTed commented 3 years ago

NDJSON is apparently also known as all of LDJSON, Line_Delimited_JSON, JSON_Lines, JSON_Streaming, JSONL, ndjson, NDJSON, and Newline_Delimited_JSON -- so this new thing could even be LD-JSON-LD!

Except that JSON-L (or JSONL) is definitely different from NDJSON... And I imagine there are other issues hiding behind the not-quite-synonym list above.

What is the (anticipated?) relationship between ND-JSON-LD (or NDJSON-LD) and JSON-LD (and 1.0, 1.1, etc.)?

Both JSON Lines and Newline Delimited JSON say they're also known by the other name, but as noted above these are different creatures. It's going to be necessary very quickly to clearly define which you're working with (and why not the other), as well as what may happen if the streams are crossed.

How and why is "Newline Delimited JSON-LD" (or is it "Linked Data in ND-JSON"?) related to the 1.2 update of SPARQL, which is the focus of this github project?

It seems to me that ND-JSON-LD should be a distinct project, maybe associated with JSON-LD given their apparent close cousin relationship.

On Media Type...

x- Media Types are generally frowned on these days, for good reason. Which the NDJSON folk know, and haven't done much about (https://github.com/ndjson/ndjson-spec/issues/19, https://github.com/ndjson/ndjson-spec/issues/21).

Media Types with Multiple Suffixes is heading toward RFC status, and application/ld+json already exists, so you might consider application/nd+ld+json, possibly with a synonymous application/ld+nd+json (which would need the apparently stagnant NDJSON project to change from application/x-ndjson to application/nd+json)

If you don't want to pin hopes on Media Types with Multiple Suffixes, you might also consider application/ld+ndjson, and again pushing the NDJSON project to change from application/x-ndjson to application/ndjson ...

Or leave the NDJSON project fallow as it stands, and consider application/ld+x-ndjson, which at least follows the general rules of Media Types, and parallels the existing application/ld+json.

This feels like a lot of frayed ends in search of a knot. That knot may be worthwhile, but I think it should be distinct from SPARQL 1.2.

afs commented 3 years ago

Won't it be application/sparql-results+x-ndjson for SELECT results and application/ld+x-ndjson for CONSTRUCT/DESCRIBE?

From JSON-LD, application/ld+... is about RDF graphs and datasets, and ...+json the concrete syntax choice. (c.f. rdf+xml).

gkellogg commented 3 years ago

It would seem that the appropriate place for this effort would be the JSON-LD CG (AKA the JSON for Linking Data Community Group), although the JSON-LD WG remains as a maintenance group.

Also, note that the WG published the Streaming JSON-LD note, which addresses the need for a streaming serialization format, but in this case by imposing an order on object entries in the serialization; it is not a line format, per se.

At first glance, NDJSON-LD would seem to work well given a context specified out of band, such as via a Link header. That would make it much the same as parsing an outer object containing @context and the values of @graph. Going beyond that, an extension supporting an @context at the top level, either as a URL or as a one-line object, would be straightforward. Nothing would prevent an individual NDJSON line from including @context, either, unless there is some limitation on line length I didn't notice.
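To make the equivalence with an outer @context/@graph object concrete, here is a minimal Python sketch. It is only an illustration: the function name `wrap_as_jsonld`, the example.org context URL, and the sample nodes are assumptions, and buffering every line like this gives up the streaming benefit.

```python
import json
from typing import Iterable, Union


def wrap_as_jsonld(lines: Iterable[str], context: Union[str, dict]) -> dict:
    """Build one JSON-LD document from NDJSON lines that share a context.

    `context` stands in for a context supplied out of band
    (for example, taken from an HTTP Link header).
    """
    nodes = []
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        nodes.append(json.loads(line))
    # Each line becomes one entry of @graph under the shared context.
    return {"@context": context, "@graph": nodes}


# Hypothetical sample lines; none of them carries its own @context.
sample = [
    '{"id": "ex:a", "type": "Thing"}\n',
    '{"id": "ex:b", "type": "Thing"}\n',
]
doc = wrap_as_jsonld(sample, "https://example.org/context.json")
print(json.dumps(doc, indent=2))
```

A real consumer would more likely hand each line plus the shared context to a JSON-LD processor one at a time, rather than materializing the whole graph.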

VladimirAlexiev commented 3 years ago

@ericprud The sample data we have cited in our jira looks like this

{"@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", "type": "MonetaryGrant", "id": "sg:grant.6616389",...\n 
{"@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", "type": "MonetaryGrant", "id": "sg:grant.6616214",...\n ...

It's probably here http://scigraph.downloads.uberresearch.com/archives/current/grants.tar.gz
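Data shaped like the two lines above can be consumed one record at a time. The following is a rough Python sketch of such a reader, not the rdf4j implementation; `iter_ndjson` is a made-up name, and the sample records keep only the three keys visible above (the elided fields are simply left out).

```python
import io
import json
from typing import Iterator, TextIO


def iter_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of `stream`."""
    for line_no, line in enumerate(stream, start=1):
        if not line.strip():
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Report the offending line so a bad record is easy to locate.
            raise ValueError(f"line {line_no}: invalid JSON ({err.msg})") from err


CONTEXT = "https://springernature.github.io/scigraph/jsonld/sgcontext.json"
data = (
    json.dumps({"@context": CONTEXT, "type": "MonetaryGrant", "id": "sg:grant.6616389"}) + "\n"
    + json.dumps({"@context": CONTEXT, "type": "MonetaryGrant", "id": "sg:grant.6616214"}) + "\n"
)

records = list(iter_ndjson(io.StringIO(data)))
for record in records:
    print(record["id"], record["type"])
```

Because every line repeats @context, each record is a self-contained JSON-LD document and could be passed to a JSON-LD parser independently of the others.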

Right now we are considering NDJSON-LD for input,

but you make a good point that a streaming sparql-results-json for SELECT output would also be useful.

In fact, CONSTRUCT output as NDJSON-LD is nontrivial: how would the serializer know which triples to put on each line? How would it know which is the "main loop" of the query, or the "primary key", so to speak?
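One possible answer, offered only as an assumption and not as anything decided in this thread, is to treat the subject as the "primary key" and emit one expanded JSON-LD node object per subject. The Python sketch below uses a hypothetical `triples_to_ndjson` helper and invented example.org data. It also shows the cost: all triples are buffered before the first line is written, since triples for one subject may arrive in any order.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# A triple is (subject IRI, predicate IRI, object), where the object is
# already a JSON-LD value object such as {"@id": ...} or {"@value": ...}.
Triple = Tuple[str, str, dict]


def triples_to_ndjson(triples: Iterable[Triple]) -> Iterator[str]:
    """Group triples by subject and yield one JSON line per subject."""
    by_subject: Dict[str, Dict[str, List[dict]]] = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        by_subject[s][p].append(o)  # buffers everything: not truly streaming
    for s, props in by_subject.items():
        node = {"@id": s}
        node.update(props)
        yield json.dumps(node)


triples = [
    ("http://example.org/a", "http://example.org/name", {"@value": "A"}),
    ("http://example.org/b", "http://example.org/name", {"@value": "B"}),
    ("http://example.org/a", "http://example.org/knows", {"@id": "http://example.org/b"}),
]
lines = list(triples_to_ndjson(triples))
print("\n".join(lines))
```

Grouping by subject flattens everything; it does not nest related nodes, and blank nodes shared between lines would need separate handling.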

@TallTed thanks for the pointers to MIME developments!

@gkellogg thanks for the pointer to Streaming jsonld!

rubensworks commented 3 years ago

Pinging @wouterbeek here regarding NDJSON-LD, as he suggested it a while back here https://github.com/rubensworks/jsonld-streaming-parser.js/issues/64

ericprud commented 3 years ago

There's a longish discussion of media subtypes containing '+' on media-types@ietf.org. (I don't actually think nd+json is viable because people assume that +json means the resource matches RFC 4627, but folks can always relax their standards if they don't mind breaking some stuff.)

afs commented 3 years ago

sparql-results+json is streaming if the fields are in the right order ("head" before "results").
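For comparison, a line-oriented variant of SELECT results could keep the same ordering rule: one line for "head", then one line per binding. This layout is purely hypothetical (no such format is defined anywhere in this thread), and `results_to_ndjson` is an illustrative name; the sketch just re-slices a standard sparql-results+json document.

```python
import json
from typing import Iterator


def results_to_ndjson(results: dict) -> Iterator[str]:
    """Re-emit a sparql-results+json document as head line + binding lines."""
    # "head" goes first so a consumer knows the variables before any rows.
    yield json.dumps({"head": results["head"]})
    for binding in results["results"]["bindings"]:
        yield json.dumps(binding)


select_results = {
    "head": {"vars": ["s"]},
    "results": {
        "bindings": [
            {"s": {"type": "uri", "value": "http://example.org/a"}},
            {"s": {"type": "uri", "value": "http://example.org/b"}},
        ]
    },
}
out_lines = list(results_to_ndjson(select_results))
print("\n".join(out_lines))
```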

Streaming a line format without Content-Length means there can be silent truncation of results.
The absence of Content-Length also interacts with connection management, with some DoS potential from badly behaved clients.

These aren't reasons not to do it; they are things that should be noted in any design. Inside the enterprise is a different environment from the open web.

jaw111 commented 3 years ago

Just to note a real-world use case for newline delimited JSON-LD. For one application we developed, we index suitably framed JSON-LD documents in Elasticsearch where the documents are imported to Elasticsearch as NDJSON. That process uses a Jena model to gather RDF data from various sources (blackboard design pattern), then extracts and frames a sub-graph for resources of a given type.

Whilst it would be nice to be able to get some NDJSON-LD serialization as the result of a SPARQL query directly, I think it would be necessary to have some way to indicate a JSON-LD frame (rather than just a context as @gkellogg suggested) in order to guarantee consistent nesting/embedding in the JSON object structure.

Arguably for our usage the JSON-LD frame IS the query, a SPARQL query is not even needed.

TallTed commented 3 years ago

Streaming a line format without Content-Length means there can be silent truncation of results.

@afs -- I would think that adding a specific termination marker to the syntax would avoid silent truncation without Content-length: -- and including the net line count in the termination marker (at which point, it should be trivially known) would prevent errors from missing lines, though it wouldn't give any good way to recover from such, other than repeating the request and running a diff on the two streams if the second also had some drop-outs...
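A termination marker carrying the line count might look like the Python sketch below. The marker key `ndjson-end` and both function names are invented for illustration, and the sketch assumes that no ordinary record ever uses that key.

```python
import json
from typing import Iterable, Iterator, List

END_KEY = "ndjson-end"  # hypothetical marker key, not part of any spec


def write_with_trailer(records: Iterable[dict]) -> Iterator[str]:
    """Yield one line per record, then a final line holding the record count."""
    count = 0
    for record in records:
        count += 1
        yield json.dumps(record) + "\n"
    yield json.dumps({END_KEY: count}) + "\n"


def read_with_trailer(lines: Iterable[str]) -> List[dict]:
    """Collect records, raising ValueError on truncation or a count mismatch."""
    records: List[dict] = []
    for line in lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        if isinstance(obj, dict) and END_KEY in obj:
            if obj[END_KEY] != len(records):
                raise ValueError(
                    f"expected {obj[END_KEY]} records, received {len(records)}"
                )
            return records
        records.append(obj)
    # The input ran out before the marker arrived: the stream was cut short.
    raise ValueError("stream ended without termination marker")


stream_lines = list(write_with_trailer([{"id": "ex:a"}, {"id": "ex:b"}]))
print(read_with_trailer(stream_lines))
```

Dropping the last line of `stream_lines` makes `read_with_trailer` raise, which is the silent truncation case; dropping a middle line is caught by the count check, though, as noted above, neither check helps with recovery.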

afs commented 3 years ago

Content-Length is understood by HTTP/1.1 libraries and is used by them to reuse connections.

A trailer as protocol-level termination, including end-of-transfer information, would be a good thing. It does not completely replace Content-Length, though.

There is of course HTTP/2 - new protocol work ought to be an abstract design that exploits HTTP/2 features, can also be targeted at other transfer layers, for example, streaming gRPC. HTTP/1.1 may not be able to expose all of that design though improvements like early termination can be fitted.

VladimirAlexiev commented 2 years ago

@jaw111 thanks for the input!

necessary to have some way to indicate a JSON-LD frame

Yes, unless you have #39, #48, #73, #128 :-)

the JSON-LD frame IS the query

I think you're talking GraphQL here :-)

jaw111 commented 2 years ago

I think you're talking GraphQL here :-)

I was not able to come to terms with GraphQL-LD, still prefer SPARQL.

There is definitely some overlap between JSON-LD frames and GraphQL.

VladimirAlexiev commented 6 months ago

Just a note that @butaloto is working to upgrade our NDJSONLD implementation https://github.com/eclipse-rdf4j/rdf4j/issues/2840 to JSONLD 1.1

w3c / sparql-dev

ndjson-ld format #140
