wardi / jsonlines

Documentation for the JSON Lines text file format
http://jsonlines.org

jsonlines.org and ndjson.org #22

Open max-mapper opened 7 years ago

max-mapper commented 7 years ago

Hey, I noticed that http://ndjson.org/ and http://jsonlines.org/ are very similar. I was wondering whether they could link to each other to reduce confusion? I like both names personally and use them interchangeably.

cc @chrisdew

wardi commented 7 years ago

This site links to ndjson from http://jsonlines.org/on_the_web/ and ndjson.org links back here from its footer. Is that not sufficient?

karmakaze commented 6 years ago

In the interest of converging on a single standard, would it not be beneficial for these two sites to coordinate and agree on items, and ideally just be one .org? Having two similar sites each promoting an emerging 'standard' with differences gives a sense it's not ready for interoperation.

glensc commented 3 years ago

created similar issue in the ndjson repo:

pekkaklarck commented 3 years ago

:+1: for just one standard and one web site. Based on a quick look, there aren't any real differences except for the extension (.jsonl vs .ndjson). Having a common extension would make it more likely that editors and IDEs support this format without extra configuration.

onacit commented 2 years ago

Cross-linking is not sufficient. Anyone would google the two names with "vs.", which might lead here.

I'm sure the author was not confused. :) But I'm confused.

This format is specified at ndjson.org and documented at the JSON Lines website.

- https://en.wikipedia.org/wiki/JSON_streaming

jsejcksn commented 2 years ago

Observation: If repository issue activity is any metric for discoverability, then JSON Lines has an advantage.

In the interest of converging on a single standard, would it not be beneficial for these two sites to coordinate and agree on items, and ideally just be one .org? Having two similar sites each promoting an emerging 'standard' with differences gives a sense it's not ready for interoperation.

I understand that, historically, there were spec differences due to potential ambiguity (UTF-8 encoding, required JSON data on every line, etc.), but it seems as though they are now aligned. At this point in time, are there any remaining spec differences? And are there any other issues which are preventing convergence (e.g. copyright credit, etc.)? I think the community will greatly benefit from a single, unified standard with an RFC, registered IANA media type, etc. The involved parties appear to be reasonable and responsive. Can we make this happen?

wardi commented 2 years ago

I prefer the name "JSON lines" because that seemed like the obvious name to me :-) but, the ndjson folks did go the extra mile and write a spec.

If we're fully aligned I like the idea of settling on a single name. Is there an unbiased measure we can use for deciding?

jsejcksn commented 2 years ago

Is there an unbiased measure we can use for deciding?

@wardi Names are names and will always be arbitrary/subjective. 😅 I think it's just up to the party that submits the RFC and registers. IMO, a unified standard with either name is better than two ambiguously identical alternatives.

remram44 commented 2 years ago

The ndjson repo hasn't seen any maintainer activity in years. That makes it both impossible to pick this and have them redirect, and a bad idea to pick them and redirect from here.

pekkaklarck commented 1 year ago

The owner of the ndjson domain seems to be fine going forward with jsonlines.org.

stokito commented 1 year ago

This is a mess. Let's finally reach a decision. My proposal is to take the existing JSON Text Sequences spec (RFC 7464) and extend it: add a .jsonl file extension and make both the RS character and the LF optional.

A good overview of all streaming formats https://en.wikipedia.org/wiki/JSON_streaming

Its basic idea is to have "unambiguous JSON" that is resilient to many forms of damage, such as truncation, multiple writers incorrectly configured to write to the same file, corrupted JSON, etc. An example sequence:

    ␞{"d":"2014-09-22T21:58:35.270Z","value":6}␤
    ␞{"d":"2014-09-22T21:59:15.117Z","value":12}␤

From the spec:

 Phillip Hallam-Baker proposed the use of JSON text sequences for logfiles and pointed out the need for resynchronization. Stephen Dolan created https://github.com/stedolan/jq, which uses something like JSON text sequences (with LF as the separator between texts on output, and requiring only such whitespace as needed to disambiguate on input). Carsten Bormann suggested the use of ASCII RS, and Joe Hildebrand suggested the use of LF in addition to RS for disambiguating top-level number values.
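To make the resynchronization idea concrete, here is a minimal Python sketch of reading an RS-delimited sequence like the example above. The function name `parse_json_seq` and the sample data are made up for illustration, and the sketch simplifies things: it only skips records that fail to parse and does not implement every rule of RFC 7464 (for instance, it does not check for the trailing LF after top-level numbers).

```python
import json

RS = "\x1e"  # ASCII Record Separator, rendered as ␞ in the example


def parse_json_seq(text):
    """Yield each JSON text from an RS-delimited sequence, skipping damaged ones."""
    for chunk in text.split(RS):
        if not chunk.strip():
            continue  # nothing before the first RS, or an empty record
        try:
            yield json.loads(chunk)
        except json.JSONDecodeError:
            continue  # truncated or corrupted record: resume at the next RS


# Hypothetical input: the middle record was cut off mid-write.
data = (
    RS + '{"d":"2014-09-22T21:58:35.270Z","value":6}\n'
    + RS + '{"d":"2014-09-22T21:59'
    + RS + '{"d":"2014-09-22T21:59:15.117Z","value":12}\n'
)
records = list(parse_json_seq(data))
```

Because every record starts with RS, a damaged record costs only itself: the reader picks up again at the next RS.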

So basically, for the simplest case, when I know that the data is not corrupted, I can simply use concatenated JSON. I can use line separators too, and they'll just be ignored as in ordinary JSON. The only requirement is that the parser accepts multiple documents.

Example 1:

{"id":1}{"id":2}

Example 2: two documents but formatted with a newline

{
  "id":1
}
{
  "id":2
}
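A parser that accepts multiple documents, as in Examples 1 and 2, can be sketched in Python with the standard library's `json.JSONDecoder.raw_decode`, which decodes one document and reports where it ended. The function name `parse_concatenated` is made up for illustration, and the sketch assumes the input is not corrupted.

```python
import json


def parse_concatenated(text):
    """Yield consecutive JSON documents from one string."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        doc, pos = decoder.raw_decode(text, pos)
        yield doc


example1 = '{"id":1}{"id":2}'
example2 = '{\n  "id":1\n}\n{\n  "id":2\n}\n'
docs1 = list(parse_concatenated(example1))
docs2 = list(parse_concatenated(example2))
```

Both inputs decode to the same two documents, since whitespace between documents is ignored.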

If some JSON documents may be corrupted, a newline can be used as the separator. The problem then is telling apart a newline used only for formatting from a newline that splits two documents.

Example 3: the first document is broken and has no closing bracket, but the \n still allows the two to be split

{"id":
{"id":2}

Example 4: first doc is broken, then newline, and the second doc is formatted with a newline

{"id":
{
  "id": 2,
  "props": {
    "prop1": 1,
    "prop2": 2
  }
}

But visually we can still tell where the first doc ends and the second starts, so a simple rule works: the sequence \n{ (that is, a { at the start of a line with no indentation) begins the next document. When a \n is followed by spaces, the current document continues until its closing bracket is found. I think this simple rule should work almost always. In any case, indented JSON makes little sense for JSON streaming and is not expected.
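The \n{ rule can be sketched in Python as follows. The helper names `split_documents` and `parse_tolerant` are hypothetical. The sketch only handles top-level objects, and it would not split documents concatenated on a single line.

```python
import json


def split_documents(text):
    """Split on the heuristic that an unindented '{' starts a new document."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("{") and current:
            chunks.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def parse_tolerant(text):
    """Return (documents, broken_chunks) for text that may contain damaged JSON."""
    docs, broken = [], []
    for chunk in split_documents(text):
        try:
            docs.append(json.loads(chunk))
        except json.JSONDecodeError:
            broken.append(chunk)
    return docs, broken


# Example 4: a broken first document, then a pretty-printed second one.
example4 = (
    '{"id":\n'
    '{\n  "id": 2,\n  "props": {\n    "prop1": 1,\n    "prop2": 2\n  }\n}\n'
)
docs, broken = parse_tolerant(example4)
```

Here the broken first line is reported separately, while the indented second document still parses as a whole.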

If I need top-level values, the RS may be used optionally. Whether to use the RS is something the producer can decide. A parser can simply be configured to require the RS if it expects top-level values or broken data, that is, if it needs "unambiguous JSON". In other words, this should be an option of the format, not a requirement. To me, the RS at the beginning still makes little sense for disambiguation, because with threading issues you may just get lines intermixed. It looks like overengineering. But it probably came from real-world usage and problems, so I'm not sure.

@nicowilliams you are the author of RFC 7464. Please give us your thoughts. Is it possible to publish errata for the spec?

cc: @hoegertn @finnp @wardi

Related: the idea of using application/json-seq as the MIME type for JSONL was already discussed in #19

The file extension: both ndjson and jsonl are easy to google. jsonl is easier to pronounce and easier to read at first sight, and jsonl files will sort more naturally alongside existing json files. The MIME type is json-seq, so a file extension of jsons would be more appropriate, but it may cause confusion in conversation. So IMHO the existing jsonl is the better choice.

wardi commented 1 year ago

@stokito updating RFC 7464 as you describe sounds good to me.

sp4ce commented 1 year ago

Could we include the MIME type application/jsonl, which seems to be used already by others and is suggested in https://github.com/wardi/jsonlines/issues/19?

sp4ce commented 1 year ago

@stokito

From the issue at https://github.com/wardi/jsonlines/issues/65#issue-1604557768, I don't think jsonlines is heading in any direction that would allow incomplete records, empty lines, or other kinds of line breaks that don't separate valid JSON records.

I am not sure amending RFC 7464 would be valid in that context. The examples you gave seem to allow that.

To me, streaming JSON is a whole other problem. I think jsonlines is about a succession of valid JSON values, like a succession of API calls for batching input or reading the results of some process (for example, we've been using it with Amazon Comprehend to manage training corpora and recognition job inputs).

ciscorucinski commented 1 year ago

Imagine the file extension being a format like .lines.json or .stream.json

Taking inspiration from:

.stream.json keeps with the idea that .x.y means it is a y file but for x

remram44 commented 1 year ago

The difference is that a .gradle.kts file is a valid .kts file, and a .tar.gz is a valid .gz file. A .lines.json file is not a valid JSON file, since it contains multiple JSON objects. It needs to be split before it yields valid JSON documents.

So .json.lines would make more sense if anything.
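The point that a multi-document file is not itself valid JSON is easy to check. Here is a small Python sketch with made-up sample content:

```python
import json

# Hypothetical content of a two-record lines file.
lines_file = '{"id":1}\n{"id":2}\n'

try:
    json.loads(lines_file)
    whole_file_is_json = True
except json.JSONDecodeError:
    # A standard JSON parser rejects the second document as extra data.
    whole_file_is_json = False

# After splitting on line breaks, each piece is a valid JSON document.
documents = [json.loads(line) for line in lines_file.splitlines()]
```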

ciscorucinski commented 1 year ago

Point taken.

I would still be for .json.stream. It's a higher-level concept that fits all current json streaming formats (I mean the concept is already called streaming).

A good overview of all streaming formats https://en.wikipedia.org/wiki/JSON_streaming

  • Plain concatenated JSON. Each opening bracket must be paired with a closing bracket. There is no spec for this.
  • NDJSON: separator \n (LF); parsers also accept \r\n. File ext: .ndjson, MIME: application/x-ndjson
  • JSON Lines: separator \n; parsers also accept \r\n. File ext: .jsonl, MIME: none
  • RFC 7464: File ext: none, MIME: application/json-seq, registered with IANA. Additionally, it uses an RS character.
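For the two line-based formats in the list, a reader can be sketched in a few lines of Python. The function name `read_json_lines` is made up for illustration. The sketch takes the strict view that every line must hold one valid JSON value, so a blank or damaged line raises an error rather than being skipped.

```python
import io
import json


def read_json_lines(stream):
    """Yield one JSON value per line; LF separates records, CRLF is tolerated."""
    for line in stream:
        # Drop the line terminator, whether it is \n or \r\n.
        yield json.loads(line.rstrip("\r\n"))


# Hypothetical sample: the second record ends with \r\n.
sample = io.StringIO('{"id":1}\n{"id":2}\r\n[1,2]\n')
values = list(read_json_lines(sample))
```

The same function works on a file opened in text mode, since it only needs an iterable of lines.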

Anyways, just throwing this out. Glad this concept has been seen. Seems like all emoji interactions like the concept, but just preferred it swapped around. I'm completely down for that.

remram44 commented 1 year ago

I would rather see an extension that specifically says which it is. We don't use .img for PNG, JPG, BMP, and TIF. Similarly I think those 4 (well, 3) different formats should have different extensions.