wardi / jsonlines

Documentation for the JSON Lines text file format
http://jsonlines.org

Standard MIME content-type #19

Open pavelnikolov opened 8 years ago

pavelnikolov commented 8 years ago

What do you think about adding a new HTTP content type for jsonlines data? What about application/jsonl?

jbaehr commented 5 years ago

I'd prefer application/json-lines; otherwise it may look like a typo ;-)

In addition to the Media Type, a registered structured suffix may be interesting. In my eyes it would be even more useful, because it allows creating media types like application/vnd.my-company.some-thing+json-lines.

See also: https://www.iana.org/assignments/media-types/ https://www.iana.org/assignments/media-type-structured-suffix/

@wardi have you considered filing a registration for a json-lines Media Type and structured suffix at IANA?
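To illustrate what a structured suffix would give consumers, here is a minimal sketch that splits a media type at its last `+`. The helper name `split_media_type` is made up for this example, and `+json-lines` is only the suffix proposed in this comment, not a registered one.

```python
def split_media_type(media_type):
    """Return (top_level_type, subtype, suffix); suffix is None if absent.

    Hypothetical helper: it drops any parameters (";charset=...") and
    treats everything after the last "+" as the structured suffix.
    """
    essence = media_type.split(";", 1)[0].strip().lower()
    top, _, subtype = essence.partition("/")
    base, plus, suffix = subtype.rpartition("+")
    if not plus:
        # No "+" in the subtype, so there is no structured suffix.
        return top, subtype, None
    return top, base, suffix


vendor = split_media_type("application/vnd.my-company.some-thing+json-lines")
plain = split_media_type("application/json-seq")
```

A generic client that sees the suffix could fall back to line-by-line JSON parsing even if it knows nothing about the vendor-specific subtype.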

karmakaze commented 5 years ago

There is an IETF RFC 7464 for JSON Text Sequences that uses mime type: application/json-seq

It requires prefixing each JSON record with the <RS> control character and ending each JSON record with <LF>.
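For comparison, a small sketch of the two framings, assuming records are serialized with Python's `json.dumps` (which emits no literal newlines by default). The function names are illustrative only.

```python
import json

RS = "\x1e"  # ASCII Record Separator, the json-seq record prefix
LF = "\n"


def encode_json_seq(records):
    """Frame each record as RS + JSON text + LF (json-seq style)."""
    return "".join(RS + json.dumps(r) + LF for r in records)


def encode_json_lines(records):
    """Frame each record as JSON text + LF (JSON Lines style)."""
    return "".join(json.dumps(r) + LF for r in records)


def decode_json_seq(text):
    """Split on RS and parse every non-empty chunk."""
    return [json.loads(chunk) for chunk in text.split(RS) if chunk.strip()]


records = [{"id": 1}, {"id": 2}]
seq = encode_json_seq(records)
jsonl = encode_json_lines(records)
```

With this encoder, dropping the RS characters from `seq` yields exactly `jsonl`, which is why the two formats are so often compared.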

Also see: https://en.wikipedia.org/wiki/JSON_streaming

jbaehr commented 5 years ago

This seems like a duplicate of #9. The whole purpose of the Content-Type header is to communicate the media type.

whlavina commented 3 years ago

The lack of a definitive IANA Media Type for JSON Lines causes some difficulty for those of us using the format. In the interest of pushing the issue, I took the liberty of starting a conversation: https://mailarchive.ietf.org/arch/msg/json/dWMWD0JDa2HiUYjWjLjrQExeIx4/

Perhaps someone here would like to join that thread?

Disclaimer: I am in no way affiliated with the IANA/IETF. I am merely interested in using the format, correctly.

sp4ce commented 1 year ago

@whlavina the response from Tim Bray was the most helpful, and it looks like nothing has happened since then. I'll copy the interesting bit here for reference:

to register a media type you need to link to a stable specification. The contents of https://jsonlines.org/ probably don’t qualify, so the conventional thing would be to write an Internet-Draft which AFAICT would be the same as json-seq only without the leading "ASCII Record Separator (0x1E)" but retaining the trailing \n.

sp4ce commented 1 year ago

I am linking the relevant RFC to suggest a new MIME type for standardisation:

https://www.rfc-editor.org/rfc/rfc6838.html

I propose working on adding the mime type application/jsonl into the standard tree (section 3.1). Adding to the standard tree seems the most convoluted, but also, I think this is where it would fit the best.

Among the two ways they list to get it added to the standard tree:

  1. in the case of registrations associated with IETF specifications, approved directly by the IESG, or

  2. registered by a recognized standards-related organization using the "Specification Required" IANA registration policy [RFC5226] (which implies Expert Review).

I think the second one is the most relevant, which leads to https://www.rfc-editor.org/rfc/rfc5226

https://www.iana.org/form/media-types

frederikb commented 1 year ago

Hi @sp4ce, good to see that someone is leading the way to an actual RFC!

I've noticed that AWS is (apparently) using JSON Lines for one of their products. I haven't seen a description of the actual output to know whether or not it is compatible with JSON Lines. In any case they are using the mime type application/jsonlines. Thoughts on application/jsonl vs. that one?

tim-hitchins-ekkosense commented 1 year ago

AWS claims it's compatible with JSON Lines - it links to the JSON Lines homepage

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport.Output.html

dwaite commented 9 months ago

If there's still interest in doing this, I would recommend an informational-track Internet-Draft (I-D) to describe the jsonlines specification, with an IANA Considerations section registering the media type. The idea is that a draft evolves into an RFC, and potentially into an Internet Standard, over a long evolutionary track.

IETF wants to deal with immutable and permanently available documents, so you will likely need to represent the encoding and parsing requirements authoritatively within the I-D itself, using IETF nomenclature. There are a lot of references for this available, and the JSON Text Sequences RFC is likely an excellent example.

I suspect there will be feedback that some areas are not needed. For example, your UTF-8 encoding rule does not have much left to it once you reference the JSON RFC. That RFC already mandates UTF-8 for everything other than closed ecosystems. At that point, you have to decide whether the "advice" that applications might want to escape strings to work over ASCII transports belongs in an application note on the jsonlines site, and in a broader discussion with the IETF - after all, it would also affect JSON and JSON sequence data over such transports.

Conversely, you may want to be quite a bit more specific for the sake of interoperability, such as whether applications MUST be able to consume \r\n line separators, and what application behavior is mandated/desired if invalid JSON text (including things like lines of just whitespace) is encountered within a stream. Variance in behaviors has led to a lot of security issues - imagine if your security compliance or logging components stopped reading a JSON lines sequence at a newline, while your application logic ignored the blank line and kept going.
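A minimal sketch of that divergence, with two hypothetical readers that disagree only on what a blank line means. Neither function is taken from a real library; the sample stream and field names are invented for illustration.

```python
import json


def read_stop_at_blank(text):
    """Hypothetical reader: treats the first blank line as end of stream."""
    out = []
    for line in text.split("\n"):
        if line.strip() == "":
            break
        out.append(json.loads(line))
    return out


def read_skip_blank(text):
    """Hypothetical reader: silently skips blank lines and keeps going."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


stream = (
    '{"user": "alice", "action": "login"}\n'
    "\n"
    '{"user": "mallory", "action": "delete"}\n'
)
audited = read_stop_at_blank(stream)   # what the logging component sees
executed = read_skip_blank(stream)     # what the application acts on
```

Here the auditing side records one event while the application processes two, which is the kind of gap a specification could close by mandating one behavior.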

finwo commented 9 months ago

What's wrong with what ndjson is trying to implement? Their current standard is application/x-ndjson, which will likely move to application/ndjson in the future when there's more adoption.

https://bugzilla.mozilla.org/show_bug.cgi?id=1603986
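Until one name is registered, a consumer that wants to interoperate has to accept several unofficial names. A rough sketch, assuming only the names mentioned in this thread; the set and the helper `is_line_delimited_json` are illustrative, and a matching name says nothing about how the sender actually framed the payload.

```python
# Names for line-delimited JSON that appear in this thread.
# Assumption: they are treated as aliases here purely for illustration.
LINE_DELIMITED_JSON_TYPES = {
    "application/jsonl",
    "application/json-lines",
    "application/jsonlines",
    "application/x-ndjson",
    "application/ndjson",
}


def is_line_delimited_json(content_type_header):
    """Return True if the Content-Type value names one of the aliases."""
    # Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    essence = content_type_header.split(";", 1)[0].strip().lower()
    return essence in LINE_DELIMITED_JSON_TYPES


accepts_ndjson = is_line_delimited_json("application/x-ndjson; charset=utf-8")
accepts_json = is_line_delimited_json("application/json")
```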

dwaite commented 9 months ago

The x- prefix on a subtype is intended only for private use, e.g. for types with no expectation of interoperability between implementations. In that sense, your application/x-ndjson may conflict with other people's application/x-ndjson, such as presence or absence of a leading [ or of trailing ,, or even someone deciding they might as well send it in Big5 rather than UTF-8.

The lack of an immutable standard (like an RFC with a number) means that ndjson three years from now may make changes along lines like these for robustness, but implementations do not have a clear way to explain what they are compatible with.

There are plenty of commercial products which use vendor and x-prefixed media types, and which do not attempt to define fixed/robust/interoperable behavior. It is a matter of what this project is going for, which is why my first words were "If there's still interest in doing this".

In terms of ramifications, most SDOs (standards development organizations) won't touch dependencies which do not have these and other formalisms, and may use things like publication by another SDO (like the IETF) as a sign of that. That means ndjson/jsonlines may be used in public-facing APIs, but a large category of interoperable standards work either won't touch it or will standardize its own similar effort.

tim-hitchins-ekkosense commented 9 months ago

which will likely move to application/ndjson in the future when there's more adoption

Well that's the problem, it might happen, at some point in the future. Given the usage of JSON lines in various commercial products, we're suggesting we do that formalisation now - or at least start the process very soon!

wardi commented 9 months ago

I'd love to see this.

So do we copy-paste JSON-SEQ https://datatracker.ietf.org/doc/html/rfc7464 without the "ASCII Record Separator (0x1E)"? JSON-SEQ discusses detecting truncated records and continuing a fair bit, all of that could be removed in a new RFC.

Conversely, you may want to be quite a bit more specific for the sake of interoperability, such as whether applications MUST be able to consume \r\n line separators, and what application behavior is mandated/desired if invalid JSON text (including things like lines of just whitespace) is encountered within a stream. Variance in behaviors has led to a lot of security issues - imagine if your security compliance or logging components stopped reading a JSON lines sequence at a newline, while your application logic ignored the blank line and kept going.

Rule 3 in https://jsonlines.org/ mentions that a compliant parser will be able to consume \r\n because \r is ignored as surrounding whitespace by a json parser. Doesn't hurt to repeat it though.

Lines of only whitespace are already invalid by rule 2 in https://jsonlines.org/ , but again it doesn't hurt to make this clear.

To be specific let's say that any line that doesn't parse as valid JSON should be treated as an invalid record but still counts as a record for the purpose of numbering the lines.
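As a rough sketch of that rule, assuming Python's `json` module as the judge of validity (it is slightly more permissive than the JSON RFC, for example it accepts `NaN`). The function name and the result tuple shape are made up for this example.

```python
import json


def parse_json_lines(text):
    """Return a list of (line_number, value, error) tuples, one per line.

    Invalid lines, including whitespace-only ones, are kept as failed
    records so that numbering stays aligned with the physical lines.
    """
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # the final newline terminates the last record
    results = []
    for number, line in enumerate(lines, start=1):
        try:
            # A trailing "\r" is ignored as surrounding whitespace.
            results.append((number, json.loads(line), None))
        except json.JSONDecodeError as exc:
            results.append((number, None, str(exc)))
    return results


sample = '{"a": 1}\r\n   \n{"b": 2}\nnot json\n'
parsed = parse_json_lines(sample)
valid = [(n, v) for n, v, err in parsed if err is None]
invalid_numbers = [n for n, v, err in parsed if err is not None]
```

The record after the whitespace-only line still reports as line 3, so an error message can point at the right place in the file.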

GabenGar commented 9 months ago

Should it count as a record? The whole point of something called JSON Lines is that it stores lines of a well-defined format called JSON, not arbitrary character sequences. Depending on the nature of the malformed data in a line, it might well make all the lines after it invalid and flood logs with parsing-error noise when the offender is a single line (a whole file).

timtjtim commented 9 months ago

So do we copy-paste JSON-SEQ

I think RFCs are copyrighted, so to copy-paste you would need the permission of the original author

whlavina commented 9 months ago

I'm glad to see continued discussion and forward movement. It's interesting to see that YAML just recently (this month) gained IANA media type registration... 22 years after the format was first created. If YAML can do it, JSON Lines can, too! If there's any need for help with the process, maybe we could ask the folks who pushed the YAML RFC?

tim-hitchins-ekkosense commented 9 months ago

Here are the guidelines on how to write an Internet-Draft

https://authors.ietf.org/en/home

darrelmiller commented 5 months ago

@whlavina You folks are welcome to come join the HTTPAPI mailing list https://datatracker.ietf.org/wg/httpapi/about/ and we can chat about a path to registering this media type. This is where the YAML media type registration RFC was created and we are working towards the OpenAPI one also.

There is ongoing discussion about allowing mediatype registrations to happen in the standards tree without necessarily going through the process of writing an RFC for the format. https://www.ietf.org/archive/id/draft-ietf-mediaman-standards-tree-00.html Although, this format might be simple enough that an RFC would be straightforward.

ferdnyc commented 4 months ago

There is ongoing discussion about allowing mediatype registrations to happen in the standards tree without necessarily going through the process of writing an RFC for the format. https://www.ietf.org/archive/id/draft-ietf-mediaman-standards-tree-00.html Although, this format might be simple enough that an RFC would be straightforward.

As of last month, that (expired) draft is replaced by https://www.ietf.org/archive/id/draft-ietf-mediaman-standards-tree-01.html