qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/

Add multiple=true() option to fn:parse-json and fn:json-doc #235

Closed michaelhkay closed 7 months ago

michaelhkay commented 2 years ago

It is common practice (though not, I believe, covered by any standard) to have files that contain multiple JSON objects. Often these will be arranged one per line, as in our own qt3tests use case R31 at https://github.com/w3c/qt3tests/blob/master/app/UseCaseR31/sales.json . In that example, the file can be parsed using unparsed-text-lines()!parse-json(). But in the more general case, where each object may itself be multi-line, there's no easy way of handling this.

I propose an option multiple=true() on fn:parse-json and fn:json-doc that enables parsing of an input containing multiple (zero or more) concatenated JSON texts. When this option is present, the result will always be delivered as an array, containing one member for each JSON text in the input. The wrapper array will be present even if the number of JSON texts in the input is zero or one.

If a JSON text ends with a letter or digit and the next JSON text starts with a letter or digit then they must be separated by whitespace.
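For illustration only, the proposed semantics can be sketched in Python rather than XPath. This is a rough sketch under stated assumptions, not how any XPath processor implements it: the helper name parse_json_multiple is hypothetical, and Python's json.JSONDecoder.raw_decode stands in for the JSON parser. The result is always a list, and the letter-or-digit separation rule is checked explicitly.

```python
import json

JSON_WS = " \t\n\r"  # the four whitespace characters in the JSON grammar


def parse_json_multiple(text: str) -> list:
    """Parse zero or more concatenated JSON texts (hypothetical helper)."""
    decoder = json.JSONDecoder()
    results = []
    pos, n = 0, len(text)
    while True:
        # Skip whitespace before, between, and after JSON texts.
        while pos < n and text[pos] in JSON_WS:
            pos += 1
        if pos == n:
            return results  # a list even for zero or one JSON text
        # Separation rule: a text ending in a letter or digit must not be
        # directly followed by a text starting with a letter or digit.
        if pos > 0 and text[pos - 1].isalnum() and text[pos].isalnum():
            raise ValueError(f"whitespace required at offset {pos}")
        # raw_decode returns the value and the offset just past it.
        value, pos = decoder.raw_decode(text, pos)
        results.append(value)


print(parse_json_multiple('{"a": 1}\n{"b":\n 2}'))  # [{'a': 1}, {'b': 2}]
print(parse_json_multiple(""))                      # []
```

Under this sketch, the input 12 true true 5 yields four values, while 12truetrue5 is rejected.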

ndw commented 2 years ago

I think this is a good idea, but I don't understand the last paragraph. What does it mean for a JSON text to end with a letter or digit if there isn't whitespace after it? I infer that

three
four

(Or, I suppose, three four on a single line.)

Would parse as ["three", "four"]. But the only interpretation I can make of the last paragraph is that the input is

threefour

which must surely parse as ["threefour"].

MHK: The only JSON texts that can end in a letter or digit are (a) numbers, or (b) true, false, and null. So you can write 12 true true 5 but you can't write 12truetrue5.

ChristianGruen commented 2 years ago

We should also clarify if "A""B" is a legal input, or if it must be "A" "B". Other cases: "A"{"B":"C"}, []1, …

Maybe, multiple JSON fragments should always be separated by at least one whitespace character?

MHK: that would certainly be simpler.
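As a rough illustration of this simpler rule, here is a Python sketch (not XPath, and not taken from any implementation; the helper name parse_json_whitespace_separated is hypothetical) in which every JSON text after the first must be preceded by at least one whitespace character.

```python
import json

JSON_WS = " \t\n\r"  # the four whitespace characters in the JSON grammar


def parse_json_whitespace_separated(text: str) -> list:
    """Parse JSON texts that must be separated by whitespace (hypothetical helper)."""
    decoder = json.JSONDecoder()
    results = []
    pos, n = 0, len(text)
    while True:
        start = pos
        while pos < n and text[pos] in JSON_WS:
            pos += 1
        if pos == n:
            return results
        # Every text after the first needs at least one whitespace
        # character in front of it.
        if results and pos == start:
            raise ValueError(f"whitespace required at offset {pos}")
        value, pos = decoder.raw_decode(text, pos)
        results.append(value)


print(parse_json_whitespace_separated('"A" "B"'))  # ['A', 'B']
```

With this rule, "A""B", "A"{"B":"C"} and []1 are all rejected, so no special case for letters and digits is needed.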

dnovatchev commented 2 years ago

So, what separator will be used between the JSON (objects') texts?

Just assuming "whitespace separation" seems to be rather error-prone. If the input contains a syntactically invalid JSON text, then this could be parsed, quite misleadingly, as two or more JSON objects.

Or am I completely misunderstanding this?

michaelhkay commented 2 years ago

There's no ambiguity. The top level construct in the JSON grammar is an object, array, string, number, or the keyword true, false, or null. When you've read one of those, the only thing that can follow at present is EOF. This proposal changes this so instead of EOF, you can have another top-level construct.

You'll find lots of people asking how to parse files that contain multiple JSON objects. The nearest thing to a standard is "json lines" - https://jsonlines.org - which holds one JSON value per line; but there's no reason to restrict it that way, it's just as easy to allow multiple JSON values (which may contain newlines) separated by arbitrary whitespace.

It's true that erroneous JSON might be mis-parsed. But this mode of parsing won't be the default, so people will only use it if this is the input format they need to handle.

benibela commented 2 years ago

JSONiq had a jsoniq-multiple-top-level-items option for that.

The liberal option could parse anything.

dnovatchev commented 2 years ago

"It's true that erroneous JSON might be mis-parsed. But this mode of parsing won't be the default, so people will only use it if this is the input format they need to handle."

Maybe it would be even better to allow the user to specify a particular delimiter string between JSON documents (defaulting to whitespace), so that the chance of such accidental errors is minimized?

ChristianGruen commented 1 year ago

In terms of coherence, a dedicated fn:parse-json-fragments may be the better choice (unless we add a multiple option for fn:parse-xml and fn:doc).

michaelhkay commented 7 months ago

I propose to drop this issue, on the grounds that parsing of JSON Lines input can be readily achieved using

array{unparsed-text-lines($input) =!> parse-json()}

Note that JSON Lines does NOT allow multiple arbitrary JSON texts to be simply concatenated with newline separators as suggested in the original proposal. Each line of the input has to be a JSON text, which means newlines can only be used to separate JSON texts, not to separate tokens within a JSON text. It's therefore possible to start by splitting the input into lines, and then parsing each line.
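The split-then-parse approach can be illustrated with a short Python sketch (an analogy to the XPath expression, not a translation of it; the helper name parse_json_lines is hypothetical). It splits on newline characters only, since Python's str.splitlines also breaks on characters that may legally appear unescaped inside JSON strings.

```python
import json


def parse_json_lines(text: str) -> list:
    """Parse JSON Lines input: one JSON text per line (hypothetical helper)."""
    lines = text.split("\n")
    # A trailing newline leaves one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    # json.loads ignores a trailing "\r", so CRLF line endings also work.
    return [json.loads(line) for line in lines]


print(parse_json_lines('{"a": 1}\n[2, 3]\n'))  # [{'a': 1}, [2, 3]]
```

A JSON text that spans several lines, such as an object with a newline after the colon, makes json.loads raise an error here, which is exactly the restriction described above.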

ndw commented 7 months ago

The CG agreed to close this issue without further action at meeting 074.