Closed michaelhkay closed 7 months ago
I think this is a good idea, but I don't understand the last paragraph. What does it mean for a JSON text to end with a letter or digit if there isn't whitespace after it? I infer that
three
four
(or, I suppose, three four on a single line) would parse as ["three", "four"]. But the only interpretation I can make of the last paragraph is that the input is threefour, which must surely parse as ["threefour"].
MHK: The only JSON texts that can end in a letter or number are (a) numbers, or (b) true, false, and null. So you can write 12 true true 5 but you can't write 12truetrue5.
We should also clarify if "A""B" is a legal input, or if it must be "A" "B".
Other cases: "A"{"B":"C"}, []1, …
Maybe, multiple JSON fragments should always be separated by at least one whitespace character?
MHK: that would certainly be simpler.
So, what separator will be used between the JSON (objects') texts?
Just assuming "whitespace separation" seems to be rather error-prone. If the input contains a syntactically invalid JSON text, then this could be parsed, quite misleadingly, as two or more JSON objects.
Or am I completely misunderstanding this?
There's no ambiguity. The top level construct in the JSON grammar is an object, array, string, number, or the keyword true, false, or null. When you've read one of those, the only thing that can follow at present is EOF. This proposal changes this so instead of EOF, you can have another top-level construct.
You'll find lots of people asking how to parse files that contain multiple JSON objects. The nearest thing to a standard is "json lines" - https://jsonlines.org - which holds one JSON value per line; but there's no reason to restrict it that way, it's just as easy to allow multiple JSON values (which may contain newlines) separated by arbitrary whitespace.
It's true that erroneous JSON might be mis-parsed. But this mode of parsing won't be the default, so people will only use it if this is the input format they need to handle.
JSONiq had a jsoniq-multiple-top-level-items option for that.
liberal could parse anything
> It's true that erroneous JSON might be mis-parsed. But this mode of parsing won't be the default, so people will only use it if this is the input format they need to handle.
Maybe it would be even better to allow the user to specify a particular JSON-document string delimiter (with default some whitespace) so that the chance for such accidental errors could be minimized?
In terms of coherence, a dedicated fn:parse-json-fragments may be the better choice (unless we add a multiple option for fn:parse-xml and fn:doc).
I propose to drop this issue, on the grounds that parsing of JSON Lines input can be readily achieved using
array{unparsed-text-lines($input) =!> parse-json()}
Note that JSON Lines does NOT allow multiple arbitrary JSON texts to be simply concatenated with newline separators as suggested in the original proposal. Each line of the input has to be a JSON text, which means newlines can only be used to separate JSON texts, not to separate tokens within a JSON text. It's therefore possible to start by splitting the input into lines, and then parsing each line.
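For readers more familiar with general-purpose languages, the split-then-parse approach described above can be sketched in Python. This is only an illustration, not part of the XPath proposal: the helper name parse_json_lines is hypothetical, and skipping blank lines is an assumption rather than something the thread specifies.

```python
import json


def parse_json_lines(text: str) -> list:
    """Parse each line as one complete JSON text (hypothetical helper).

    Splits on "\n" only; a trailing "\r" is JSON whitespace and is
    tolerated by json.loads. Blank lines are skipped (an assumption).
    """
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

Because every line must be a complete JSON text, an object spread over several lines fails here: the first line alone is not valid JSON, so json.loads raises json.JSONDecodeError.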
The CG agreed to close this issue without further action at meeting 074.
It is common practice (though not, I believe, covered by any standard) to have files that contain multiple JSON objects. Often these will be arranged one per line, as in our own qt3tests use case R31 at https://github.com/w3c/qt3tests/blob/master/app/UseCaseR31/sales.json . In that example, the file can be parsed using unparsed-text-lines()!parse-json(). But in the more general case, where each object may itself be multi-line, there's no easy way of handling this.

I propose an option multiple=true() on fn:parse-json and fn:json-doc that enables parsing of an input containing multiple (zero or more) concatenated JSON texts. When this option is present, the result will always be delivered as an array, containing one member for each JSON text in the input. The wrapper array will be present even if the number of JSON texts in the input is zero or one.

If a JSON text ends with a letter or digit and the next JSON text starts with a letter or digit then they must be separated by whitespace.
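
The semantics proposed above can be approximated in Python as a rough sketch. The function name parse_multiple is hypothetical, the code relies on the standard library's json.JSONDecoder.raw_decode, and it makes no claim about how an XPath processor would implement the option. It always returns a list (standing in for the wrapper array) and enforces the letter-or-digit separation rule explicitly, since raw_decode on its own would happily read 12truetrue5 as four values.

```python
import json

_WS = " \t\n\r"  # the four whitespace characters allowed by the JSON grammar


def parse_multiple(text: str) -> list:
    """Parse zero or more concatenated JSON texts (hypothetical helper)."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    prev_end = None  # index just past the previous JSON text, if any
    n = len(text)
    while True:
        # Skip whitespace before, between, and after JSON texts.
        while pos < n and text[pos] in _WS:
            pos += 1
        if pos == n:
            break
        # Rule from the proposal: whitespace is required when one text ends
        # with a letter or digit and the next starts with a letter or digit.
        if prev_end == pos and text[pos - 1].isalnum() and text[pos].isalnum():
            raise ValueError(f"whitespace required between JSON texts at offset {pos}")
        # raw_decode parses one JSON text starting at pos and reports where it ended.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
        prev_end = pos
    return values
```

Under this sketch, 12 true true 5 yields four members, "A""B" and []1 are accepted without a separator, an empty input yields an empty list, and 12truetrue5 is rejected. Requiring whitespace between all JSON texts, as suggested in the discussion, would only need the condition relaxed to prev_end == pos.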