Downchuck opened 11 years ago
The problem with this is that it's not easy to know how much stuff to throw out, or how much stuff to skip. In particular there's no way to rewind the jv parser's state, but more generally there's no obvious right answer for how to handle malformed inputs.
A better approach would be to tolerate specific malformations, as an option. For example, [1 2 3] should be accepted as [1, 2, 3], [1,] should be accepted as [1], and so on.
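For reference, the strict parser rejects both of these forms today; a quick shell check (assuming any stock jq build):

echo '[1 2 3]' | jq '.'   # parse error: no separators between values
echo '[1,]' | jq '.'      # parse error: trailing comma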
It would be nice to have a spec for "dodgy json", and implement a parser for this wide set of strings that are sorta kinda maybe like json data, possibly as an entirely separate codepath. I don't really want to add a bunch of special cases to the current parser.
@stedolan I added some special cases, but not a lot, to the current parser, and it seems to work fine (valgrind'ed and all). It's kinda neat, actually, though I just realized that it doesn't handle [1 2 3] correctly. Basically, the main dodgy thing I'd like to have an option to tolerate is extra and missing commas, because they're kinda unnecessary. The "on parse error skip input till newline" thing is also fairly neat (I forget whose idea that was; not mine).
I'm not sure what a spec for dodgy JSON would look like, and, in any case, JSON is fairly stable now, so the only further change I can think of would be this: perhaps there will be new extensions for a JSON-like encoding not named JSON, but I'm not sure that you (or I) would want to bother.
So I think the parser should be stable, and this handful of special cases should be OK.
But if you like I won't push these. Or, alternatively, you could give me more direction: should I create a jv_parse_dodgy.c that is mostly a copy of jv_parse.c but with all the special cases (minus the flags)?
@stedolan Yeah, I think you're right that this doesn't belong in jv_parse.c. It belongs either in an external pre-processor, or perhaps libjv should have multiple parsers (considering the plethora of binary-JSONs out there...).
Actually, something like this will be needed for the JSON text sequence RFC (if it gets published). The problem is that if you have writers appending to logfiles, it's possible to end up with truncated writes (yes, even if using O_APPEND; it's a long story). So it's desirable to be able to recover. See the json@ietf.org list archives for more details. An option to recover is not difficult, but it has to be an option.
I just need to throw away all state and continue to the next line.
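Until there is an option for that, a crude workaround is to feed jq one line at a time from the shell, so a parse error only costs the offending line. A sketch, assuming one JSON text per line (input.log is a placeholder):

# Run jq separately on each line; lines that fail to parse are dropped.
while IFS= read -r line; do
    printf '%s\n' "$line" | jq '.' 2>/dev/null
done < input.log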
The version in master has a --seq flag that implements https://tools.ietf.org/html/draft-ietf-json-text-sequence-09, which is basically: each JSON text in the sequence is preceded by an ASCII RS (0x1E) character and followed by a newline, so a parser that hits an error can recover by skipping ahead to the next RS.
This isn't released yet, I know.
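For the curious, the format can be exercised from the shell like this (a sketch; \036 is the octal escape for the RS character, and --seq is the flag described above):

# Emit two RS-prefixed, newline-terminated JSON texts and parse them back.
printf '\036{"a":1}\n\036[1,2,3]\n' | jq --seq '.'
# Note that jq --seq also prefixes each output text with RS, per the draft.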
Just stumbled on this because I was attempting to use jq to process the output of MongoDB, which puts non-JSON stuff in the output:
vagrant@vagrant-ubuntu-trusty-64:~$ echo 'db.runCommand( { serverStatus: 1, workingSet: 1 } )' | mongo --quiet | jq '.'
parse error: Invalid numeric literal at line 7, column 38
The problem is the NumberLong and ISODate stuff that mongo throws in there:
vagrant@vagrant-ubuntu-trusty-64:~$ echo 'db.runCommand( { serverStatus: 1, workingSet: 1 } )' | mongo --quiet | head
{
    "host" : "vagrant-ubuntu-trusty-64",
    "version" : "2.4.9",
    "process" : "mongod",
    "pid" : 9730,
    "uptime" : 3311,
    "uptimeMillis" : NumberLong(3311086),
    "uptimeEstimate" : 1672,
    "localTime" : ISODate("2015-01-07T17:08:06.611Z"),
So this is a possible use case and test case for this feature.
@msabramo - The "j" in "jq" stands for JSON, so unless jq changes its spots, the best way to use jq with "quasi-JSON" may be with the aid of two filters: one to transform the quasi-JSON to JSON, and another to recover the quasi-JSON. Here is a simple filter that might be useful. If there are other non-JSON entities that it should handle to be useful in the MongoDB context, please let us know.
#!/bin/bash
# qjson2json
# Version 0.0.1
# Syntax: $0 [-i | --inverse]
# This is a filter for transforming quasi-JSON to JSON, or with the -i option, for transforming JSON back to quasi-JSON.
#
# Currently, the quasi-JSON may contain non-JSON entities such as these:
#
# NumberLong(3311086)
# ISODate("2015-01-07T17:08:06.611Z")
#
# This script is intended to be useful but is not robust; specifically, it makes two assumptions:
# 1. there are no strings that happen to contain the patterns;
# 2. it is sufficient to transform the first occurrence of each pattern on a line
#
# If these assumptions are met, the round-trip should result in the original file.
case "$1" in
-i | --inverse )
sed -e 's/"\(NumberLong([0-9][0-9]*)\)"/\1/' -e 's/"ISODate(\([0-9][-0-9:.TZ]*\))"/ISODate("\1")/'
exit
;;
esac
sed -e 's/\(NumberLong([0-9][0-9]*)\)/"\1"/' -e 's/ISODate("\([0-9][-0-9:.TZ]*\)")/"ISODate(\1)"/'
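For example, it might be used like this (a sketch; this assumes the script above is saved as qjson2json and made executable, and mongo.out is a placeholder for captured MongoDB output):

# quasi-JSON -> JSON, then query as usual
./qjson2json < mongo.out | jq '.uptimeMillis'
# and back again: the inverse restores the original quasi-JSON
./qjson2json < mongo.out | ./qjson2json -i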
The data model is JSON's, but the input and output formats could vary. That is, we could support multiple similar formats (e.g., one or more of the binary JSON formats, ...).
In a sense jq's format is a sequence of JSON texts, and this led to a new MIME type that should come out soon and is already supported in 1.5rc1 (see the --seq option).
A parser for this MongoDB format should be possible, but the type hinting will be lost on output (though an encoder could restore it on the basis of schema or heuristics). jq's internal type system will not evolve to gain new types (bignums, yes, eventually, but that's not a new type, just a better implementation of numbers, for some value of "better").
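To illustrate the "heuristics" point: once qjson2json has turned the MongoDB entities into strings, a jq filter can restore at least the numeric ones. A sketch, assuming jq 1.5+ (for walk, test, and capture) and the hypothetical mongo.out file from above:

./qjson2json < mongo.out | jq '
  # Visit every value recursively; strings shaped like "NumberLong(N)"
  # become the number N; everything else passes through unchanged.
  walk(if type == "string" and test("^NumberLong\\([0-9]+\\)$")
       then capture("^NumberLong\\((?<n>[0-9]+)\\)$").n | tonumber
       else . end)'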
@pkoppstein you're missing the "Timestamp" function as well.
#!/bin/bash
# qjson2json
# Version 0.0.2
# Syntax: $0 [-i | --inverse]
# This is a filter for transforming quasi-JSON to JSON, or with the -i option, for transforming JSON back to quasi-JSON.
#
# Currently, the quasi-JSON may contain non-JSON entities such as these:
#
# NumberLong(3311086)
# ISODate("2015-01-07T17:08:06.611Z")
# Timestamp(1473407008, 1),
#
# This script is intended to be useful but is not robust; specifically, it makes two assumptions:
# 1. there are no strings that happen to contain the patterns;
# 2. it is sufficient to transform the first occurrence of each pattern on a line
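# 3. each Timestamp(...) value is followed by a comma (the sed patterns below rely on it)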
#
# If these assumptions are met, the round-trip should result in the original file.
case "$1" in
-i | --inverse )
sed -e 's/"\(NumberLong([0-9][0-9]*)\)"/\1/' -e 's/"ISODate(\([0-9][-0-9:.TZ]*\))"/ISODate("\1")/' -e 's/"\(Timestamp.*\)",/\1,/'
exit
;;
esac
sed -e 's/\(NumberLong([0-9][0-9]*)\)/"\1"/' -e 's/\(ISODate(\)"\([0-9][-0-9:.TZ]*\)")/"\1\2)"/' -e 's/\(Timestamp.*\),/"\1",/'
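A quick way to check the round-trip claim (a sketch; input.qjson is a placeholder for a file of MongoDB quasi-JSON satisfying the assumptions above):

# diff prints nothing if the round-trip reproduces the input byte for byte
./qjson2json < input.qjson | ./qjson2json -i | diff - input.qjson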
I also wanted this functionality, specifically to be lenient enough to handle valid JS objects.
My solution was this one-liner (needs NodeJS). Just in case you, fair reader, also have this problem.
echo 'console.log(JSON.stringify('`cat imperfect.json`',null,2));' | node | jq '.'
Sometimes data is dirty; add a flag, --allow-parse-errors, which would skip lines that trigger a parse error in the input stream.