Downchuck opened 11 years ago
The problem with this is that it's not easy to know how much stuff to throw out, or how much stuff to skip. In particular there's no way to rewind the jv parser's state, but more generally there's no obvious right answer for how to handle malformed inputs.
A better approach would be to tolerate specific malformations, as an option. For example, [1 2 3] should be accepted as [1, 2, 3], [1,] should be accepted as [1], and so on.
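For reference, the strict parser rejects both of these forms today; a quick shell check (assuming any stock jq build):

echo '[1 2 3]' | jq '.'   # parse error: no separators between values
echo '[1,]' | jq '.'      # parse error: trailing comma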
It would be nice to have a spec for "dodgy json", and implement a parser for this wide set of strings that are sorta kinda maybe like json data, possibly as an entirely separate codepath. I don't really want to add a bunch of special cases to the current parser.
@stedolan I added some special cases, but not a lot, to the current parser, and it seems to work fine (valgrind'ed and all). It's kinda neat, actually, though I just realized that it doesn't handle [1 2 3] correctly. Basically, the main dodgy thing I'd like to have an option to tolerate is extra and missing commas, because they're kinda unnecessary. The "on parse error skip input till newline" thing is also fairly neat (I forget whose idea that was; not mine).
I'm not sure what a spec for dodgy JSON would look like, and, in any case, JSON is fairly stable now, so the only further change I can think of would be this: perhaps there will be new extensions for a JSON-like encoding not named JSON, but I'm not sure that you (or I) would want to bother.
So I think the parser should be stable, and this handful of special cases should be OK.
But if you like I won't push these. Or, alternatively, you could give me more direction: should I create a jv_parse_dodgy.c that is mostly a copy of jv_parse.c but with all the special cases (minus the flags)?
@stedolan Yeah, I think you're right that this doesn't belong in jv_parse.c. It belongs either in an external pre-processor, or perhaps libjv should have multiple parsers (considering the plethora of binary-JSONs out there...).
Actually, something like this will be needed for the JSON text sequence RFC (if it gets published). The problem is that if you have writers appending to logfiles, it's possible to end up with truncated writes (yes, even if using O_APPEND; it's a long story). So it's desirable to be able to recover. See the json@ietf.org list archives for more details. An option to recover is not difficult, but it has to be an option.
I just need to throw away all state and continue to the next line.
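Until there is an option for that, a crude workaround is to feed jq one line at a time from the shell, so a parse error only costs the offending line. A sketch, assuming one JSON text per line (input.log is a placeholder):

# Run jq separately on each line; lines that fail to parse are dropped.
while IFS= read -r line; do
    printf '%s\n' "$line" | jq '.' 2>/dev/null
done < input.log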
The version in master has a --seq flag that implements https://tools.ietf.org/html/draft-ietf-json-text-sequence-09, which is basically: each JSON text in the sequence is preceded by an ASCII RS (0x1E) character and followed by a newline, so a parser that hits an error can recover by skipping ahead to the next RS.
This isn't released yet, I know.
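For the curious, the format can be exercised from the shell like this (a sketch; \036 is the octal escape for the RS character, and --seq is the flag described above):

# Emit two RS-prefixed, newline-terminated JSON texts and parse them back.
printf '\036{"a":1}\n\036[1,2,3]\n' | jq --seq '.'
# Note that jq --seq also prefixes each output text with RS, per the draft.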
Just stumbled on this because I was attempting to use jq to process the output of MongoDB, which puts non-JSON stuff in the output:
vagrant@vagrant-ubuntu-trusty-64:~$ echo 'db.runCommand( { serverStatus: 1, workingSet: 1 } )' | mongo --quiet | jq '.'
parse error: Invalid numeric literal at line 7, column 38
The problem is the NumberLong and ISODate stuff that mongo throws in there:
vagrant@vagrant-ubuntu-trusty-64:~$ echo 'db.runCommand( { serverStatus: 1, workingSet: 1 } )' | mongo --quiet | head
{
    "host" : "vagrant-ubuntu-trusty-64",
    "version" : "2.4.9",
    "process" : "mongod",
    "pid" : 9730,
    "uptime" : 3311,
    "uptimeMillis" : NumberLong(3311086),
    "uptimeEstimate" : 1672,
    "localTime" : ISODate("2015-01-07T17:08:06.611Z"),
So this is a possible use case and test case for this feature.
@msabramo - The "j" in "jq" stands for JSON, so unless jq changes its spots, the best way to use jq with "quasi-JSON" may be with the aid of two filters: one to transform the quasi-JSON to JSON, and another to recover the quasi-JSON. Here is a simple filter that might be useful. If there are other non-JSON entities that it should handle to be useful in the MongoDB context, please let us know.
#!/bin/bash
# qjson2json
# Version 0.0.1
# Syntax: $0 [-i | --inverse]
# This is a filter for transforming quasi-JSON to JSON, or with the -i option, for transforming JSON back to quasi-JSON.
#
# Currently, the quasi-JSON may contain non-JSON entities such as these:
#
# NumberLong(3311086)
# ISODate("2015-01-07T17:08:06.611Z")
#
# This script is intended to be useful but is not robust; specifically, it makes two assumptions:
# 1. there are no strings that happen to contain the patterns;
# 2. it is sufficient to transform the first occurrence of each pattern on a line
#
# If these assumptions are met, the round-trip should result in the original file.
case "$1" in
-i | --inverse )
sed -e 's/"\(NumberLong([0-9][0-9]*)\)"/\1/' -e 's/"ISODate(\([0-9][-0-9:.TZ]*\))"/ISODate("\1")/'
exit
;;
esac
sed -e 's/\(NumberLong([0-9][0-9]*)\)/"\1"/' -e 's/ISODate("\([0-9][-0-9:.TZ]*\)")/"ISODate(\1)"/'
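For example, it might be used like this (a sketch; this assumes the script above is saved as qjson2json and made executable, and mongo.out is a placeholder for captured MongoDB output):

# quasi-JSON -> JSON, then query as usual
./qjson2json < mongo.out | jq '.uptimeMillis'
# and back again: the inverse restores the original quasi-JSON
./qjson2json < mongo.out | ./qjson2json -i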
The data model is JSON's, but the input and output formats could vary. That is, we could support multiple similar formats (e.g., one or more of the binary JSON formats, ...).
In a sense jq's format is a sequence of JSON texts, and this led to a new MIME type that should come out soon and is already supported in 1.5rc1 (see the --seq option).
A parser for this MongoDB format should be possible, but the type hinting will be lost on output (though an encoder could restore it on the basis of schema or heuristics). jq's internal type system will not evolve to gain new types (bignums, yes, eventually, but that's not a new type, just a better implementation of numbers, for some value of "better").
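To illustrate the "heuristics" point: once qjson2json has turned the MongoDB entities into strings, a jq filter can restore at least the numeric ones. A sketch, assuming jq 1.5+ (for walk, test, and capture) and the hypothetical mongo.out file from above:

./qjson2json < mongo.out | jq '
  # Visit every value recursively; strings shaped like "NumberLong(N)"
  # become the number N; everything else passes through unchanged.
  walk(if type == "string" and test("^NumberLong\\([0-9]+\\)$")
       then capture("^NumberLong\\((?<n>[0-9]+)\\)$").n | tonumber
       else . end)'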
@pkoppstein you're missing the "Timestamp" function as well.
#!/bin/bash
# qjson2json
# Version 0.0.2
# Syntax: $0 [-i | --inverse]
# This is a filter for transforming quasi-JSON to JSON, or with the -i option, for transforming JSON back to quasi-JSON.
#
# Currently, the quasi-JSON may contain non-JSON entities such as these:
#
# NumberLong(3311086)
# ISODate("2015-01-07T17:08:06.611Z")
# Timestamp(1473407008, 1),
#
# This script is intended to be useful but is not robust; specifically, it makes two assumptions:
# 1. there are no strings that happen to contain the patterns;
# 2. it is sufficient to transform the first occurrence of each pattern on a line
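# 3. each Timestamp(...) value is followed by a comma (the sed patterns below rely on it)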
#
# If these assumptions are met, the round-trip should result in the original file.
case "$1" in
-i | --inverse )
sed -e 's/"\(NumberLong([0-9][0-9]*)\)"/\1/' -e 's/"ISODate(\([0-9][-0-9:.TZ]*\))"/ISODate("\1")/' -e 's/"\(Timestamp.*\)",/\1,/'
exit
;;
esac
sed -e 's/\(NumberLong([0-9][0-9]*)\)/"\1"/' -e 's/\(ISODate(\)"\([0-9][-0-9:.TZ]*\)")/"\1\2)"/' -e 's/\(Timestamp.*\),/"\1",/'
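A quick way to check the round-trip claim (a sketch; input.qjson is a placeholder for a file of MongoDB quasi-JSON satisfying the assumptions above):

# diff prints nothing if the round-trip reproduces the input byte for byte
./qjson2json < input.qjson | ./qjson2json -i | diff - input.qjson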
I also wanted this functionality, specifically to be lenient enough to handle valid JS objects.
My solution was this one-liner (needs NodeJS). Just in case you, fair reader, also have this problem.
echo 'console.log(JSON.stringify('`cat imperfect.json`',null,2));' | node | jq '.'
Sometimes data is dirty; add a flag, --allow-parse-errors, which would skip lines that trigger a parse error in the input stream.