isagalaev / ijson

Iterative JSON parser with Pythonic interface
http://pypi.python.org/pypi/ijson/

Allow to differentiate invalid and incomplete JSON #29

Closed oprypin closed 9 years ago

oprypin commented 9 years ago

There is a difference between JSON that is definitely invalid and JSON that is invalid, but may be made correct by completing it.

And that's the information I need to get in my project and the only reason that I may want to use an iterative JSON parser.

Examples of invalid JSON:

[][                :
{"":"",[           a
{"":"",["a"]}      {""}
}                  {}",:
[{]}               [a;
[]{                {"a","b"}
[{}"],             {"":"",["a"]}

Examples of (potentially) valid JSON:

{                  {"a":"b"}
{"a":"b            ["
["]                ""
{"a":"b"           "z
{",":              {"a":"b}
{"                 [
"                  [{"{"

I see that you removed exactly the feature that I needed in commit https://github.com/isagalaev/ijson/commit/e079cc21aaaf9d867c89897b69580687ba5e4283, which is not nice. There definitely was no harm in having it.

Here is what I have to deal with now:

import io

import ijson

def json_still_valid(js):
    # js is a bytes object holding the (possibly truncated) document
    try:
        list(ijson.parse(io.BytesIO(js)))
    except ijson.JSONError as e:
        if str(e) == "Incomplete JSON data":
            pass
        elif str(e) == "Incomplete string lexeme":
            try:
                # See if adding a quote would fix it
                list(ijson.parse(io.BytesIO(js + b'"')))
            except ijson.JSONError:
                return False
        else:
            return False
    return True

This code seems to work without flaws; I've tested it quite extensively on the above examples and many more. But obviously it's inelegant and relies on the text of exception messages.

Using the old version the code goes like this:

def json_still_valid(js):
    try:
        list(ijson.parse(io.BytesIO(js)))
    except ijson.IncompleteJSONError:
        pass
    except ijson.JSONError:
        return False
    return True

But it gives some misinformation, specifically when a string literal is involved (hence the workaround in the previous snippet):

[{}"],
{}",:

These inputs are not incomplete or empty JSON data; they're definitely wrong.

I would be thankful if you added this back, taking this detail into account.
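To make the requested behaviour concrete, here is a minimal pure-Python sketch of a checker that reports "incomplete" only when the input runs out and "invalid" as soon as it meets a character that can never fit. The class names mirror the ones used above but are hypothetical stand-ins defined in the snippet, `could_become_valid` is an invented helper name, and nothing here reflects how any ijson backend is implemented.

```python
class JSONError(ValueError):
    """Hypothetical stand-in (not ijson's class): input can never be valid."""

class IncompleteJSONError(JSONError):
    """Hypothetical stand-in: input ended, but all of it was acceptable."""

WS, DIGITS, HEX = " \t\n\r", "0123456789", "0123456789abcdefABCDEF"

def _peek(s, i):
    if i >= len(s):  # running out of input is the only source of "incomplete"
        raise IncompleteJSONError("input ends at offset %d" % i)
    return s[i]

def _skip_ws(s, i):
    while i < len(s) and s[i] in WS:
        i += 1
    return i

def _digits(s, i):
    if _peek(s, i) not in DIGITS:
        raise JSONError("digit expected at offset %d" % i)
    while i < len(s) and s[i] in DIGITS:
        i += 1
    return i

def _number(s, i):
    if s[i] == "-":
        i += 1
    i = i + 1 if _peek(s, i) == "0" else _digits(s, i)  # no leading zeros
    if i < len(s) and s[i] == ".":
        i = _digits(s, i + 1)
    if i < len(s) and s[i] in "eE":
        i = _digits(s, i + 2 if _peek(s, i + 1) in "+-" else i + 1)
    return i

def _string(s, i):
    i += 1  # skip the opening quote
    while True:
        c = _peek(s, i)
        if c == '"':
            return i + 1
        if c < " ":
            raise JSONError("control character at offset %d" % i)
        if c != "\\":
            i += 1
        elif _peek(s, i + 1) == "u":
            if any(_peek(s, k) not in HEX for k in range(i + 2, i + 6)):
                raise JSONError("bad \\u escape at offset %d" % i)
            i += 6
        elif s[i + 1] in '"\\/bfnrt':
            i += 2
        else:
            raise JSONError("bad escape at offset %d" % i)

def _container(s, i, close):
    i = _skip_ws(s, i + 1)
    if _peek(s, i) == close:
        return i + 1  # empty array or object
    while True:
        if close == "}":  # object members start with "key":
            if _peek(s, i) != '"':
                raise JSONError("key expected at offset %d" % i)
            i = _skip_ws(s, _string(s, i))
            if _peek(s, i) != ":":
                raise JSONError("':' expected at offset %d" % i)
            i += 1
        i = _skip_ws(s, _value(s, i))
        if _peek(s, i) == close:
            return i + 1
        if s[i] != ",":
            raise JSONError("',' or %r expected at offset %d" % (close, i))
        i = _skip_ws(s, i + 1)

def _value(s, i):
    i = _skip_ws(s, i)
    c = _peek(s, i)
    if c == '"':
        return _string(s, i)
    if c in "[{":
        return _container(s, i, "]" if c == "[" else "}")
    if c == "-" or c in DIGITS:
        return _number(s, i)
    for word in ("true", "false", "null"):
        rest = s[i:i + len(word)]
        if rest == word:
            return i + len(word)
        if word.startswith(rest):  # input ended in the middle of the literal
            raise IncompleteJSONError("input ends inside %r" % word)
    raise JSONError("unexpected character at offset %d" % i)

def could_become_valid(s):
    try:
        return _skip_ws(s, _value(s, 0)) == len(s)  # False on trailing data
    except IncompleteJSONError:  # subclass first, then the base class
        return True
    except JSONError:
        return False
```

With this sketch, `could_become_valid('{"a":"b')` returns True, while `could_become_valid('{}",:')` returns False, because the stray quote arrives after the top-level value has already ended. The parser is recursive, so very deeply nested input could hit Python's recursion limit.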


I also made a comparably cringy solution with json from the standard library:

import json

def json_still_valid(js):
    try:
        json.loads(js)
    except ValueError as e:
        msg = str(e)
        if msg.startswith("Expecting"):
            # Expecting value: line 1 column 4 (char 3)
            n = int(msg.rstrip(')').split()[-1])
            # If the error is "outside" the string, it can still be valid
            return n >= len(js)
        elif msg.startswith("Unterminated"):
            # Unterminated string starting at: line 1 column 1 (char 0)
            return True
        return False
    return True
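
A variant of the standard-library heuristic above, sketched under the assumption of Python 3.5 or later, where `json.JSONDecodeError` exposes `msg` and `pos` attributes, so the offset no longer has to be parsed out of the message text. It still depends on the wording of CPython's error messages. The example lists are copied from the top of this post, and `misclassified` is a helper name invented for this sketch.

```python
import json

def json_still_valid(js):
    """True if js is valid JSON or might become valid once more text arrives."""
    try:
        json.loads(js)
    except json.JSONDecodeError as e:
        if e.msg.startswith("Unterminated"):
            return True  # an open string can still be closed later
        # "Expecting ..." at the very end means the parser simply ran out of input.
        return e.msg.startswith("Expecting") and e.pos >= len(js)
    return True

INVALID = [
    '[][', ':', '{"":"",[', 'a', '{"":"",["a"]}', '{""}', '}',
    '{}",:', '[{]}', '[a;', '[]{', '{"a","b"}', '[{}"],',
]
POTENTIALLY_VALID = [
    '{', '{"a":"b"}', '{"a":"b', '["', '["]', '""', '{"a":"b"',
    '"z', '{",":', '{"a":"b}', '{"', '[', '"', '[{"{"',
]

def misclassified():
    """Return the examples on which the heuristic disagrees with the lists."""
    wrong = [s for s in INVALID if json_still_valid(s)]
    wrong += [s for s in POTENTIALLY_VALID if not json_still_valid(s)]
    return wrong
```

Calling `misclassified()` gives a quick regression check over the examples. The heuristic has a known gap: input cut off in the middle of a literal, such as `tru`, is reported as invalid, because the error position points at the start of the token instead of the end of the input.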

isagalaev commented 9 years ago

Hey there! Sorry for taking so long to reply… I get your point and I'll try to reintroduce IncompleteJSONError into the code. The problem is that I didn't remove it for no reason: there was something wrong with it in one of the backends (can't remember all the details right now). So it will take a bit more effort than simply reverting that change. I'll get to it as time allows.

Alternatively, you could have a go at it yourself. Just make sure you add specific tests for it. It might also be that yajl1 simply doesn't do it properly, in which case it should just raise the generic JSONError and be excluded from tests for incompleteness.

oprypin commented 9 years ago

:+1: Thanks

isagalaev commented 9 years ago

Oh, since you're so conveniently online, could you also test it on your data? I'm going to push a new release shortly so it'd be nice to have some extra testing. Thanks!

oprypin commented 9 years ago

But I have provided many tests in my original post. Just add those.

My original project was postponed, and there is no real unusual data to speak of, just incomplete JSON strings.

isagalaev commented 9 years ago

Ah… Indeed :-)

isagalaev commented 9 years ago

btw, a couple of those are perfectly valid and complete:

{"a":"b"}
""

oprypin commented 9 years ago

"(potentially) valid" includes valid.

But yes, sorry for the confusion.