isagalaev / ijson

Iterative JSON parser with Pythonic interface
http://pypi.python.org/pypi/ijson/

question regarding a stream of \n separated objects #42

Closed aaronkaplan closed 9 years ago

aaronkaplan commented 9 years ago

Hi @isagalaev

I have a \n separated list of JSON objects arriving via a stream.

Example:

{
  "timestamp": "2015-07-29T20:09:45.304101",
  "ip_str": "1.2.3.4"
}
{
  "timestamp": "2015-07-29T20:09:45.304101",
  "ip_str": "5.6.7.8"
}
...

I wrote a small "SAX" parser based on ijson and so far it performs nicely except for the \n end of objects. I get this error message:

Traceback (most recent call last):
  File "insert_fingerprints.py", line 71, in <module>
    for prefix, event, value in parser:
  File "/usr/local/lib/python2.7/dist-packages/ijson/common.py", line 65, in parse
    for event, value in basic_events:
  File "/usr/local/lib/python2.7/dist-packages/ijson/backends/yajl2.py", line 96, in basic_parse
    raise exception(error.decode('utf-8'))
ijson.common.IncompleteJSONError: parse error: trailing garbage
          01",   "ip_str": "1.2.3.4" } {   "timestamp": "2015-07-29T20
                     (right here) ------^

So, I assume that ijson will not parse this format, which is not strictly valid JSON. However, most streaming JSON files are formatted exactly like this: \n separated objects (with or without semicolons between them). If I put everything into an array, it works fine. But many services just send the \n separated JSON objects.

I think ijson should be tolerant of this \n separated format.

What's needed in your opinion to allow this as well?

For reference see: https://en.wikipedia.org/wiki/JSON_Streaming#Line_delimited_JSON

I guess YAJL actually supports newline delimited JSON. At least it supports "//" comments in JSON.

isagalaev commented 9 years ago

No, I don't think extending the JSON parser itself to support the variant you're proposing would be a good idea.

From the design point of view it's simply too arbitrary and it lacks any formal specifications. I'm not so sure that it's how "most" of JSON streams are constructed. Note that the Wikipedia article that you're referring to actually describes a somewhat different format: the objects themselves don't contain literal line breaks. This is important because this way you can easily split your incoming stream by line breaks yourself and then parse each line with a stock JSON parser.
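A minimal sketch of that split-by-line approach, assuming Python 3 and that every object occupies exactly one line (which is not the case for the pretty-printed example in the question). The helper name `iter_json_lines` and the sample data are made up for illustration; only the standard `json` module is used.

```python
import io
import json


def iter_json_lines(stream):
    """Yield one parsed value per non-empty line of a text stream.

    Hypothetical helper: assumes each JSON value sits on a single line.
    """
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Illustrative sample: the objects from the question, one per line.
sample = io.StringIO(
    '{"timestamp": "2015-07-29T20:09:45.304101", "ip_str": "1.2.3.4"}\n'
    '{"timestamp": "2015-07-29T20:09:45.304101", "ip_str": "5.6.7.8"}\n'
)
records = list(iter_json_lines(sample))
```

Because the stream is consumed line by line, only one record is held in memory at a time, provided the caller iterates over the generator instead of collecting it into a list.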

From the implementation point of view, I simply can't do anything about the fact that yajl doesn't parse this format. It's a third-party library, not my own code. The fact that it does support line comments has nothing to do with that, it's just something that the author of yajl decided to do.

One practical approach for you might be taking a Python backend (ijson.backends.python) and making your own variant of the basic_parse() function. It has an explicit check for additional data which you can leave out and then just run it in a loop. You will see that what it does is simply exhaust the parse_value() iterator which actually stops after parsing one value of any type.
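The same "parse one value, then loop" idea can be sketched without touching ijson at all, using the standard library's `json.JSONDecoder.raw_decode()`, which parses a single value and reports where it ended. This is a different technique from the modified `basic_parse()` described above, not a reproduction of it: it is not event-based, so each individual object must fit in memory, and an incomplete object is re-scanned whenever a new chunk arrives. The function name `iter_concatenated_json` and the chunk size are assumptions made for illustration, and the sketch assumes Python 3 and top-level values that are objects or arrays.

```python
import io
import json


def iter_concatenated_json(stream, chunk_size=65536):
    """Yield JSON values that follow each other in a text stream.

    Hypothetical helper. Values may span several lines and may be
    separated by any amount of whitespace. Assumes top-level values are
    objects or arrays; a bare number split across two chunks would be
    yielded too early.
    """
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        while True:
            buf = buf.lstrip()  # raw_decode does not skip leading whitespace
            if not buf:
                break
            try:
                value, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # value is incomplete so far: read more data
            yield value
            buf = buf[end:]
        if not chunk:  # end of stream
            if buf.strip():
                raise ValueError("stream ended inside an incomplete or invalid JSON value")
            return


# Illustrative sample: the pretty-printed objects from the question.
text = """{
  "timestamp": "2015-07-29T20:09:45.304101",
  "ip_str": "1.2.3.4"
}
{
  "timestamp": "2015-07-29T20:09:45.304101",
  "ip_str": "5.6.7.8"
}
"""
# A tiny chunk size forces objects to be split across reads.
objects = list(iter_concatenated_json(io.StringIO(text), chunk_size=16))
```

Note that malformed data in the middle of the stream is indistinguishable from an incomplete value here, so the buffer keeps growing until the end of the stream before the error is raised.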

aaronkaplan commented 9 years ago

okay, thx for the analysis. I did definitely encounter many \n separated streaming JSON files. However, if indeed YAJL can't handle that, then of course it's a different matter. I'll take a look at the basic_parse() function.

BTW: here is an example of \n separated JSON: https://www.censys.io/data/22-ssh-banner-full_ipv4 (for that, I need some SAX library ;-) since it doesn't fit into RAM)

ozmium commented 6 years ago

Related/similar issue: #40