isagalaev / ijson

Iterative JSON parser with Pythonic interface
http://pypi.python.org/pypi/ijson/

Fails on trivial JSON in Python 3.4 #28

Closed. jimklo closed this issue 9 years ago

jimklo commented 9 years ago

I'm not quite sure how to debug this. I installed from GitHub master and am running Python 3.4.2 with libyajl 2.1.0 on OS X 10.10.

import ijson.backends.yajl2 as ijson
import codecs

class FHoseFile(object):
  def __init__(self, filename, *parms, **kw):
    self.filename = filename

  def iter(self):
    with codecs.open(self.filename, 'r', encoding='utf8') as rawjson:
      objs = ijson.items(rawjson, "item")
      for o in objs:
        yield o

filename = "./sample2.json"

hf = FHoseFile(filename)
for tweet in hf.iter():
  print(tweet)

sample2.json contains:

[{
  "foo":"bar"
}]

When I execute my script I get the following error:

(twit3)evil-jim-klo:src jklo$ ./run.py
Traceback (most recent call last):
  File "./run.py", line 20, in <module>
    for tweet in hf.iter():
  File "/Users/jklo/projects/Sunflower/EmergingEvents/twitterDemo/src/firehose/__init__.py", line 15, in iter
    for o in objs:
  File "/Users/jklo/projects/Sunflower/EmergingEvents/twitterDemo/twit3/lib/python3.4/site-packages/ijson/common.py", line 131, in items
    current, event, value = next(prefixed_events)
  File "/Users/jklo/projects/Sunflower/EmergingEvents/twitterDemo/twit3/lib/python3.4/site-packages/ijson/common.py", line 58, in parse
    for event, value in basic_events:
  File "/Users/jklo/projects/Sunflower/EmergingEvents/twitterDemo/twit3/lib/python3.4/site-packages/ijson/backends/yajl2.py", line 95, in basic_parse
    raise common.JSONError(error.decode('utf-8'))
ijson.common.JSONError: lexical error: invalid char in json text.
                                      [                     (right here) ------^

I set a pdb breakpoint at line 95 of yajl2.py, and this is what I found:

(twit3)evil-jim-klo:src jklo$ ./run.py
> /Users/jklo/projects/Sunflower/EmergingEvents/twitterDemo/twit3/lib/python3.4/site-packages/ijson/backends/yajl2.py(96)basic_parse()
-> raise common.JSONError(error.decode('utf-8'))
(Pdb) buffer
'[{\n  "foo": "bar"\n}]'
(Pdb) error
b'lexical error: invalid char in json text.\n                                      [                     (right here) ------^\n'
(Pdb)

On the surface it looks like \n isn't being ignored or stripped; however, your Travis tests for Python 3.4 seem to be passing.

FWIW, switching from codecs.open to open makes no difference; I get the same error.

With Python 2 it all works; however, my script has some Python 3 dependencies.

isagalaev commented 9 years ago

Off the top of my head, I'd say it's because you're feeding it a decoded unicode stream instead of raw bytes. Try giving it a plain open(self.filename, 'rb').

(I'm sounding uncertain because I never actually tried to give a unicode buffer to YAJL as an input :-) )
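
As a rough illustration of the "decoded stream vs. raw bytes" distinction, here is a minimal sketch of a guard that could sit in front of a bytes-only parser. The helper name require_binary is hypothetical and not part of ijson; it relies only on the fact that reading zero characters consumes nothing but still returns either bytes or str.

```python
import io


def require_binary(stream):
    """Return the stream unchanged if it yields bytes, else raise TypeError.

    Hypothetical helper for illustration only; it is not part of ijson.
    Reading zero characters consumes no input but still reveals whether
    the stream hands back bytes or str.
    """
    probe = stream.read(0)
    if not isinstance(probe, bytes):
        raise TypeError(
            "expected a binary stream (e.g. open(path, 'rb')), "
            "got one yielding %s" % type(probe).__name__
        )
    return stream


# Same document as sample2.json, held in memory for the demonstration.
data = b'[{\n  "foo": "bar"\n}]'

# A binary stream passes through untouched.
binary_ok = require_binary(io.BytesIO(data))

# A decoded text stream is rejected before any parser would see it.
try:
    require_binary(io.StringIO(data.decode("utf-8")))
    text_rejected = False
except TypeError:
    text_rejected = True
```

With a check like this, the mistake would surface as a TypeError naming the stream type rather than as a lexical error from the parser.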

jimklo commented 9 years ago

Yep. That was it. Thanks.

It's odd that open(filename) just works in Py2 but not in Py3. Defaults must have changed.

isagalaev commented 9 years ago

The default mode of open() in both Py2 and Py3 is text. However, in Py2 text is represented by byte strings, and that's what you get from .read() no matter which mode the file was opened in. The mode in Py2 pretty much only controls how line breaks are treated on different platforms. Py3 switched to representing text with unicode, so now you get the bytes type from open('...', 'rb') and the str type (which is unicode) from open('...', 'r'). In other words, in Py3 the mode actually means what it says :-).
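
The Py3 behaviour described above can be checked with a small standard-library sketch. It writes the sample document to a temporary file (the temporary path is an assumption for the demonstration, standing in for sample2.json) and reads it back in both modes.

```python
import os
import tempfile

# Write the sample document from the report to a temporary file.
payload = b'[{\n  "foo": "bar"\n}]'
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Text mode: Python 3 decodes the file contents and returns str.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode: Python 3 returns the raw bytes without decoding.
    with open(path, "rb") as f:
        as_bytes = f.read()
finally:
    os.remove(path)

print(type(as_text).__name__)   # str
print(type(as_bytes).__name__)  # bytes
```

Only the second form hands raw bytes to a consumer, which is what the 'rb' suggestion earlier in the thread relies on.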