isagalaev / ijson

Iterative JSON parser with Pythonic interface
http://pypi.python.org/pypi/ijson/

Support yajl2 backend #3

Closed by selik 12 years ago

selik commented 12 years ago

I get an AttributeError when parsing a JSON file from GitHub Archive.

I installed yajl with Homebrew, which appears to provide yajl version 2.0.4. I installed ijson from the source hosted here on GitHub with python setup.py install; this appears to be version 0.8.0. I'm using Python 2.7.3 and IPython, installed from source a couple of days ago.

Code

import gzip
import ijson

with gzip.open('/Users/mike/src/gads/data/GitHubArchive/2012-03-11-15.json.gz') as g:
    for item in ijson.items(g.read(), 'repository'):
        print(item)
        break

Output

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-cfc9bcb7679a> in <module>()
      1 import gzip
----> 2 import ijson
      3 
      4 with gzip.open('/Users/mike/src/gads/data/github/2012-03-11-15.json.gz') as g:
      5     for item in ijson.items(g.read(), 'repository'):

/Users/mike/venv/gads/lib/python2.7/site-packages/ijson/__init__.py in <module>()
----> 1 from ijson.parse import JSONError, IncompleteJSONError, \
      2                         basic_parse, parse, \
      3                         ObjectBuilder, items

/Users/mike/venv/gads/lib/python2.7/site-packages/ijson/parse.py in <module>()
      3 from decimal import Decimal
      4 
----> 5 from ijson.lib import yajl
      6 
      7 C_EMPTY = CFUNCTYPE(c_int, c_void_p)

/Users/mike/venv/gads/lib/python2.7/site-packages/ijson/lib.py in <module>()
     17 yajl.yajl_alloc.restype = POINTER(c_char)
     18 yajl.yajl_gen_alloc.restype = POINTER(c_char)
---> 19 yajl.yajl_gen_alloc2.restype = POINTER(c_char)
     20 yajl.yajl_get_error.restype = POINTER(c_char)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ctypes/__init__.pyc in __getattr__(self, name)
    376         if name.startswith('__') and name.endswith('__'):
    377             raise AttributeError(name)
--> 378         func = self.__getitem__(name)
    379         setattr(self, name, func)
    380         return func

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ctypes/__init__.pyc in __getitem__(self, name_or_ordinal)
    381 
    382     def __getitem__(self, name_or_ordinal):
--> 383         func = self._FuncPtr((name_or_ordinal, self))
    384         if not isinstance(name_or_ordinal, (int, long)):
    385             func.__name__ = name_or_ordinal

AttributeError: dlsym(0x10253eec0, yajl_gen_alloc2): symbol not found

If I comment out line 19 in lib.py

-19 yajl.yajl_gen_alloc2.restype = POINTER(c_char)
+19 # yajl.yajl_gen_alloc2.restype = POINTER(c_char)

Then the same code crashes the kernel in the IPython notebook and, when run in the IPython console, gives a segmentation fault.

In [1]: import gzip
In [2]: import ijson
In [3]: with gzip.open('/Users/mike/src/gads/data/github/2012-03-11-15.json.gz') as g:
   ...:         for item in ijson.items(g.read(), 'repository.item'):
   ...:                 print(item)
   ...:                 break
   ...:     
Segmentation fault: 11

The segfault makes sense if setting the return type of yajl_gen_alloc2 did something useful.

Just in case, is this possibly related to the way GitHub Archive does not delimit its objects (https://github.com/igrigorik/githubarchive.org/issues/9)?
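On the undelimited-objects question, here is a minimal sketch of how back-to-back JSON objects can be split. It uses the standard json module's raw_decode rather than ijson, and iter_concatenated is a hypothetical helper name. It assumes the whole decompressed text fits in memory, so it is not iterative parsing.

```python
import json


def iter_concatenated(text):
    """Yield each top-level JSON value from a string of concatenated values.

    Hypothetical helper: handles input such as '{"a": 1}{"b": 2}' where
    objects follow one another with no delimiter (whitespace is allowed).
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the decoded value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

Under Python 3 the result of g.read() on a gzip file is bytes, so it would need decoding (for example with .decode('utf-8')) before being passed in.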

selik commented 12 years ago

http://forums.mudlet.org/viewtopic.php?f=7&t=2183

"On further inspection it appears that these functions are no longer defined in yajl 2.0. I think yajl_parse_complete was just renamed to yajl_complete_parse, while yajl_gen_alloc2 was removed, with the equivalent functionality provided elsewhere."
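A missing symbol like this can be detected before touching it. The sketch below is one possible guard, not what ijson does: set_restype_if_present is a hypothetical helper that relies on ctypes raising AttributeError when a shared library does not export a name.

```python
import ctypes


def set_restype_if_present(lib, name, restype):
    """Set lib.<name>.restype only if the library exports <name>.

    Hypothetical helper. Returns True when the symbol was found and
    False when it is missing (as yajl_gen_alloc2 is in yajl 2.x).
    """
    # ctypes raises AttributeError when the symbol lookup fails,
    # so getattr with a default turns that into None.
    func = getattr(lib, name, None)
    if func is None:
        return False
    func.restype = restype
    return True
```

With a loaded library object yajl, the failing line in lib.py would become set_restype_if_present(yajl, 'yajl_gen_alloc2', ctypes.POINTER(ctypes.c_char)). This only avoids the error at import time; as the segmentation fault above suggests, code written against the yajl 1.x API still cannot be expected to run correctly on yajl 2.x.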

isagalaev commented 12 years ago

That's a bummer :-(. I want to keep ijson working out of the box on recent(-ish) Ubuntu, and it still has yajl 1.x in its repositories. I'm not sure what the best way forward is here. I could probably make a yajl2-specific branch for that. The problem with non-master branches is that they're pretty much undiscoverable for users.

BTW, there's already a branch "buffered" with a pure-Python parsing implementation that doesn't depend on yajl. It's obviously slower under CPython but is on par with yajl-powered code under PyPy.

selik commented 12 years ago

The "buffered" branch works just fine, though it also refers to README.txt in the setup script. You're right that non-master branches get ignored most of the time.

Non-sequitur: why not provide a function that iterates through all top-level objects in the JSON file? The function items(file, prefix) requires the user to know the structure of the JSON file, which isn't always true. Also, how would one handle an item that doesn't have a prefix (ex: https://gist.github.com/2017462)?

isagalaev commented 12 years ago

Just specify an empty string as a prefix for the top-level object (and it's always only a single one). So, basically, this is the equivalent of the usual "load everything" function:

import ijson

def load(f):
    return list(ijson.items(f, ''))[0]

selik commented 12 years ago

Perhaps it would be good to set an empty string as the default for the prefix parameter?

def items(file, prefix=''):
    ...

isagalaev commented 12 years ago

Nah… I don't think so. Default parameters are good for common use cases, whereas this one is the exact opposite of the iterative parsing that ijson is good for: it just parses the whole document like any other parser. What might be useful, though, is to have functions load() and loads() as a compatibility interface with the standard json module.
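A rough sketch of what such a compatibility pair could look like, built only on the items(file, '') idiom shown earlier in this thread. This is not an existing ijson API: load, loads and the items parameter (which lets another items-like function be swapped in) are all hypothetical, and Python 3 string handling is assumed.

```python
import io


def load(f, items=None):
    """Return the single top-level value parsed from file-like object f.

    `items` is a hypothetical hook; by default ijson.items is used.
    """
    if items is None:
        import ijson  # imported lazily so the hook can replace it
        items = ijson.items
    # With an empty prefix the iterator yields the top-level value.
    for obj in items(f, ''):
        return obj
    raise ValueError('no JSON document found')


def loads(s, items=None):
    """Like load(), but takes the document as a str or bytes value."""
    if isinstance(s, str):
        s = s.encode('utf-8')
    return load(io.BytesIO(s), items)
```

As noted above, this builds the whole document in memory, so it gives up the benefit of iterative parsing and is only useful as a drop-in for code written against the json module.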

isagalaev commented 12 years ago

A quick update. Another person (@sashka) wanted to use ijson with yajl2, so we decided to refactor ijson to support several backends, and he volunteered to make a backend for yajl2. The backend structure is already in place, so you can switch back to the master branch and use import ijson.backends.python as ijson to get the pure-Python backend.

isagalaev commented 12 years ago

Implemented the yajl2 backend and guessing logic that finds whatever version of yajl is available on the system, falling back to the pure-Python backend if neither is found. So all of these should work on a system with yajl2 installed:

import ijson
import ijson.backends.yajl2 as ijson
import ijson.backends.python as ijson
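For anyone who wants to control the backend order explicitly, the same kind of fallback can be sketched in user code. This is not ijson's actual guessing logic: first_importable is a hypothetical helper, and it assumes that a backend whose C library is missing fails with ImportError when imported.

```python
import importlib


def first_importable(module_names):
    """Import and return the first module in module_names that loads.

    Hypothetical helper; raises ImportError if none can be imported.
    """
    names = list(module_names)
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This candidate is unavailable; try the next one.
            continue
    raise ImportError('none of %r could be imported' % (names,))
```

A call such as first_importable(['ijson.backends.yajl2', 'ijson.backends.yajl', 'ijson.backends.python']) would then prefer yajl2 and end with the pure-Python backend. The module name ijson.backends.yajl for the yajl 1.x backend is an assumption here; only the yajl2 and python names appear above.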