isagalaev / ijson

Iterative JSON parser with Pythonic interface
http://pypi.python.org/pypi/ijson/

CFFI Instead of Ctypes #35

Closed Dav1dde closed 9 years ago

Dav1dde commented 9 years ago

So, I was playing around with parsing huge JSON files (19 GiB; the test file is ~520 MiB) and wanted to try some sample code with PyPy. It turns out PyPy needed ~1:30-2:00 minutes, whereas Python 2.7 needed ~13 seconds (the pure-Python implementation came in at ~8 minutes).

Apparently ctypes performs really badly, especially on PyPy. So I made a quick CFFI mockup: https://gist.github.com/Dav1dde/c509d472085f9374fc1d

Before:

Python 2.7: python -m emfas.server size dumps/echoprint-dump-1.json  11.89s user 0.36s system 98% cpu 12.390 total
PyPy: python -m emfas.server size dumps/echoprint-dump-1.json  117.19s user 2.36s system 99% cpu 1:59.95 total

After (CFFI):

Python 2.7: python jsonsize.py ../dumps/echoprint-dump-1.json  8.63s user 0.28s system 99% cpu 8.945 total
PyPy: python jsonsize.py ../dumps/echoprint-dump-1.json  4.04s user 0.34s system 99% cpu 4.392 total

Maybe it would make sense to add an additional CFFI backend which gets chosen over ctypes if CFFI is available.


Test code:

import sys

_IGNORED_SIZE_EVENTS = ('end_map', 'end_array', 'map_key')

def size(ijson, path):
    s = 0
    with open(path) as f:
        events = ijson.parse(f)

        for space, event, data in events:
            if space == 'item' and event not in _IGNORED_SIZE_EVENTS:
                s += 1

    return s

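def main():
    # Swap in the stock ctypes backend to compare timings:
    # from ijson.backends import yajl2 as ijson
    import cffibackend  # the CFFI mockup from the gist above

    path = sys.argv[1]
    # print() with a single argument works on both Python 2.7 and Python 3
    print(size(cffibackend, path))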

if __name__ == '__main__':
    main()
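
For readers unfamiliar with the event stream, here is a minimal sketch of what the counting loop above does. The event list is hand-written, on the assumption that ijson.parse yields (prefix, event, value) tuples for the document [{"a": 1}, {"b": [2, 3]}]; it is not captured output, and count_top_level is a hypothetical helper name.

```python
IGNORED = ('end_map', 'end_array', 'map_key')

# Assumed event stream for: [{"a": 1}, {"b": [2, 3]}]
# Each tuple is (prefix, event, value); the prefix 'item' marks an
# element of the top-level array.
EVENTS = [
    ('', 'start_array', None),
    ('item', 'start_map', None),
    ('item', 'map_key', 'a'),
    ('item.a', 'number', 1),
    ('item', 'end_map', None),
    ('item', 'start_map', None),
    ('item', 'map_key', 'b'),
    ('item.b', 'start_array', None),
    ('item.b.item', 'number', 2),
    ('item.b.item', 'number', 3),
    ('item.b', 'end_array', None),
    ('item', 'end_map', None),
    ('', 'end_array', None),
]


def count_top_level(events):
    """Count events that open (or are) an element of the top-level array."""
    return sum(
        1
        for prefix, event, value in events
        if prefix == 'item' and event not in IGNORED
    )


print(count_top_level(EVENTS))
```

Only the two start_map events with prefix 'item' survive the filter, so the loop effectively counts the elements of the top-level array.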

isagalaev commented 9 years ago

Wow, this definitely looks cool! Thanks for investigating it. To be frank, all the C/Python interop is unexplored territory for me, so if you could fork it and finish the CFFI backends, that'd be awesome!

Dav1dde commented 9 years ago

I am currently working on a pull request. I want to restructure the yajl backends so that they load automatically in this order: yajl2-cffi, yajl2-ctypes, yajl1-cffi, yajl1-ctypes. Is there any benefit to letting the user choose which backend they want?

According to the patch notes, yajl2 is 20%-30% faster than yajl1, so is there any reason to choose yajl1 over yajl2? And from my limited testing it looks like cffi is always faster than ctypes (not to mention the huge gain on PyPy), so cffi should be preferred?

One more thing, I would actually change ijson.__init__ to load the fastest backend available, in this order: yajl2-cffi, yajl2-ctypes, yajl1-cffi, yajl1-ctypes, python.

What are your thoughts on this?
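
The fallback order described above could be sketched roughly as follows. This is only an illustration, not ijson's actual loader: load_first_available is a hypothetical helper, the module paths in CANDIDATES are placeholder names for the backends listed in the comment, and it assumes that an unavailable backend surfaces as an ImportError.

```python
import importlib


def load_first_available(module_names):
    """Return the first module in module_names that imports cleanly."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Assumption: a missing backend (or missing C library)
            # shows up as ImportError, so move on to the next candidate.
            continue
    raise ImportError(
        'none of the candidates could be imported: %r' % (module_names,)
    )


# Placeholder module paths, fastest first, mirroring the proposed order.
CANDIDATES = [
    'ijson.backends.yajl2_cffi',  # hypothetical
    'ijson.backends.yajl2',
    'ijson.backends.yajl_cffi',   # hypothetical
    'ijson.backends.yajl',
    'ijson.backends.python',
]

# Usage (requires ijson to be installed):
#     backend = load_first_available(CANDIDATES)
#     events = backend.parse(open(path))
```

Keeping the candidate list as plain data means a caller can still pass a single name to force one backend, which bears on the question of whether users should be able to choose.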

soundofjw commented 9 years ago

+1 for @Dav1dde's thoughts here. We're using ijson to iterate over some very large Elasticsearch aggregations, and the importance of using the yajl2 backend was initially lost on me.

:beers:

isagalaev commented 9 years ago

Sorry for letting this hang for so long…

The automatic selection of the fastest backend was removed in 96defaf to fix #22. Basically, it seems impractical to test every combination of backends and environments for weird bugs in the selection algorithm, and since the selection runs unconditionally, such a bug risks making the library unusable. I think the reasoning still stands, even though I understand that one might not think to read through the README when the Python backend just works out of the box. I don't know a good way around this yet.

I'm going to try and fix the tests in the CFFI branch and merge it.