isagalaev / ijson

Iterative JSON parser with Pythonic interface
http://pypi.python.org/pypi/ijson/
Other
615 stars 134 forks source link

New C extension backend #58

Closed rtobar closed 5 years ago

rtobar commented 8 years ago

This pull request is to add a new backend based on a C extension. The C extension exposes three generators: basic_parse, parse and items, which work the same way their counterparts in the other backends and in the common module work. Reference counting seems correct, creating neither memory corruptions nor memory leaks when parsing big amounts of data.

The main advantage of this new backend is its speed. Local tests indicate a gain of ~10x for the items generator, and even more for the parse and basic_parse generators, when compared with the cffi backend. This makes this backend comparable to the json module in terms of speed, while still giving users an iterative way of consuming a JSON stream, and therefore keeping memory consumption low.

I've re-organized the commits from their original form to make the pull request more compact and manageable. Commits include not only the new backend, but also their integration to the package installation procedure (optional, only if the compiler/libraries are found) and to the tests suite. I've also modified the travis configuration to use YAJL2 for testing, effectively testing more backends now (4 instead of 2).

rtobar commented 8 years ago

Hi Ivan,

Any chance to get this work reviewed and integrated?

isagalaev commented 8 years ago

Any chance to get this work reviewed and integrated?

Yes! Looks like great stuff. I'll look at it once I get around ijson again. Being ve-e-ery busy at the moment :-)

isagalaev commented 7 years ago

Hi Rodrigo! I have finally found some time to review this pull request. It is a great addition and I'm looking forward to committing it after we iron out a few things I mentioned above. Thanks a lot!

rtobar commented 7 years ago

Hi Ivan,

Thanks to you for taking the time to review these changes. I'm currently on holiday, and won't be back to the office until the last week of January, so I might or not be able to do something before then.

I responded to your observations above, which all seem very reasonable. I'm expecting small feedback only on one small point, the rest I'll implement as requested.

Thanks again!

isagalaev commented 7 years ago

Thanks for a quick reply! I'll try to be more responsive as well :-) Take your holiday time!

KenjiTakahashi commented 7 years ago

Hi, I have a need to iteratively parse very large JSON objects and I've decided to try this branch out for more speed. It works fine with Py2.7, but with Py3.6, I get a segfault. Here's a backtrace

(gdb) bt
#0  0x00007fbaf1320635 in ?? () from /usr/lib/libpython3.6m.so.1.0
#1  0x00007fbaf13b579e in PyArg_ParseTupleAndKeywords () from /usr/lib/libpython3.6m.so.1.0
#2  0x00007fbaeff881b3 in basicparse_init (self=0x7fbaf0476e30, args=<optimized out>, kwargs=<optimized out>) at ijson/backends/_yajl2.c:233
#3  0x00007fbaeff87cdf in parsegen_init (self=0x7fbaf0476e30, args=<optimized out>, kwargs=<optimized out>) at ijson/backends/_yajl2.c:447
#4  0x00007fbaeff87c09 in itemsgen_init (self=0x7fbaf0476e30, args=<optimized out>, kwargs=0x0) at ijson/backends/_yajl2.c:760
#5  0x00007fbaf14432ce in ?? () from /usr/lib/libpython3.6m.so.1.0
#6  0x00007fbaf145d85c in _PyObject_FastCallDict () from /usr/lib/libpython3.6m.so.1.0
#7  0x00007fbaf13e4396 in ?? () from /usr/lib/libpython3.6m.so.1.0
#8  0x00007fbaf13a5067 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.6m.so.1.0
#9  0x00007fbaf13e3d4a in ?? () from /usr/lib/libpython3.6m.so.1.0
#10 0x00007fbaf13e4303 in ?? () from /usr/lib/libpython3.6m.so.1.0
#11 0x00007fbaf13a5067 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.6m.so.1.0
#12 0x00007fbaf13e4757 in PyEval_EvalCodeEx () from /usr/lib/libpython3.6m.so.1.0
#13 0x00007fbaf13a4d4b in PyEval_EvalCode () from /usr/lib/libpython3.6m.so.1.0
#14 0x00007fbaf1486112 in ?? () from /usr/lib/libpython3.6m.so.1.0
#15 0x00007fbaf148897d in PyRun_FileExFlags () from /usr/lib/libpython3.6m.so.1.0
#16 0x00007fbaf1488b67 in PyRun_SimpleFileExFlags () from /usr/lib/libpython3.6m.so.1.0
#17 0x00007fbaf147ca91 in Py_Main () from /usr/lib/libpython3.6m.so.1.0
#18 0x0000000000400a5d in main ()
(gdb)

and a simple code that replicates it

from io import StringIO

import ijson.backends.yajl2_c as ijson

test = StringIO('[{"events": 1}]')

items = ijson.items(test, 'item.events')

Thanks for working on this, cheers!

rtobar commented 7 years ago

@KenjiTakahashi Python 3.6 is fairly new, it wasn't around at the time of submitting these changes, and thus it hadn't been tested until now. I'll definitely have a look at this new issue once I'm back at work in ~1.5 weeks; until then you can use Python 3.5 which should work fine.

rtobar commented 7 years ago

Hi @isagalaev,

I'm finally back at work!

After some reading I learned that the "All rights reserved" phrase has no legal effect at all nowadays, and thus I simply removed it from the two files I added in this pull request. The importing of the C module has been changed as agreed, and the additional printing from the tests has been removed as well.

I've also included a fix for the bug reported by @KenjiTakahashi, and extended the python versions used by travis to include 3.5 and 3.6.

KenjiTakahashi commented 7 years ago

@isagalaev any plans for getting this in?

rtobar commented 7 years ago

pinging again

isagalaev commented 7 years ago

Sorry, I'm trying to find time to merge it and release a new version. I thought I'd be able to do it during PyCon past weekend but didn't get around…

rtobar commented 7 years ago

Just the usual ping...

KenjiTakahashi commented 6 years ago

It's been some time again, hasn't it :-)?

wichert commented 5 years ago

@isagalaev Yearly ping - is there a chance you can take a look at this again and hopefully merge it?

isagalaev commented 5 years ago

Hey everyone. I lost track of things happening here a while ago and don't remember why I didn't merge it in the first place. It's also unlikely I'll find time to get back to it any time soon. As far as I understand, @rtobar fork is alive and well, so I recommend everyone just use it.

(I'll close the ticket to avoid confusion, but feel free to continue conversation if needed.)

wichert commented 5 years ago

@isagalaev Assuming @rtobar is interested, would you be willing to give him PyPi privileges to make new releases?

rtobar commented 5 years ago

Sure, I wouldn't mind on my side. My user in PyPI is rtobar too.

isagalaev commented 5 years ago

@rtobar hey! Added you on PyPI as an owner. I can't seem to remove myself, so feel free to either leave me there for precaution or take it over completely.

rtobar commented 5 years ago

Thanks Ivan! I already exercised this and uploaded a new 2.4 version of ijson to PyPI, including binary wheels for the main x64 linux/macos.

isagalaev commented 5 years ago

Yay! Glad not to be a bottleneck anymore! And thanks to @wichert for kicking me in the right direction.