isagalaev / ijson

Iterative JSON parser with Pythonic interface
http://pypi.python.org/pypi/ijson/

asyncio backend #45

Closed kxepal closed 4 years ago

kxepal commented 8 years ago

Recently @ethanfrey raised a question about using ijson with asyncio coroutines. To my surprise, I found an asyncio.py module in the ijson.backends package of the 2.2 version downloaded from PyPI. There is no such file in the source repository for the same tag.

Unfortunately, that module doesn't implement any asyncio support and seems to be just an outdated copy of the python backend that accidentally leaked into the PyPI package.

However, the question still stands. @isagalaev, are there any plans for an asyncio backend?

isagalaev commented 8 years ago

That file has indeed been picked up from source by mistake, sorry! I meant to try and make an async backend but couldn't find time after all.

My main problem was inventing the right API for it. Back then Python didn't have an async iterator protocol, so the usual approach of for item in ijson.items(...): wouldn't work, as it's a pull parser and it accesses the source of bytes (i.e. the network) whenever needed. My best idea was something like this:

while True:
    items = yield from ijson.get_items(...)
    if not items: # empty list
        break
    for item in items:
        ...

I didn't like the nesting of two loops, though…

As of Python 3.5 we have async for, so this question goes away and we can have a natural API. Implementation-wise it probably means converting Lexer, basic_parse, parse and items to support the async iterator interface.
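Sketching what I mean (illustrative names only, not real ijson code): an object that pulls bytes from an async source only when its buffer of parse events runs dry, and hands events out one at a time through the async iterator protocol:

import collections

class AsyncBasicParse:
    # A rough sketch, assuming `source` has an async read(n) method
    # (e.g. an aiohttp response body). Not actual ijson code.

    def __init__(self, source, buf_size=64 * 1024):
        self.source = source
        self.buf_size = buf_size
        self.events = collections.deque()  # buffered (event, value) pairs

    def __aiter__(self):
        return self

    async def __anext__(self):
        # pull more bytes only when buffered events run out, which
        # preserves the pull-parser behaviour described above
        while not self.events:
            chunk = await self.source.read(self.buf_size)
            if not chunk:
                raise StopAsyncIteration
            self.events.extend(self._tokenize(chunk))
        return self.events.popleft()

    def _tokenize(self, chunk):
        # stand-in for the real lexer/parser; a real version would emit
        # pairs like ('start_map', None), ('map_key', 'id'), ('number', 42)
        return [('chunk', chunk)]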

Having said all that, I'm fairly sure I won't be able to do that myself any time soon, so I encourage anyone to have a go! I can only promise to peer-review, merge and release it as part of ijson.

kxepal commented 8 years ago

@isagalaev Indeed, for 3.3/3.4 there is no better solution for async iteration: it's still while True with a sentinel guard or an exception catch. With async for things become much nicer.

As for the API, I'm playing around with it and would like to make it something like this:

import asyncio
import aiohttp
from ijson.backends import asyncio as ijson

@asyncio.coroutine
def cgo():  # 3.3 and 3.4
    resp = yield from aiohttp.get('http://localhost:5984')
    items = ijson.items(resp.content)
    while True:
        try:
            item = yield from items.next()
        except StopIteration:
            # unfortunately, we cannot do "if item is None: break" since None is a legit value,
            # and I'm not sure that using Ellipsis as a sentinel instead would be obvious
            break
        print(item)

async def ago():  # 3.5+
    resp = await aiohttp.get('http://localhost:5984')
    async for item in ijson.items(resp.content):
        print(item)

Behind the ijson.items call we hide an object that implements the async-for protocol and can also be used from yield-from coroutines:

class AsyncIterable(object):

    def __init__(self, coro):
        self.coro = coro

    @asyncio.coroutine
    def __aiter__(self):
        return self

    @asyncio.coroutine
    def __anext__(self):
        try:
            rval = yield from self.next()
        except StopIteration:
            raise StopAsyncIteration
        else:
            return rval

    @asyncio.coroutine
    def next(self):
        ...
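For comparison, here is the same idea written against the native 3.5 protocol only, without the yield-from compatibility layer (a toy, with a plain list standing in for the real event source). Note that None flows through as an ordinary value, which is exactly why the sentinel approach above is awkward:

import asyncio

class AsyncItems:
    def __init__(self, values):
        self._values = iter(values)

    def __aiter__(self):
        return self

    async def __anext__(self):
        await asyncio.sleep(0)  # stand-in for awaiting real I/O
        try:
            return next(self._values)
        except StopIteration:
            raise StopAsyncIteration

async def demo():
    async for item in AsyncItems([1, None, 'two']):
        print(item)  # None prints just fine: no sentinel needed

asyncio.get_event_loop().run_until_complete(demo())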

I'm still working on the prototype, but so far it looks promising.

The only problem I found is that it's impossible to reuse the existing code in the python backend module. While some bits could be extracted into their own functions, most of them are still heavily tied to the f.read() calls behind them. And generators don't integrate very well with asyncio coroutines. Solving that would require some serious refactoring, and for now I have no idea how to recompose that code in the right way.

The good news is that I can probably let the asyncio backend use the yajl* ones for parsing - there f.read() can easily be decoupled from the parser.
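Roughly the shape I have in mind (make_push_parser is a made-up stand-in for a yajl-style parser that only works on in-memory buffers; a real backend would hand events out incrementally rather than collecting them):

@asyncio.coroutine
def basic_parse(response, buf_size=64 * 1024):
    parser = make_push_parser()  # hypothetical buffer-only parser
    events = []
    while True:
        chunk = yield from response.read(buf_size)
        if not chunk:
            events.extend(parser.finish())  # flush buffered state at EOF
            break
        events.extend(parser.feed(chunk))  # pure CPU work, no I/O here
    return events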

PR will follow later.

isagalaev commented 8 years ago

To be honest, I don't even think we need to support versions older than 3.5. Since we didn't have any async support before, we don't have any legacy code depending on it, so if we implement a pure async for API nothing is going to break. And for people living with 3.3 and 3.4, an upgrade to 3.5 is a no-brainer and shouldn't break anything either. (I also had plans to retire 2.x support entirely at some point.)

The only problem I found is that it's impossible to reuse the existing code in the python backend module.

Yeah, I was thinking about that too. A good plan would be to just write the async version from scratch and start thinking about reusing common code later; no need to wait for that to happen.

The good news is that I can probably let the asyncio backend use the yajl* ones for parsing - there f.read() can easily be decoupled from the parser.

We're lucky that yajl works on a buffer it gets from the caller; it doesn't do any I/O at all.

BTW, I'll mark you as an assignee on this issue, so people know that someone's working on it and won't duplicate the effort.

kxepal commented 8 years ago

Well, 3.4 support wouldn't change much except yield from vs await. Still, there is no anext() builtin, so yield from coro.next() turns into the ugly await coro.__anext__().

So this support is cheap, and most of the aio-libs still support 3.4 - those users are our clients as well.

ethanfrey commented 8 years ago

Hi, I have a working prototype that is just lacking real-world testing. I started with ijson, then realized I couldn't use any of the code, since it was all generators, which don't play well with coroutines. So I started rewriting all I could, manually unwrapping them into __aiter__ and __anext__ (and adding a next() method as a nice wrapper for __anext__).

Anyway, it is currently in its own repo. I could repackage it as a submodule of ijson. It is based on the same code, just rewritten to work with asyncio.

Repo is at https://github.com/ethanfrey/aiojson. You can see the tests, which show the API, at https://github.com/ethanfrey/aiojson/blob/master/aiojson/backends/tests/test_async.py. @with_reader() just runs the coroutine with a network stream containing the desired data, so I can test the async part.

I really want to unwrap the last two instances of aiogen in the backend, to make it more efficient. aiogen was just my cheap way to get the coroutines working something like generators, but it launches a separate coroutine for each one.

ethanfrey commented 8 years ago

@kxepal I think the implementation looks nice, and requires really minimal changes to the flow. The downside I see is the sheer number of coroutines.

I know they are light-weight, but I am not sure how light-weight. One coroutine for the lexer, one for parse_value, which recurses into another for parse_object, which recurses into another for parse_array, which recurses into another for parse_value, ... which is fed into the basic_parser, which is fed into the item parser. I guess a typical stream would use 6-10 coroutines depending on the complexity of the JSON being parsed. And many of these coroutines are dynamically created and destroyed. It could be that asyncio is built solidly and can handle this with minimal overhead, or it could create a lot of overhead. I think it would be nice to see some performance results.

I went for manually unrolling the generators, which is somewhat more difficult and makes the code a little less clear, but allows the whole parsing to happen in one coroutine. Maybe I'm just carrying prejudices from the threaded world about launching too many processes. I think your code is more readable; I just wonder how efficient it is. I haven't really looked at the internals of asyncio, so I may be wrong in my assumptions.

kxepal commented 8 years ago

@ethanfrey Basically, you're right: coroutines aren't cheap.

However, if we need performance then the yajl* backends have to be reused - I'll add this feature a bit later today. Then we'll have ~4 coroutines: file reader, basic_parse, parse, items. Maybe one or two more, I think. This will solve all the performance questions, and I/O will be the bottleneck, as it usually should be.

So I think there is no need to worry about the pure Python implementation while we can significantly improve its performance with so little effort.

isagalaev commented 8 years ago

A quick note on performance. It only makes sense to talk about the performance of the pure python backend under PyPy, and it's impossible to tell which approach will be faster there; you have to measure. So I'd say we should optimize for readability first, not performance, as is usually the choice with Python.
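For what it's worth, measuring is cheap; something along these lines (sample.json and the prefix are placeholders for a document representative of your workload):

import io
import timeit

from ijson.backends import python as ijson_python

data = open('sample.json', 'rb').read()  # placeholder test document

def run():
    # exhaust the iterator so the whole document actually gets parsed
    for _ in ijson_python.items(io.BytesIO(data), 'earth.europe.item'):
        pass

print(timeit.timeit(run, number=10))  # run under CPython and PyPy and compare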

ethanfrey commented 8 years ago

@kxepal Nice article about coroutines; it's good to see some measurements that confirm or disprove hunches.

I found some time to take a deeper look at this code and am actually quite impressed. The only minor comment is about compatibility with "async for". It looks like, using AsyncIterable, it will work for the various objects, but I think BasicParser doesn't raise StopAsyncIteration (though it will rarely be used directly, so it's really a minor issue). I will try to add some Python 3.5 tests for "async for" and add them to the pull request if you don't mind.
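Something like this, probably (reader_for is a made-up stand-in for whatever fixture feeds bytes asynchronously in the real test suite, and items is the async items() under discussion):

import asyncio
import unittest

class AsyncForTest(unittest.TestCase):

    def test_async_for(self):
        async def go():
            result = []
            # 'item' is ijson's prefix for elements of a top-level array;
            # reader_for and items are assumed to come from the PR's code
            async for item in items(reader_for(b'["a", "b"]'), 'item'):
                result.append(item)
            return result

        loop = asyncio.new_event_loop()
        try:
            self.assertEqual(loop.run_until_complete(go()), ['a', 'b'])
        finally:
            loop.close()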

kxepal commented 8 years ago

@ethanfrey That's a good point about BasicParser. While it won't be used directly, there is still an ugly hack left in Parser to raise StopAsyncIteration when all the data has been read. So the problem exists in the design, but shouldn't affect the end user. I'll fix that anyway, thanks!

carver commented 6 years ago

@isagalaev @kxepal Any updates on this? #46 seems to add this support, and @luigiberrettini was able to merge it in another fork. I'd love to use this in an async context. (Py 3.5+)

kxepal commented 6 years ago

If any changes are requested, I'm ready to handle them, but the final decision is up to @isagalaev.

So far it works well enough and is ready to merge.

isagalaev commented 6 years ago

Hey guys… I am currently away from all of my open source projects and don't have any estimate on when I'll get back. So feel free to use forks; you don't have to rely on me being available.

kxepal commented 6 years ago

Hi @isagalaev !

It's a pity to hear that. I hope you'll get back to open source soon! But until then, so as not to divide the ijson community into forks, how would you feel about turning ijson into a GitHub org, assigning some co-maintainers, and letting them help keep the project up to date - doing merges, releases and everything else? Certainly, you'd keep veto power as the owner. I could help with maintenance since I'm interested in the ijson project (and asyncio support), but I guess there are others who could help as well. Personally, I really don't want to fork and maintain the fork, since that is counterproductive. Better to consolidate the community around a single project and make it perfect... or good enough to serve its goals.

What do you think about all of this?

cowbert commented 6 years ago

Just fork it already. That's one of the core principles of open source, and tools like git (originally designed for forking the Linux kernel) and GitHub make it easy for that reason. You can even publish the fork to PyPI after proving you can maintain it. There is nothing counterproductive about forking if you keep your fork relatively pull-request compatible (again, GitHub makes this easy: you just submit pull requests and it will lint them for you).

isagalaev commented 4 years ago

The project has moved over to a new maintainer: https://github.com/rtobar/ijson. Please reopen this issue there if it's still relevant.