python / cpython

The Python programming language
https://www.python.org

JSON streaming #84803

Open 91e69f45-91d9-4b12-87db-a02908296c81 opened 4 years ago

91e69f45-91d9-4b12-87db-a02908296c81 commented 4 years ago
BPO 40623
Nosy @serhiy-storchaka
Files
  • jsonstream.py: json stream reading function
  • jsonstream.py: same as above but with explanatory comment added
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['type-feature', 'library']
    title = 'JSON streaming'
    updated_at =
    user = 'https://bugs.python.org/phr'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'phr'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'phr'
    dependencies = []
    files = ['49153', '49154']
    hgrepos = []
    issue_num = 40623
    keywords = []
    message_count = 5.0
    messages = ['368823', '368824', '368826', '368827', '368828']
    nosy_count = 2.0
    nosy_names = ['phr', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue40623'
    versions = []
    ```

    91e69f45-91d9-4b12-87db-a02908296c81 commented 4 years ago

    This is a well-explored issue in other contexts: https://en.wikipedia.org/wiki/JSON_streaming

    There is also a patch for it in json.tool, for release in 3.9: https://bugs.python.org/issue31553

    Basically, it's often convenient to have a file containing a list of JSON docs, one per line. However, there is no convenient way to read them back in one by one, since json.load(filehandle) chokes with an "Extra data" error as soon as it reaches the second doc after the first one.

    It would be great if the json module itself had a function to handle this. I have an awful hack that I use myself, that is not suitable for a production library, but I'll attach it to show what functionality I'm suggesting. I hope this is simple enough to not need a PEP. Thanks!
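
    For illustration, a minimal reproduction of the failure described above (the file contents here are made up for the example and are not from the attached jsonstream.py):

        import io
        import json

        # Two JSON docs, one per line, in a single stream.
        buf = io.StringIO('{"a": 1}\n{"b": 2}\n')

        # json.load() parses the first doc and then fails on the rest:
        # json.JSONDecodeError: Extra data: line 2 column 1
        json.load(buf)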

    91e69f45-91d9-4b12-87db-a02908296c81 commented 4 years ago

    Note: the function in my attached file expects no separation at all between the JSON docs (rather than a newline between them), but that was fine for the application I wrote it for some time back. I forgot about that when first writing this RFE, so I thought I'd better clarify.
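
    For the record, one way to parse back-to-back docs with no separator is json.JSONDecoder.raw_decode, which returns the decoded object together with the index just past it. This is only a sketch of that general technique, not the contents of the attached file:

        import json

        def iter_concatenated_json(text):
            # Yield each JSON document from a string of documents written
            # back to back, with or without whitespace between them.
            decoder = json.JSONDecoder()
            idx = 0
            end = len(text)
            while idx < end:
                # Skip whitespace between documents.
                while idx < end and text[idx].isspace():
                    idx += 1
                if idx >= end:
                    break
                obj, idx = decoder.raw_decode(text, idx)
                yield obj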

    serhiy-storchaka commented 4 years ago

    If you want to read JSON objects encoded one per line (JSON Lines or NDJSON), you can do this with just two lines of code:

        for line in file:
            yield json.loads(line)

    This format is not formally standardized, but it is popular because supporting it in any programming language is trivial.

    If you want to use a more complex format, I'm afraid it is not popular enough to be supported in the stdlib. You can try to find a third-party library that supports your flavour of multi-object JSON, or write your own code if the format is specific to your application.

    91e69f45-91d9-4b12-87db-a02908296c81 commented 4 years ago

    It's coming back to me: I think I used the no-separator format because I made the multi-document input files by calling json.dump after opening the file in append mode. That seems pretty natural. I figured the Wikipedia article and the json.tool patch mentioned above were evidence that there is interest in this. The approach of writing newlines between the docs and iterating through the lines is probably workable, though; I don't know why I didn't do that before. I might not have been sure that serialized JSON docs never contain literal newlines.
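
    For what it's worth, a minimal sketch of that newline-separated append approach (the function name and file handling are illustrative, not from the stdlib):

        import json

        def append_json_line(path, obj):
            # Without the indent parameter, json.dumps never emits a literal
            # newline, so "\n" is a safe separator between documents.
            with open(path, "a", encoding="utf-8") as f:
                f.write(json.dumps(obj) + "\n")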

    Really it would be nice if json.load could read in anything that json.dump could write out (including with the indent parameter), but that's potentially more complicated and might conflict with the json spec.
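
    To illustrate the complication: with indent, each document spans several lines, so per-line reading no longer works, although a whitespace-tolerant reader such as the raw_decode sketch earlier in this thread still does (the names below are illustrative):

        import json

        docs = [{"a": 1}, {"b": [2, 3]}]
        text = "".join(json.dumps(d, indent=2) + "\n" for d in docs)

        # "one json.loads per line" fails on this text, but a raw_decode-based
        # reader (e.g. iter_concatenated_json above) recovers both documents:
        # list(iter_concatenated_json(text)) == docs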

    91e69f45-91d9-4b12-87db-a02908296c81 commented 4 years ago

    Also, I didn't know about NDJSON (I just looked at it, ndjson.org), but its existence and formalization are even more evidence that this is useful. I'll check what the two Python modules linked from that site do differently from your example of iterating through the file by lines.