Open 91e69f45-91d9-4b12-87db-a02908296c81 opened 4 years ago
This is a well-explored issue in other contexts: https://en.wikipedia.org/wiki/JSON_streaming
There is also a patch for it in json.tool, for release in 3.9: https://bugs.python.org/issue31553
Basically it's often convenient to have a file containing a list of json docs, one per line. However, there is no convenient way to read them back in one by one, since json.load(filehandle) raises an error when it finds extra data after the end of the first document.
It would be great if the json module itself had a function to handle this. I have an awful hack that I use myself, that is not suitable for a production library, but I'll attach it to show what functionality I'm suggesting. I hope this is simple enough to not need a PEP. Thanks!
Note: the function in my attached file wants no separation at all between the json docs (rather than a newline between them), but that was ok for the application I wrote it for some time back. I forgot about that when first writing this RFE, so I thought I'd better clarify.
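For the no-separator variant, one way to pull documents out of a concatenated stream with only the stdlib is json.JSONDecoder.raw_decode, which parses one document and reports the index where it stopped. A rough sketch (the function name is made up for illustration):

```python
import json

def iter_concatenated_json(text):
    """Yield each document from a string of back-to-back JSON docs."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # Skip any whitespace between documents.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        # raw_decode returns (object, index just past the document).
        obj, idx = decoder.raw_decode(text, idx)
        yield obj
```

Because it tolerates whitespace between documents, the same loop also handles newline-separated docs; the trade-off is that it wants the whole text in memory rather than reading incrementally from a file handle.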
If you want to read json objects encoded one per line (JSON Lines or NDJSON), you can do this with just two lines of code:
```python
for line in file:
    yield json.loads(line)
```
This format is not formally standardized, but it is popular because its support in any programming language is trivial.
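For completeness, a self-contained version of that loop, with a hypothetical helper name and a StringIO standing in for a real file:

```python
import io
import json

def read_json_lines(fileobj):
    """Yield one parsed document per line (JSON Lines / NDJSON)."""
    for line in fileobj:
        if line.strip():  # tolerate blank lines
            yield json.loads(line)

# Simulate a file with one JSON document per line.
data = io.StringIO('{"a": 1}\n{"b": 2}\n')
docs = list(read_json_lines(data))
```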
If you want to use a more complex format, I'm afraid it is not popular enough to be supported in the stdlib. You can try to search for a third-party library which supports your flavour of multi-object JSON format, or write your own code if this format is specific to your application.
It's coming back to me: I think I used the no-separator format because I made the multi-document input files by calling json.dump after opening the file in append mode. That seems pretty natural. I figured the Wikipedia article and the json.tool patch just released were evidence that there is interest in this. The approach of writing newlines between the docs and iterating through lines is probably workable though. I don't know why I didn't do that before. I might not have been sure that json docs never contain newlines.
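A sketch of that round trip, assuming a hypothetical append_json_line helper that writes a newline after each json.dump so the file stays iterable line by line (json.dump with default settings never emits newlines itself):

```python
import json
import os
import tempfile

def append_json_line(path, obj):
    """Hypothetical helper: dump one document plus a trailing newline."""
    with open(path, "a", encoding="utf-8") as f:
        json.dump(obj, f)
        f.write("\n")

# Round trip through a temporary file.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    append_json_line(path, {"a": 1})
    append_json_line(path, [2, 3])
    with open(path, encoding="utf-8") as f:
        docs = [json.loads(line) for line in f]
finally:
    os.remove(path)
```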
Really it would be nice if json.load could read in anything that json.dump could write out (including with the indent parameter), but that's potentially more complicated and might conflict with the json spec.
Also I didn't know about ndjson (I just looked at it, ndjson.org) but its existence and formalization is even more evidence that this is useful. I'll check what the two different python modules linked from that site do that's different from your example of iterating through the file by lines.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'library']
title = 'JSON streaming'
updated_at =
user = 'https://bugs.python.org/phr'
```
bugs.python.org fields:
```python
activity =
actor = 'phr'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'phr'
dependencies = []
files = ['49153', '49154']
hgrepos = []
issue_num = 40623
keywords = []
message_count = 5.0
messages = ['368823', '368824', '368826', '368827', '368828']
nosy_count = 2.0
nosy_names = ['phr', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue40623'
versions = []
```