Open 00dani opened 8 years ago
I agree that ideally the stream should be split on every properly terminated (sub)document. This issue is currently documented as a known limitation (only arrays and objects are supported at the top level).
Currently the splitter uses a very basic lexer. Unfortunately there is no quick fix that would maintain the current level of performance.
Pull requests are always welcome :)
@00dani I wrote a partial fragment tokenizer for Python statements and expressions as part of my template engine work on cinje. With one interesting edge case (`null [5]`) this appears to split as you would desire.
Admittedly, these are Python fragments, using Python's internal AST representation, not JSON. (Thus the oddity.) Conveying an approach, @rickardp 😉
```python
>>> splitexpr('True False None [5] None True False True False {"a": 6}')
['True', 'False', 'None [5]', 'None', 'True', 'False', 'True', 'False', '{"a": 6}']
```
Edit, to name this approach: "maximal syntactically valid substring matching". And to note that the JSON version (lower-case `true`, `false`, &c.) would also tokenize just fine; those are valid Python symbols, even if they're not the correct ones for those singletons. Edit edit: this could implement parsing, too, by invoking `literal_eval` across the isolated fragments. JSON is valid Python, after all. 😜
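A minimal sketch of that "maximal syntactically valid substring matching" idea, using `ast.parse` as the validity oracle. This is a hypothetical reimplementation for illustration, not the cinje tokenizer:

```python
import ast

def splitexpr(s):
    """Greedily split s into maximal syntactically valid Python expressions.

    At each position, the longest prefix that parses as a single
    expression wins, which is why 'None [5]' stays together (it parses
    as a subscription) while 'True False' splits. O(n^2) worst case;
    a sketch, not a performance-minded lexer.
    """
    out, i, n = [], 0, len(s)
    while i < n:
        if s[i].isspace():  # skip whitespace between fragments
            i += 1
            continue
        match = None
        for j in range(n, i, -1):  # try the longest candidate first
            candidate = s[i:j].rstrip()
            if not candidate:
                continue
            try:
                ast.parse(candidate, mode='eval')
            except SyntaxError:
                continue
            match = candidate
            i = j
            break
        if match is None:  # nothing starting here is a valid expression
            raise SyntaxError('unsplittable fragment at offset %d' % i)
        out.append(match)
    return out
```

The lower-case JSON keywords tokenize fine too, since `true`, `false`, and `null` are syntactically valid Python names.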
It is valid for a JSON text to represent only a single scalar value, rather than an object or array; this is supported by Python's `json` module. However, a stream containing such texts will not be split correctly by `splitstream`: the keywords `true`, `false`, and `null` are silently dropped, as are numeric literals. Attempting to insert a string literal causes different, still incorrect behaviour. If there are no objects or arrays in the stream, the text is still silently dropped; however, if an object or array occurs somewhere after the string, the entire stream up to that object or array is captured as one buffer.
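To make the first claim concrete, a stdlib-only check that the `json` module itself is perfectly happy with scalar top-level texts:

```python
import json

# Python's json module accepts a bare scalar as a complete JSON text.
assert json.loads('true') is True
assert json.loads('false') is False
assert json.loads('null') is None
assert json.loads('42') == 42
assert json.loads('"hello"') == 'hello'
```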
Attempting to parse these buffers with `json.loads`, naturally, does not work. The correct behaviour would be to split the stream on every top-level JSON value, producing separate buffers for each; in other words:
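A sketch of the desired behaviour, built on the stdlib's `json.JSONDecoder.raw_decode` rather than on splitstream's lexer (illustrative only; `raw_decode` needs the whole text in memory, unlike splitstream's streaming design):

```python
import json

def split_values(text):
    """Split concatenated top-level JSON texts into separate buffers,
    including bare scalars like true, null, 42, and "hi"."""
    decoder = json.JSONDecoder()
    buffers, idx, n = [], 0, len(text)
    while idx < n:
        # raw_decode does not skip leading whitespace itself, so we do.
        while idx < n and text[idx] in ' \t\r\n':
            idx += 1
        if idx >= n:
            break
        value, end = decoder.raw_decode(text, idx)
        buffers.append(text[idx:end])
        idx = end
    return buffers
```

Each returned buffer parses cleanly on its own with `json.loads`, which is exactly the property the dropped scalars currently lack.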