rickardp / splitstream

Continuous object splitter for C and Python
Apache License 2.0

JSON scalar texts in the stream are not captured properly #3

Open 00dani opened 8 years ago

00dani commented 8 years ago

It is valid for a JSON text to represent only a single scalar value rather than an object or array; this is supported by Python's json module:

>>> import json
>>> json.loads('true')
True
>>> json.loads('false')
False
>>> json.loads('"an example"')
'an example'

However, a stream containing such texts will not be split correctly by splitstream. The keywords true, false, and null are silently dropped, as are numeric literals:

>>> import io; from splitstream import splitfile
>>> split_buf = lambda data: list(splitfile(io.BytesIO(data), format='json'))
>>> split_buf(b'true false null [5] null true false true false {"a": 6}')
[b'[5]', b'{"a": 6}']
>>> split_buf(b'4 5 6 7 []')
[b'[]']

Inserting a string literal causes different, but still incorrect, behaviour. If there are no objects or arrays in the stream, the text is still silently dropped. However, if an object or array occurs somewhere after the string, the entire stream up to and including that object or array is captured as one buffer.

>>> split_buf(b'"abc" 56 "def"')
[]
>>> split_buf(b'"abc" 56 "def" {} 3 4')
[b'"abc" 56 "def" {}']
>>> split_buf(b'"abc" 56 "def" {} 3 4 "5" 6 7 []')
[b'"abc" 56 "def" {}', b' 3 4 "5" 6 7 []']

Attempting to parse these buffers with json.loads, naturally, does not work.
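For concreteness, json.loads rejects such a buffer because it accepts exactly one JSON document, and everything after the first value counts as trailing garbage (a small demonstration; the exact error message may vary by Python version):

```python
import json

# One of the buffers splitstream produced above: the first value
# '"abc"' parses, but the remainder is rejected as extra data.
try:
    json.loads('"abc" 56 "def" {}')
except json.JSONDecodeError as exc:
    print('rejected:', exc.msg)
```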

The correct behaviour would be to split the stream on every top-level JSON value, producing a separate buffer for each. In other words:

>>> fixed_split_buf(b'true false null 1 "hello world" ["goodbye", "world"] {"a": 12, "b": [null]}')
[b'true', b'false', b'null', b'1', b'"hello world"', b'["goodbye", "world"]', b'{"a": 12, "b": [null]}']
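For anyone needing this behaviour today, the desired splitting can be approximated in pure Python with json.JSONDecoder.raw_decode, which parses one complete value and reports where it ended. A minimal sketch, assuming the whole stream is already in memory as a str (split_values is a hypothetical helper, not part of splitstream):

```python
import json

def split_values(text):
    """Yield each top-level JSON value in *text* as its own string."""
    decoder = json.JSONDecoder()
    idx, end = 0, len(text)
    while idx < end:
        # raw_decode() does not tolerate leading whitespace, so skip it.
        while idx < end and text[idx] in ' \t\r\n':
            idx += 1
        if idx == end:
            break
        # raw_decode() parses one complete value (scalar, array, or
        # object) and returns the index just past its last character.
        _, stop = decoder.raw_decode(text, idx)
        yield text[idx:stop]
        idx = stop

print(list(split_values('true false null 1 "hello world" [5]')))
# → ['true', 'false', 'null', '1', '"hello world"', '[5]']
```

Unlike splitstream proper, this reads from a str rather than a file object and does no incremental buffering, so it is only a workaround, not a fix.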
rickardp commented 8 years ago

I agree that ideally the stream should be split on every properly terminated (sub)document. This issue is currently documented as a known limitation (only arrays and objects are supported at the top level).

Currently the splitter uses a very basic lexer for splitting. Unfortunately, there is no quick fix that maintains the current level of performance.

Pull requests are always welcome :)

amcgregor commented 2 years ago

@00dani I wrote a partial fragment tokenizer for Python statements and expressions as part of my template engine work on cinje. Apart from one interesting edge case (null [5]), this appears to split as you would desire.

Admittedly these are Python fragments, using Python's internal AST representation, not JSON (thus the oddity). But it conveys an approach, @rickardp 😉

>>> splitexpr('True False None [5] None True False True False {"a": 6}')
['True', 'False', 'None [5]', 'None', 'True', 'False', 'True', 'False', '{"a": 6}']

Edit, to name this approach: "maximal syntactically valid substring matching". Note also that the JSON version (lower-case true, false, &c.) would tokenize just fine; those are valid Python symbols, even if they're not the correct ones for those singletons.

Edit edit: this could implement parsing, too, by invoking literal_eval across the isolated fragments. JSON is valid Python, after all. 😜
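The maximal-valid-prefix idea is easy to sketch for whitespace-separated streams. What follows is a hypothetical reimplementation of splitexpr, not cinje's actual tokenizer: it assumes ast.parse as the validity oracle and a crude str.split() tokenizer, so string literals only survive if their internal whitespace is single spaces.

```python
import ast

def splitexpr(text):
    """Split *text* into maximal syntactically valid Python expressions.

    Sketch of "maximal syntactically valid substring matching": grow a
    candidate prefix token by token, remember the longest prefix that
    parses as a single expression, emit it, and continue after it.
    """
    tokens = text.split()  # crude tokenization, enough for this demo
    out = []
    while tokens:
        best = 0
        for i in range(1, len(tokens) + 1):
            try:
                ast.parse(' '.join(tokens[:i]), mode='eval')
            except SyntaxError:
                continue
            best = i  # longest valid prefix seen so far
        if best == 0:
            tokens = tokens[1:]  # no valid prefix at all; drop a token
            continue
        out.append(' '.join(tokens[:best]))
        tokens = tokens[best:]
    return out

print(splitexpr('True False None [5] None True False True False {"a": 6}'))
# → ['True', 'False', 'None [5]', 'None', 'True', 'False', 'True', 'False', '{"a": 6}']
```

This also reproduces the null [5]-style edge case: None [5] parses as a single subscript expression, so greedy maximal matching glues the two values together.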