tomnomnom / gron

Make JSON greppable!

Adds "decode all" option #92

Open rjp opened 2 years ago

rjp commented 2 years ago

Fixes #70 (implicitly), #23. May also have an impact on the "high memory usage" issues but I'm doing more testing there.

Adds: -a, -all flag which means "decode all the objects, pretending it's a JSON stream even if it's not actually."

Rationale: gron only decodes the first object, and gron -s requires a "correctly" formatted JSON stream (one object per line), but it's not uncommon to get multiple objects per line from tools that don't support JSON stream formatting.

This does require a positionable stream, however, since the JSON decoder can read past the end of an object to be sure it's parsed correctly. io.Seeker doesn't work, unfortunately: whilst we know where we want to be (d.InputOffset()), we don't actually know where we currently are, which precludes io.SeekCurrent, and, bizarrely, it turns out that io.SeekSet gets progressively slower as you seek further and further into your (in this case) bytes.Buffer.

Thus we keep track of where we want to be (moved) and create a bytes.NewReader at the correct position for each attempted decode. Crufty, definitely, and memory-allocation heavy, probably, but it works and is surprisingly not that bad even on large files.
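A minimal sketch of that approach, assuming the whole input is already held in a byte slice; the names here (decodeAll, moved) are illustrative and not gron's actual code:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
)

// decodeAll decodes every top-level JSON value in data, even when the values
// are simply concatenated rather than newline-separated.
func decodeAll(data []byte) ([]interface{}, error) {
    var out []interface{}
    var moved int64 // absolute offset of the next undecoded byte

    for moved < int64(len(data)) {
        // Fresh reader positioned where we want to resume; this is the
        // allocation-heavy part mentioned above.
        d := json.NewDecoder(bytes.NewReader(data[moved:]))

        var v interface{}
        if err := d.Decode(&v); err != nil {
            if err == io.EOF {
                break // nothing but trailing whitespace left
            }
            return nil, err
        }
        out = append(out, v)

        // InputOffset is relative to this reader, so advance our
        // absolute position by however much the decoder consumed.
        moved += d.InputOffset()
    }
    return out, nil
}

func main() {
    objs, err := decodeAll([]byte(`{"a":1} {"b":2}[3,4]`))
    if err != nil {
        panic(err)
    }
    fmt.Println(len(objs), objs) // 3 [map[a:1] map[b:2] [3 4]]
}

Each iteration allocates a fresh reader, which is where the extra allocations come from, but d.InputOffset() tells us exactly how far to advance for the next attempt.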

My test input, an 85MB single-line JSON file, takes ~64s (x86_64) / ~43s (arm64) and ~275M of memory to parse into 1024 objects comprising 1GB of output text. Compare to jq: ~25s (x86_64) / ~11s (arm64), using ~630M and giving 350MB of output.

milahu commented 2 years ago

decode all the objects

should be the default; what else do we need all the (non-option) argv for?

also, jq takes multiple input files:

jq [options...] filter [files...]

... like many other unix/gnu tools, so they play nicely with xargs; for example,

find . -name '*.json' -print0 | xargs -0 gron

expands to gron a.json b.json c.json d.json

Adds: -a, -all flag which means "decode all the objects, pretending it's a JSON stream even if it's not actually."

could be parsed as an array of json documents, as suggested in https://github.com/tomnomnom/gron/issues/28#issuecomment-915170293

$ gron <( echo '{ "hello": "world" }' ) <( echo '{ "hello2": "world2" }' )
file = [];
file[0] = {};
file[0].hello = "world";
file[1] = {};
file[1].hello2 = "world2";

using filenames would look weird in this example

$ echo <( echo '{ "hello": "world" }' )
/dev/fd/63

... but filenames could be enabled with a -H option → #72 (or disabled with a -h option)

$ gron -H a.json b.json
file = {};
file["a.json"] = {};
file["a.json"].hello = "world";
file["b.json"] = {};
file["b.json"].hello = "world";

rjp commented 2 years ago

what else do we need all the (non-option) argv for?

Ah, this is "decode all the objects in the input", not "decode all the objects in the command-line arguments", because I have things that output multiple objects in a single file in a non-stream format, which I needed to decode.

But yes, iterating over the arguments does make sense if only for xargs usage.
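For what it's worth, here's a minimal sketch of that argument handling, assuming each file named on the command line is processed in turn with stdin as the fallback; none of this is gron's actual code, and the final print is just a placeholder for the real flattening step:

package main

import (
    "encoding/json"
    "fmt"
    "os"
)

func main() {
    args := os.Args[1:]
    if len(args) == 0 {
        // No filenames given: behave as a filter and read stdin
        // (assumes a platform that provides /dev/stdin).
        args = []string{"/dev/stdin"}
    }

    for _, name := range args {
        data, err := os.ReadFile(name)
        if err != nil {
            fmt.Fprintf(os.Stderr, "gron: %v\n", err)
            os.Exit(1)
        }
        var v interface{}
        if err := json.Unmarshal(data, &v); err != nil {
            fmt.Fprintf(os.Stderr, "gron: %s: %v\n", name, err)
            os.Exit(1)
        }
        // Real gron would flatten v into assignment statements here;
        // this sketch just reports what it decoded.
        fmt.Printf("%s: decoded a %T\n", name, v)
    }
}

With that shape, the find ... | xargs -0 gron example above works as expected.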

milahu commented 2 years ago

oops, i confused this issue with #28

Adds: -a, -all flag which means "decode all the objects, pretending it's a JSON stream even if it's not actually."

now it makes sense to hide this feature behind a flag, as {"a":1}{"b":2} is an invalid json document
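for a quick illustration of why (a sketch only, not gron code): Go's strict single-document parse rejects the concatenated form, so the looser behaviour stays opt-in behind -a

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    var v interface{}
    err := json.Unmarshal([]byte(`{"a":1}{"b":2}`), &v)
    fmt.Println(err) // e.g. "invalid character '{' after top-level value"
}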