Closed srwilson closed 8 years ago
That sounds like a reasonable guess. I'll do some profiling and see what crops up.
Would you be able to share the source of your big JSON file so I can get a reasonable comparison?
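Can't share the one I originally ran, but here's a Python script I made to create a file:
import json
# One small object, repeated a million times.
d = {"a": "a", "b": "b", "c": "c"}
dd = [d] * 1000000
# print() called as a function so the script also runs under Python 3.
print(json.dumps({"data": dd}))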
That makes a 31MB file that did even worse:
real 2m44.392s
user 2m58.819s
sys 0m5.225s
Great, thanks!
@srwilson I'm not done yet, but I've made some changes in 2e2114bc that should help you a bit.
There are a couple of really minor speedups here and there, but the two main things are a --monochrome option, which disables colored output, and a --no-sort option that, somewhat predictably, disables sorting.
Using a JSON file generated from your Python script:
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron testdata/verybig.json > /dev/null
real 2m23.393s
user 2m33.124s
sys 0m0.932s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --monochrome testdata/verybig.json > /dev/null
real 0m35.218s
user 0m37.208s
sys 0m0.680s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort testdata/verybig.json > /dev/null
real 0m13.636s
user 0m15.208s
sys 0m0.680s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort --monochrome testdata/verybig.json > /dev/null
real 0m8.768s
user 0m10.148s
sys 0m0.632s
The --no-sort option is the major win, but even with sorting left on, --monochrome makes things more acceptable than they were. There's a little extra that can be done there too: the --monochrome flag could be forced when the output isn't a TTY, rather than having to be specified manually.
I've had a think about using a streaming JSON parser, but you could only use it when --no-sort is in use, and it might make for significant complications in other parts of the code. I might do a bit of a POC to see how bad it would be though.
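To show why streaming only pays off without sorting, here is a simplified Python sketch of the output side. It is an assumption-laden illustration, not gron's implementation: `statements` and `emit` are hypothetical names, keys are assumed to be plain identifiers, a plain string sort stands in for whatever ordering gron really uses, and the input is still fully parsed by `json.loads` rather than by a streaming decoder.

```python
import json


def statements(value, path="json"):
    """Lazily yield assignment-style lines for a parsed JSON value, depth first."""
    if isinstance(value, dict):
        yield f"{path} = {{}};"
        for key, child in value.items():
            yield from statements(child, f"{path}.{key}")
    elif isinstance(value, list):
        yield f"{path} = [];"
        for index, child in enumerate(value):
            yield from statements(child, f"{path}[{index}]")
    else:
        yield f"{path} = {json.dumps(value)};"


def emit(value, sort=True):
    """Return the output lines for a value.

    Sorting needs every line in memory before the first can be written;
    without sorting, lines can be written as soon as they are produced.
    """
    lines = statements(value)
    if sort:
        return sorted(lines)  # materialises the whole list
    return lines              # stays a lazy generator


if __name__ == "__main__":
    doc = json.loads('{"b": [1, 2], "a": "x"}')
    for line in emit(doc, sort=False):
        print(line)
```

With `sort=False` the generator is consumed one line at a time, which is the property a streaming decoder could exploit; with `sort=True` the full list has to exist first, so streaming the input would not reduce memory use.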
There's still more to be done to make things better so I'm not going to close this issue right now.
In the meantime I've tagged and released what I've done so far as 0.3.4
Thanks again!
@srwilson nothing's tagged yet, but I thought you might be interested to know I've made some pretty big changes to gron's inner workings to make the sort more efficient (ec6e312).
The outcome is that worst-case performance (colors and sorting enabled) is now around 5 times better.
The slightly unfortunate thing is that the best-case performance (monochrome, no sorting) is slightly worse - mostly because of an increased number of allocations. Thankfully the massive refactor opens up new avenues for meaningful optimisation now that the sorting doesn't dominate quite so much.
Here are the same tests from above, repeated with a build from master:
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron ~/tmp/big.json > /dev/null
real 0m28.844s
user 0m34.744s
sys 0m1.204s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --monochrome ~/tmp/big.json > /dev/null
real 0m22.123s
user 0m27.708s
sys 0m1.084s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort ~/tmp/big.json > /dev/null
real 0m18.683s
user 0m24.720s
sys 0m1.180s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort --monochrome ~/tmp/big.json > /dev/null
real 0m12.171s
user 0m17.404s
sys 0m1.072s
A few commits later and I've made some more improvements. I removed some unnecessary copies and made monochrome mode kick in automatically when the output isn't a terminal:
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron ~/tmp/big.json > /dev/null
real 0m15.914s
user 0m17.804s
sys 0m1.280s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort ~/tmp/big.json > /dev/null
real 0m8.471s
user 0m10.928s
sys 0m1.216s
That puts the worst case when stdout is redirected at about 9 times better, and the best case (i.e. with --no-sort) about 17 times better than when this issue was raised.
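The ratios can be checked from the wall-clock ("real") figures posted in this thread. The sketch below assumes the baseline is the first 2m23.393s run (sorting and colours on); the helper name `seconds` is made up for the example.

```python
def seconds(t):
    """Convert a `time` value such as '2m23.393s' to seconds."""
    minutes, rest = t.split("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))


baseline = seconds("2m23.393s")   # first run in this thread, sorting and colours on
worst_now = seconds("0m15.914s")  # latest build, sorting on, stdout redirected
best_now = seconds("0m8.471s")    # latest build, --no-sort, stdout redirected

worst_speedup = baseline / worst_now
best_speedup = baseline / best_now

print(f"worst case: {worst_speedup:.1f}x faster")
print(f"best case:  {best_speedup:.1f}x faster")
```

Both results land close to the "about 9 times" and "about 17 times" figures quoted above.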
I'm going to consider the issue 'fixed', although I will continue to make things faster.
I've released all the changes as 0.3.6.
@srwilson thanks again for your input!
Currently, running gron on large JSON files is very slow. For example, a 40MB file takes over a minute:
My guess is it's in the sorting phase. Would it be possible to avoid sorting altogether? Maybe doing a streaming decode of the JSON would be helpful too.
At the very least, it should be possible to disable sorting via a command line option.