tomnomnom / gron

Make JSON greppable!
MIT License
13.87k stars 328 forks

Better handling of big json files #21

Closed srwilson closed 8 years ago

srwilson commented 8 years ago

Currently, running gron on large JSON files is very slow. For example, a 40MB file takes over a minute:

> time gron big.json > foo

real    1m28.850s
user    1m37.038s
sys     0m2.333s

My guess is that the time goes into the sorting phase. Would it be possible to avoid sorting altogether? Maybe doing a streaming decode of the JSON would help too.

At the very least, it should be possible to disable sorting via a command-line option.

tomnomnom commented 8 years ago

That sounds like a reasonable guess. I'll do some profiling and see what crops up.

Would you be able to share the source of your big JSON file so I can get a reasonable comparison?

srwilson commented 8 years ago

I can't share the one I originally ran, but here's a Python script I made to create a file:

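import json

# One small object repeated a million times under a single "data" key
d = {"a": "a", "b": "b", "c": "c"}
dd = [d] * 1000000
# print() with parentheses works on both Python 2 and Python 3
print(json.dumps({"data": dd}))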

That makes a 31MB file that did even worse:

real    2m44.392s
user    2m58.819s
sys     0m5.225s

tomnomnom commented 8 years ago

Great, thanks!

tomnomnom commented 8 years ago

@srwilson I'm not done yet, but I've made some changes in 2e2114bc that should help you a bit.

There are a couple of really minor speedups here and there, but the two main things are:

  1. The sorting no longer bothers stripping color codes from the statements if you're using --monochrome
  2. There's a --no-sort option that, somewhat predictably, disables sorting
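To illustrate the first point: sorting colored output means comparing statements with their escape codes removed, and that stripping cost is paid for every statement. The sketch below is in Python, not gron's Go code; the names `ANSI_COLOR`, `strip_colors`, and `sort_statements` are hypothetical, and the regex only covers simple ANSI color sequences.

```python
import re

# Matches ANSI color sequences such as "\x1b[31m" or "\x1b[0m" (assumed format).
ANSI_COLOR = re.compile(r"\x1b\[[0-9;]*m")


def strip_colors(statement):
    """Return the statement with ANSI color sequences removed."""
    return ANSI_COLOR.sub("", statement)


def sort_statements(statements, monochrome):
    """Sort statements by their visible text."""
    if monochrome:
        # No escape codes are present, so the strings compare directly.
        return sorted(statements)
    # Colored output: compare on the text with escape codes stripped.
    return sorted(statements, key=strip_colors)


colored = ["\x1b[31mjson.b\x1b[0m = 2;", "\x1b[34mjson.a\x1b[0m = 1;"]
result = sort_statements(colored, monochrome=False)
print([strip_colors(s) for s in result])
```

Without the key function, the two statements above would be ordered by their color codes rather than by their text, which is why the stripping is needed at all when colors are on.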

Using a JSON file generated from your Python script:

tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron testdata/verybig.json > /dev/null

real    2m23.393s
user    2m33.124s
sys     0m0.932s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --monochrome testdata/verybig.json > /dev/null

real    0m35.218s
user    0m37.208s
sys     0m0.680s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort testdata/verybig.json > /dev/null

real    0m13.636s
user    0m15.208s
sys     0m0.680s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort --monochrome testdata/verybig.json > /dev/null

real    0m8.768s
user    0m10.148s
sys     0m0.632s

The --no-sort option is the major win, but even with sorting enabled, things are much more acceptable than they were when using --monochrome. There's a little extra that can be done there too: the --monochrome flag could be forced when the output isn't a TTY, rather than having to be specified manually.
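The TTY check mentioned here can be sketched roughly as follows. This is a Python illustration under stated assumptions, not gron's Go implementation; `use_color` and its parameters are hypothetical names.

```python
import io
import sys


def use_color(stream, monochrome_flag):
    """Decide whether to emit color codes on the given output stream."""
    if monochrome_flag:
        # An explicit --monochrome style flag always wins.
        return False
    # isatty() is False for pipes, redirected files, and in-memory buffers.
    return stream.isatty()


# An in-memory buffer stands in for redirected output here.
print(use_color(io.StringIO(), monochrome_flag=False))
# The result for sys.stdout depends on how the script is run.
print(use_color(sys.stdout, monochrome_flag=False))
```

With a check like this, `gron file.json > out` would skip colorizing without the user having to pass the flag.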

I've had a think about using a streaming JSON parser, but you could only use it when --no-sort is in use, and it might make for significant complications in other parts of the code. I might do a bit of a POC to see how bad it would be though.
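A rough way to see why streaming only pairs with --no-sort: statements produced in document order can be written out as soon as they exist, while sorted output needs every statement collected first. The Python sketch below models only the output side (`json.loads` still reads the whole document, so this is not a streaming parser); `statements` is a hypothetical helper, and it assumes object keys are simple identifiers.

```python
import json


def statements(value, path="json"):
    """Yield gron-style assignment statements in traversal order."""
    if isinstance(value, dict):
        yield path + " = {};"
        for key, child in value.items():
            # Assumption: keys are simple identifiers; others are not handled.
            for line in statements(child, path + "." + key):
                yield line
    elif isinstance(value, list):
        yield path + " = [];"
        for index, child in enumerate(value):
            for line in statements(child, "%s[%d]" % (path, index)):
                yield line
    else:
        yield "%s = %s;" % (path, json.dumps(value))


doc = json.loads('{"data": [{"b": "b", "a": "a"}]}')

# Unsorted: each statement can be printed the moment it is generated.
unsorted_lines = list(statements(doc))

# Sorted: nothing can be printed until the whole list has been built.
sorted_lines = sorted(unsorted_lines)
print("\n".join(sorted_lines))
```

In the unsorted case the generator never needs more than the current path in memory, which is the property a streaming decoder would rely on.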

There's still more to be done to make things better so I'm not going to close this issue right now.

In the meantime, I've tagged and released what I've done so far as 0.3.4.

Thanks again!

tomnomnom commented 8 years ago

@srwilson nothing's tagged yet, but I thought you might be interested to know I've made some pretty big changes to gron's inner workings to make the sort more efficient (ec6e312).

The outcome is that worst-case performance (colors and sorting enabled) is now around 5 times better.

The slightly unfortunate thing is that the best-case performance (monochrome, no sorting) is slightly worse - mostly because of an increased number of allocations. Thankfully the massive refactor opens up new avenues for meaningful optimisation now that the sorting doesn't dominate quite so much.

Here are the same tests from above, repeated with a build from master:

tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron ~/tmp/big.json > /dev/null

real    0m28.844s
user    0m34.744s
sys     0m1.204s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --monochrome ~/tmp/big.json > /dev/null

real    0m22.123s
user    0m27.708s
sys     0m1.084s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort ~/tmp/big.json > /dev/null

real    0m18.683s
user    0m24.720s
sys     0m1.180s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort --monochrome ~/tmp/big.json > /dev/null

real    0m12.171s
user    0m17.404s
sys     0m1.072s

tomnomnom commented 8 years ago

A few commits later and I've made some more improvements. I removed some unnecessary copies and made monochrome mode the default when the output isn't a terminal:

tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron ~/tmp/big.json > /dev/null

real    0m15.914s
user    0m17.804s
sys     0m1.280s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort ~/tmp/big.json > /dev/null

real    0m8.471s
user    0m10.928s
sys     0m1.216s

That puts the worst case with stdout redirected at about 9 times faster, and the best case (i.e. with --no-sort) at about 17 times faster than when this issue was raised.
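The ratios can be checked from the wall-clock ("real") times in this thread. The short Python calculation below assumes the baseline is the 2m23.393s default run with stdout redirected from the earlier comment; `seconds` is just a helper for this sketch.

```python
def seconds(minutes, secs):
    """Convert a `time`-style reading such as 2m23.393s to seconds."""
    return minutes * 60 + secs


baseline = seconds(2, 23.393)  # assumed baseline: default flags, stdout redirected
worst = seconds(0, 15.914)     # latest build, default flags
best = seconds(0, 8.471)       # latest build, --no-sort

print("worst case: %.1fx faster" % (baseline / worst))
print("best case:  %.1fx faster" % (baseline / best))
```

This prints roughly 9.0x and 16.9x, in line with the "about 9 times" and "about 17 times" figures.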

I'm going to consider the issue 'fixed', although I will continue to make things faster.

I've released all the changes as 0.3.6.

@srwilson thanks again for your input!